Answer Set Programming for Non-Stationary Markov Decision Processes

05/03/2017 · by Leonardo A. Ferreira, et al. · CSIC · Universidade Metodista de São Paulo · RoboFEI

Non-stationary domains, where unforeseen changes happen, present a challenge for agents to find an optimal policy for a sequential decision making problem. This work investigates a solution to this problem that combines Markov Decision Processes (MDP) and Reinforcement Learning (RL) with Answer Set Programming (ASP) in a method we call ASP(RL). In this method, Answer Set Programming is used to find the possible trajectories of an MDP, from where Reinforcement Learning is applied to learn the optimal policy of the problem. Results show that ASP(RL) is capable of efficiently finding the optimal solution of an MDP representing non-stationary domains.


1 Introduction

John McCarthy defined Elaboration Tolerance as “the ability to accept changes to a person’s or a computer program’s representation of facts about a subject without having to start all over” McCarthy98 . An example of a real world problem that requires solutions that are tolerant to elaborations is the dynamics of urban mobility, where streets and roads are constantly reconstructed or modified. Some of these changes are planned and, thus, can be communicated in advance to the inhabitants of the city. However, unplanned changes due to natural phenomena (rain or snow, for example), or due to human actions (e.g. road accidents), may occur, causing road blocks that prevent traffic through certain routes of the city. In such cases, it is not possible to know the changes until they are observed by the agents. Even so, an agent immersed in this domain must be capable of finding the best sequence of actions, considering the new situations, but without losing all the information previously acquired.

One formalism that can be used to model the kind of situations described above is a non-stationary Markov Decision Process (MDP), where the set of states represented by observations of the environment (facts) can suffer changes over time such that states can be added to, or removed from, the decision process. As these changes may not be known a priori, the environment cannot be modelled as a stationary MDP due to the Curse of Dimensionality Bellman-Dreyfus1962 , which describes the growth in the set of states when considering the number of variables involved in the description of a state.

This work is directed towards problem solving in non-stationary domains in which not only the transition and reward functions, but also the sets of states and actions, may change during the agent’s interaction with the environment. The ASP(RL) method proposed here is able to change an MDP’s description during learning and to reuse the learnt data in the new domain with which it interacts. A consequence of using ASP(RL) is a speed-up in the search for an MDP solution, owing to the reduction that may occur in the search space.

In order to model an agent capable of interacting efficiently with non-stationary domains, we propose a method called ASP(RL) that combines Markov Decision Processes and Reinforcement Learning (RL) (Section 2.1) with Answer Set Programming (ASP) (Section 2.2). The proposed combination (Section 3) allows an agent to learn incrementally in an environment that undergoes changes. The method was analysed in a non-stationary grid world (Section 4) and experimentally evaluated against two Reinforcement Learning algorithms (Section 5).

2 Background

This section introduces Markov Decision Processes (MDP), Reinforcement Learning (RL) and Answer Set Programming (ASP), which constitute the foundations of this work.

2.1 MDP and Reinforcement Learning

In a Sequential Decision Making Problem, an agent must select a series of actions in order to find a solution to a given problem. A feasible solution, known as a policy (π), is a sequence of non-deterministic actions that leads the agent from an initial state to a goal state Bellman1952 ; Bellman-Dreyfus1962 . A problem such as this may have more than one feasible solution, thus it is possible to use Bellman’s Principle of Optimality Bellman1952 ; Bellman-Dreyfus1962 as a criterion to define which of the feasible policies can be considered the optimal policy (π*). Bellman’s Principle of Optimality states that “an optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision” Bellman-Dreyfus1962 . By this definition, an optimal policy is the one that maximises (or minimises) a desired reward/cost function.

A Markov Decision Process (MDP) Bellman1957 can be used to formalise Sequential Decision Making Problems. An MDP is defined as a tuple ⟨S, A, T, R⟩ where:

  • S is the set of states at any time step;

  • A is the set of actions allowed in the states s ∈ S;

  • T(s′ | s, a) is the transition function that gives the probability of reaching the future state s′ ∈ S by performing action a ∈ A in the current state s ∈ S;

  • R(s, a, s′) is the reward function that returns a real value for reaching a state s′ after performing an action a in a state s.

To find the optimal solution of an MDP is to find, for each state, the action that maximises the reward function. One of the methods that can be used to approximate such an optimal solution is Reinforcement Learning (RL). With RL, at each time step, a learning agent at a state s chooses an action a to be performed in the environment. After the action is performed, the agent receives its new state s′ and a reward r. This reward is used to update a value function V(s) (or an action-value function Q(s, a), depending on the method used) and the interaction continues from the new state. Given enough time, the agent is capable of approximating the (action-) value function, maximising the reward function and finding the optimal policy. One important aspect of RL methods is that the transition and reward functions are not necessarily known beforehand by the agent, but are present in the environment.
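For reference, the Principle of Optimality can be written as the standard Bellman optimality equations over the tuple ⟨S, A, T, R⟩ (textbook forms, with a discount factor γ):

    V*(s)    = max_{a ∈ A} Σ_{s′ ∈ S} T(s′ | s, a) [ R(s, a, s′) + γ V*(s′) ]
    Q*(s, a) = Σ_{s′ ∈ S} T(s′ | s, a) [ R(s, a, s′) + γ max_{a′ ∈ A} Q*(s′, a′) ]

Q-Learning approximates the fixed point of the second equation directly from samples, while SARSA approximates the analogous equation for the policy currently being followed.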

Two well-known methods of RL are SARSA Sutton-Barto15 and Q-Learning Watkins ; Sutton-Barto15 . Both are based on the concept of updating an (action-) value function considering the observations received from the environment. The main difference between them is how this update is accomplished. SARSA is an on-policy method, which means that updates to the Q function use the action actually executed by the policy being followed, while Q-Learning is an off-policy method that uses the maximum value of the next state to update the current state-action pair.
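The following sketch shows the two update rules side by side. It is a minimal illustration written for this text, not the authors’ implementation; Q is assumed to be a dictionary over state-action pairs and alpha/gamma are placeholder values, since the paper’s exact settings are not given.

    # Minimal sketch of the standard Q-Learning and SARSA updates (illustrative only).
    def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
        """Off-policy: bootstrap with the best action available in the next state."""
        old = Q.get((s, a), 0.0)
        best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
        Q[(s, a)] = old + alpha * (r + gamma * best_next - old)

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
        """On-policy: bootstrap with the action actually chosen in the next state."""
        old = Q.get((s, a), 0.0)
        Q[(s, a)] = old + alpha * (r + gamma * Q.get((s_next, a_next), 0.0) - old)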

Although Reinforcement Learning allows for learning the optimal solution of a sequential decision-making problem with non-stationary transition and reward functions (functions that may change over time), and without knowledge of the reward function, it still needs stationary sets of states and actions (sets that do not change during the interaction) in order to proceed with the learning process. In order to account for changes in the set of states, we propose the use of Answer Set Programming.

2.2 Answer Set Programming

Answer Set Programming (ASP) is a declarative non-monotonic logic programming language that has been used with great success to describe and provide solutions for NP-complete problems, such as planning and scheduling khandelwal2014 ; yang2014 . Furthermore, ASP can be used for problems with large search spaces, such as the Reaction Control System of the Space Shuttle RCS ; RCS1 ; RCS2 ; RCS3 .

An ASP program is a set of rules; each rule is composed of an atom a0 and of literals l1, …, ln, which are atoms or negated atoms. An ASP rule can be represented as a0 ← l1, …, ln, where a0 is called the head of the rule and the conjunction of literals l1, …, ln is its body. A rule is said to be positive when there is no negated atom in its body; when the body is empty, the atom a0 is said to be a fact.

Let Π be an ASP program; an answer set of Π is an interpretation that makes all the rules of this program true. This interpretation is a minimal model of the program. One important aspect of ASP is its non-monotonic semantics (based on the Stable Model Semantics Gelfond88 ), which respects the rationality principle stating that “one shall not believe anything one is not forced to believe” Gelfond88 . Along with true and false, ASP also has a third truth value for unknown.

There are two types of negation in ASP: strong (or “classical”) negation and weak negation, which in ASP represents negation as failure Lifschitz1999 .

Given an ASP program Π and a set X of atoms of Π, a reduct program Π^X is obtained from Π by Gelfond88 :

  • Deleting each rule that has, in its body, a negative literal of the form not a with a ∈ X;

  • Deleting every negative literal in the body of the remaining rules.

Thus, the reduct program Π^X is negation-free and has a unique minimal Herbrand model. If X coincides with this model of Π^X, then X is a stable model of Π. Furthermore, by using an operator Γ defined as “for any set X of atoms of Π, Γ(X) is the minimal Herbrand model of Π^X”, a stable model can also be described as a fixed point of Γ. From this definition, a minimal model that accepts classical negation is called an answer set instead of a stable model.
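As an illustration (a standard textbook example, not taken from this paper), consider the program with the two rules p ← not q and q ← not p. For X = {p}, the reduct deletes the second rule (its body contains not p and p ∈ X) and drops not q from the first, leaving only the fact p; the minimal Herbrand model of this reduct is {p}, which coincides with X, so {p} is a stable model. By symmetry, {q} is the program’s other stable model.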

Although ASP does not provide syntax to describe non-deterministic events, it is possible to use choice rules in order to verify each possible outcome of a choice. Considering for example that an agent is at a state s0 and chooses to perform action a with the possible outcomes being the future states s1, s2 and s3, this transition can be encoded using “1 { s1, s2, s3 } 1 :- s0, a.” in an ASP program. Thus, when s0 and a are true in an answer set (the agent has performed the action a in the state s0), only one of the future states s1, s2 or s3 is true in that answer set (i.e., reached by the agent).
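As a sketch of how such a choice rule can be grounded and solved in practice, the snippet below uses the clingo Python API to enumerate the answer sets of the example above. The tool choice is only illustrative (the experiments in this paper used CPLUS2ASP with iClingo), and current clingo syntax separates the elements of a cardinality rule with semicolons rather than commas.

    # Illustrative only: enumerate the answer sets of the choice-rule example with clingo.
    import clingo

    program = """
    s0. a.
    1 { s1; s2; s3 } 1 :- s0, a.
    """

    ctl = clingo.Control(["0"])            # "0" asks for all answer sets
    ctl.add("base", [], program)
    ctl.ground([("base", [])])

    answer_sets = []
    ctl.solve(on_model=lambda m: answer_sets.append(
        sorted(str(sym) for sym in m.symbols(shown=True))))

    # Three answer sets are expected, each with s0, a and exactly one of s1, s2, s3.
    for ans in answer_sets:
        print(ans)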

Since ASP can be used as a tool for providing reasoning and knowledge revision on a set of states and Reinforcement Learning allows for learning the solution of an MDP without the need of an explicit reward function, an opportunity arises to combine both methods in order to efficiently find the optimal policies for domains where unforeseen changes occur. The next section presents the action language that provides the appropriate definitions for domain modelling needed to bridge the gap between ASP and RL.

2.3 The Action Language BC+

The action language BC+ is defined over the stable model semantics and allows for some useful ASP constructs, such as a high-level description of actions and their effects, as a consequence of its structured abstract representation of transition systems BCplus2015 .

BC+ has two sets of symbols, action constants and fluent constants, and two kinds of formulas: fluent formulas, which contain only fluent constants, and action formulas, which contain at least one action constant and no fluent constants.

In BC+, an action description is a set of causal laws that have two forms. The first is:

caused F if G (1)

where F and G are formulas. If F and G are both fluent formulas, then Formula 1 is a static law. If F is an action formula and G is a fluent formula, then Formula 1 is an action dynamic law. The second form is called a fluent dynamic law and has the form:

caused F if G after H (2)

where H is a formula, F and G are fluent formulas, and F does not contain statically determined constants.

Causal dependencies between fluents in the same state are described by static laws. Direct effects of actions are represented by fluent dynamic laws, while causal dependencies between concurrently executed actions are expressed by action dynamic laws.

Given an action description D expressed in BC+, a stable model of the corresponding sequence of propositional formulas describes a path of length m in the transition system of D BCplus2015 . Given a maximum time instant m, this translation is a conjunction of:

  • the formula i:F ← i:G, for every static law and action dynamic law (Formula 1) in D and every 0 ≤ i ≤ m;

  • the formula (i+1):F ← (i+1):G ∧ i:H, for every fluent dynamic law (Formula 2) in D and every 0 ≤ i < m;

  • a choice formula over the value of 0:c, for every regular fluent constant c;

  • formulas expressing the uniqueness of names and the existence of values for the constants, for every 0 ≤ i ≤ m.

An action description in BC+ can be directly translated into an ASP program, providing sequences of actions as answer sets.

3 Combining ASP and MDP

This section presents the main contribution of this work, the ASP(RL) method, which is a combination of ASP and MDP for solving non-stationary decision making problems.

3.1 Finding the Set of States

In this work, Answer Set Programs translated from BC+ represent the states S, the actions A and the expected transition function T of an MDP M, along with the sets S_0 and S_G of initial states and goal states respectively. Let Π be one such ASP program, with S and A as its sets of states and actions respectively. Given an initial state s_0 ∈ S_0 and a goal state s_g ∈ S_G, an answer set of Π represents a trajectory of the form:

s_0, a_0, s_1, a_1, …, a_{n−1}, s_n (3)

where s_t and a_t are, respectively, the state and the action at time t, and s_n = s_g.

As ASP programs can have more than one answer set, let a set X contain all trajectories that represent sequences of actions leading from an initial state to a goal state. Thus, in the set of trajectories X there are a set of visited states and a set of performed actions that are subsets of those of the MDP defined in the logic program Π. This set of trajectories can therefore be used to describe a new MDP M′, as stated in the following Lemma.

Lemma 1

Given an MDP M described by a logic program Π, the set of trajectories X found for Π defines an MDP M′ = ⟨S′, A′, T′, R⟩ such that S′ ⊆ S and A′ ⊆ A, where T′(s′ | s, a) = T(s′ | s, a) iff s ∈ S′, s′ ∈ S′ and a ∈ A′, and T′(s′ | s, a) = 0 in the restricted cases detailed in the proof.

Proof

(Sketch) A logic program Π defines a set of restrictions on an MDP. These restrictions exclude a set of states that the agent may not be able to visit and a set of actions that the agent may not be able to perform. Also, changing actions or states implies changing the transitions as well. Thus, S′ ⊆ S and A′ ⊆ A.

The transition function T′ is then described considering the following conditions:

  1. The agent cannot visit a state that is forbidden: T′(s′ | s, a) = 0 if s′ ∉ S′;

  2. The agent cannot perform a forbidden action: T′(s′ | s, a) = 0 if a ∉ A′;

  3. The agent cannot perform a forbidden action in a state that it cannot visit: T′(s′ | s, a) = 0 if s ∉ S′ and a ∉ A′;

  4. The agent cannot visit states that have no transition probabilities: T′(s′ | s, a) = 0 if T(s′ | s, a) = 0;

  5. The agent is not allowed to perform some specific actions in some specific states: T′(s′ | s, a) = 0 if the pair (s, a) does not appear in any trajectory of X.

Thus, the transition function that is extracted from the answer sets is defined as:

T′(s′ | s, a) = T(s′ | s, a) if s, s′ ∈ S′, a ∈ A′ and (s, a) appears in some trajectory of X; T′(s′ | s, a) = 0 otherwise. (4)

When an MDP is not deterministic, choice rules are used to describe the transition possibilities (without the probabilities themselves), and a similar process is used to find the transition function T′.

Therefore, with this new set of states S′, set of actions A′ and transition function T′, it is possible to formalise an MDP of the form M′ = ⟨S′, A′, T′, R⟩. Since the reward comes from the interaction with the environment, there is no need to suppress any value in this function or even to know the reward function beforehand.
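A minimal sketch of this extraction step is given below, assuming each answer set has already been parsed into an alternating list of state and action atoms; the atom names and the parsing itself are not fixed by the paper, so this is only an illustration.

    # Illustrative sketch: build S', A' and the set of allowed (s, a, s') triples
    # from a collection of trajectories of the form [s0, a0, s1, a1, ..., sn].
    def restricted_mdp(trajectories):
        states, actions, allowed = set(), set(), set()
        for tr in trajectories:
            states.update(tr[0::2])                   # even positions hold states
            actions.update(tr[1::2])                  # odd positions hold actions
            for i in range(0, len(tr) - 2, 2):
                allowed.add((tr[i], tr[i + 1], tr[i + 2]))
        return states, actions, allowed

    # T'(s'|s, a) keeps the environment's probability only when (s, a, s') is in
    # 'allowed' (i.e., it appears in some trajectory of X); otherwise it is taken as 0.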

Once it is possible to formalise an MDP M′ that is a subset of another MDP M, it is still necessary to guarantee that the optimal solution of M′ is the optimal solution of M, as stated in Theorem 3.1.

Theorem 3.1

Given a reward function R and an evaluation criterion (i.e., maximising or minimising rewards), the optimal solution π′* for the MDP M′ is equivalent to the optimal solution π* for the MDP M, given the answer sets (trajectories) found as solutions to the logic program Π that represents M.

Proof

Both M and M′ have to maximise (or minimise) the same reward function R. If there are no restrictions on the set of states (S′ = S) and actions (A′ = A), we have that M′ = M and π′* = π*.

If there are restrictions represented in Π, then S′ ⊆ S, A′ ⊆ A and the feasible solutions (answer sets) for M′ are the same as those for M (by Lemma 1). Since the optimal solution must be a feasible solution, both π* and π′* are chosen from this same set of feasible solutions. Thus, given the same set of feasible solutions and the same evaluation criterion, π′* = π*.

3.2 The Algorithm ASP(RL)

Lemma 1 and Theorem 3.1 support the use of ASP to find the sets of states and actions of an MDP. By using RL it is possible to find an optimal stochastic solution to this MDP. Since ASP allows for revisions to be made in the sets of states and actions, if the environment changes at any time step, ASP can be used to find the new subsets S′ of states and A′ of actions of the modified MDP, and the values learnt from the previous interaction can be used as input for this new MDP. Algorithm 1 is the pseudocode of ASP(RL), which uses the non-monotonicity of ASP along with the exploratory nature of RL algorithms in stochastic domains.

Algorithm: ASP(RL)
Input: An MDP described as a logic program Π and an (optional) action-value function Q to be approximated.
Output: The approximated Q function.
1 Find the answer sets for Π. Update the function Q using the S′ and A′ found in the answer sets.
2 while the environment does not change do
3      Approximate Q using an RL method.
4 end while
5 Include the observed changes in Π. Call ASP(RL) with Π and the approximated Q function.
Algorithm 1: The ASP(RL) algorithm.

Algorithm 1 uses RL methods for approximating the Q function for the states and actions obtained by ASP. First, the domain is described as a logic program Π, using the BC+ vocabulary, and answer sets are found for it. From those answer sets (as shown in Lemma 1), the sets of states and actions are constructed for the MDP M′ that the agent will use to interact with the environment, along with the transition function T′. Once the MDP is formalised, the interaction with the environment and the search for the optimal solution begin, using any RL algorithm. This interaction continues until a change in the environment happens. At this instant, the algorithm returns the approximated Q function.

The algorithm works in non-stationary environments by including the observed environment changes in Π, so that ASP can be used again to find the new sets of states and actions along with the transition function. Since a Q function has been approximated during the previous interaction, modifications are performed on it: the state-action pairs that are in the new set of answer sets are added to the action-value function, and the pairs that are not in this set are removed. The state-action pairs that were in the function, and that are also in the answer sets, remain in the action-value function with their previously learned values. Therefore, the interaction with a changing environment is done by calling ASP(RL) with Π, augmented with the observed changes, and the action-value function returned by the previous call.
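The sketch below condenses Algorithm 1 and the pruning of the Q function described above into Python. All helper names (solve_asp, run_episode, the env interface, and restricted_mdp from the earlier sketch) are hypothetical placeholders, not part of the paper’s implementation.

    # Illustrative sketch of the ASP(RL) loop (Algorithm 1), not the authors' code.
    def asp_rl(program, env, solve_asp, run_episode, Q=None):
        Q = {} if Q is None else Q
        # Find answer sets for the logic program and build the restricted MDP (Lemma 1).
        states, actions, allowed = restricted_mdp(solve_asp(program))

        # Keep only previously learned values that are still consistent with the answer sets.
        Q = {(s, a): v for (s, a), v in Q.items() if s in states and a in actions}

        while not env.has_changed():
            run_episode(Q, env, states, actions)      # one episode of SARSA or Q-Learning

        # Include the observed changes in the program and restart with the current Q.
        program = program + env.observed_changes_as_rules()
        return asp_rl(program, env, solve_asp, run_episode, Q)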

4 Experiments

Experiments were performed in a non-deterministic, non-stationary grid world that allowed the execution of only one of four actions at each time step: go up, go down, go left and go right. The transition probabilities of the environment were defined as 80% for the transition to happen as expected (e.g., executing go up makes the agent go up with probability 80%) and a 20% chance of moving orthogonally to the desired direction (e.g., executing go up may make the agent go left or right, with a 10% chance for each side).

The grid world may have walls (W) and holes (H), each of which occupies a single cell of the grid. When the agent performs an action and hits a wall, it stays in the same state; when it executes an action and falls into a hole, the episode ends. In this domain, an agent that starts in the lowermost, leftmost cell has as its goal to reach the topmost, rightmost cell. The reward function used in this domain returns distinct values for reaching the goal, for falling into a hole and for any other event. It is important to notice that the transition function and the reward function are unknown to the agent. For this grid world, the representation used by the agent is the value of its position in X and Y. These values are not treated by the agent as an X by Y matrix, but as a set of atoms, one for each pair of X and Y values found in an answer set.
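A sketch of the grid-world dynamics just described, with the 80/10/10 transition noise, is given below. The reward constants are placeholders (the exact values are not reported above), and the treatment of the border as a wall is an assumption.

    # Illustrative environment step; reward values and border handling are assumptions.
    import random

    UP, DOWN, LEFT, RIGHT = (0, 1), (0, -1), (-1, 0), (1, 0)
    ORTHOGONAL = {UP: (LEFT, RIGHT), DOWN: (LEFT, RIGHT), LEFT: (UP, DOWN), RIGHT: (UP, DOWN)}
    REWARD_GOAL, REWARD_HOLE, REWARD_STEP = 1.0, -1.0, 0.0   # placeholder values

    def step(pos, action, walls, holes, goal, width, height):
        roll = random.random()
        if roll < 0.8:                       # 80%: move as intended
            move = action
        elif roll < 0.9:                     # 10%: slip to one orthogonal direction
            move = ORTHOGONAL[action][0]
        else:                                # 10%: slip to the other orthogonal direction
            move = ORTHOGONAL[action][1]
        nxt = (pos[0] + move[0], pos[1] + move[1])
        if not (0 <= nxt[0] < width and 0 <= nxt[1] < height) or nxt in walls:
            nxt = pos                        # hitting a wall (or the border) keeps the agent in place
        if nxt in holes:
            return nxt, REWARD_HOLE, True    # falling into a hole ends the episode
        if nxt == goal:
            return nxt, REWARD_GOAL, True
        return nxt, REWARD_STEP, False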

(a) Map 1: Initial configuration.
(b) Map 2.
(c) Map 3.
(d) Map 4.
Figure 1: Grid worlds used in the experiments. Squares labelled with W represent walls and with H, holes.

This grid world suffers changes in a manner that is previously unknown to the learning agent. In this work, ASP(RL) is evaluated in three distinct situations; in each of them the agent starts in the map shown in Figure 1(a), which, after 5000 episodes, changes to one of the other maps in Figure 1. For this work, changes observed in the environment were manually entered into the logic program. Nevertheless, this can be done automatically by using an online ASP method.

The map in Figure 1(a) represents a grid world with no walls or holes. In this case, any combination of actions that makes the agent go up and right leads the agent to the goal. Figure 1(b) represents a grid world with two walls and two holes. Figure 1(c) shows a grid world containing more walls and holes than the previous situation. In this case, the agent has fewer action options to achieve the goal state. Finally, Figure 1(d) represents a grid world in which there is only one policy for achieving the goal state with the minimum number of actions. Any other policy for this grid world will necessarily make the agent hit a wall before reaching the goal state.

The arrows in the maps shown in Figure 1 represent the feasible policies obtained by ASP with the minimum number of steps. Note that these policies do not represent the transition probabilities of the environment.

In the first situation, the environment changes from the map in Figure 1(a) to that in Figure 1(b). In this case, we can see that there is a reduction in the number of policies with the minimum number of steps.

In the second situation, the change occurs from the map shown in Figure 1(a) to that in Figure 1(c). By analysing the arrows in the final grid world (Fig. 1(c)), we can see that there is an even greater reduction in the number of policies than in the previous situation, since there are more walls and holes in this map, which implies fewer safe actions (arrows) available.

In the final situation the environment changes from the map in Figure 1(a) to the one in Figure 1(d). In this case the MDP has only one optimal solution. This situation was chosen because the answer set provides the only optimal solution almost instantly, whereas when an action-value function is approximated by RL (without using the answer sets), every possible action in every possible state is considered, leading to a costly search procedure.

5 Results

In this section we use the situations described above to compare the learning processes of SARSA and Q-Learning with those of ASP(SARSA) and ASP(Q-Learning), which are ASP(RL) methods in which SARSA and Q-Learning are used along with ASP. This comparison is accomplished with the following criteria: the return of each episode, the number of steps needed to reach the goal state, and the root-mean-square deviation (RMSD) of the action-value function at time t with respect to time t − 1, according to Equation 5 below.

RMSD(Q_t) = √( (1 / (|S|·|A|)) Σ_{s ∈ S} Σ_{a ∈ A} ( Q_t(s, a) − Q_{t−1}(s, a) )² ) (5)
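A small helper computing Equation 5 for a tabular action-value function stored as a dictionary (an illustrative reading of the equation, not the authors’ code):

    # Illustrative computation of the RMSD between successive Q tables.
    from math import sqrt

    def rmsd(q_now, q_prev, state_action_pairs):
        diffs = [(q_now.get(sa, 0.0) - q_prev.get(sa, 0.0)) ** 2 for sa in state_action_pairs]
        return sqrt(sum(diffs) / len(diffs))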
(a) Steps needed to reach the goal state.
(b) Total returns received per episode.
Figure 2: Results for the first situation for every episode.
(a) Steps needed to reach the goal state.
(b) Total returns received per episode.
Figure 3: Results for the first situation.
(a) Steps needed to reach the goal state.
(b) Total returns received per episode.
Figure 4: Results for the second situation for every episode.
(a) Steps needed to reach the goal state.
(b) Total returns received per episode.
Figure 5: Results for the second situation.
(a) Steps needed to reach the goal state.
(b) Total returns received per episode.
Figure 6: Results for the third situation for every episode.
(a) Steps needed to reach the goal state.
(b) Total returns received per episode.
Figure 7: Results for the third situation.

The graphs in Figures 2, 3, 4, 5, 6 and 7 present measurements for the four algorithms applied in the three situations considered. Figures 3, 5 and 7 depict the results of the first 50 episodes for the first map and then skip directly to the 5000th episode, in order to present the measurements after the environment change occurs (episodes 51 to 4999 were removed from these figures). The respective results for every episode are shown in Figures 2, 4 and 6.

The results for the number of steps in the first situation are presented in Figure 3(a), and Figure 3(b) presents the returns. In the first map (Figure 1(a)) all four algorithms present the same number of steps and returns during the initial 50 episodes shown. After the 5000th episode, the number of steps of ASP(Q-Learning) and ASP(SARSA) decreases faster than that of Q-Learning and SARSA, while the returns of ASP(Q-Learning) and ASP(SARSA) increase faster than those of Q-Learning and SARSA. This difference in the performance of the ASP(RL) and RL algorithms after the change occurs in the environment is due to the fact that ASP(RL) reuses the Q function approximated in the previous map.

For the second situation, Figure 5(a) presents the number of steps and Figure 5(b) the returns for the four algorithms. Regarding the number of steps, it is possible to notice that although ASP(Q-Learning) and ASP(SARSA) use the information acquired from previous experience, they still need the same number of steps as Q-Learning and SARSA in all episodes. However, the returns for ASP(Q-Learning) and ASP(SARSA) are higher than the returns from Q-Learning and SARSA when the change in the map occurs (5000th episode). This similarity in the number of steps for the four algorithms is due to the large change that occurred in the environment; thus ASP(Q-Learning) and ASP(SARSA) still need to learn by interacting with the new environment, even though they use information from the previous map.

The number of steps and returns for the third situation are presented in Figures 7(a) and 7(b) respectively. In both returns and steps, when the change occurs, the use of previously learned values enhances the performance of ASP(Q-Learning) and ASP(SARSA). While there is a slow decrease in the number of steps and a slow increase in the returns for Q-Learning and SARSA, ASP(Q-Learning) and ASP(SARSA) can quickly learn the only optimal policy, since this policy is already known from the previous map.

Experiments were performed on a 1.66 GHz Core2Duo with 4 GB of RAM running Debian 9 (the testing version at the time). Logic programs were written in BC+ BCplus2015 and translated to ASP using CPLUS2ASP babb2013 , which uses iClingo ASP2012 to find answer sets. For finding the optimal solution, Q-Learning and SARSA were implemented in Python 3.5 using only built-in libraries. Thirty training sessions were executed for each algorithm. The same parameters were used in all the experiments: learning rate α, discount factor γ, exploration/exploitation rate ε, and a randomly initialised Q table.

6 Discussion

The results shown in the previous section present the best, worst and average cases of the proposed ASP(RL) method.

The first map (Figure 1(a)) represents the worst case for ASP(RL). As can be seen in the graphs in Figures 2, 3, 4, 5, 6 and 7, the performance of ASP(Q-Learning) and ASP(SARSA) is the same as that of Q-Learning and SARSA. This is due to the fact that the reduction in the sets of states and actions is minimal (since there is no restriction in this map) and the ASP(RL) methods use the same S and A as a plain RL method.

The best case is represented by the last map (Figure 1(d)). In this case, there is only one feasible policy and, thus, this is the optimal policy. Although the learning process was executed, in situations like this learning is not necessary, since the only feasible policy is already provided by an answer set.

A similar case occurs when there is no feasible policy. In such situations there is also no need to perform the learning process, since it is already known from the answer sets that there is no feasible (or optimal) policy and the problem cannot be solved.

The average case is presented by the second and third maps (Figures 1(b) and 1(c) respectively). In these situations it is possible to notice that there is a reduction in the sets of states and actions, along with a reduction in the search space. Nevertheless, the acceleration of the learning process depends on how much the environment has changed from the previous situation. For example, the gain in learning time in the second situation (Figures 4 and 5) is greater than that in the third situation (Figures 6 and 7).

ASP(RL) was not only capable of dealing with non-stationary, non-deterministic environments, but it also reduced the search space, thus finding the optimal solution in fewer interactions with the environment than RL alone. This reduction in the search space is related to the problem being solved and not only to the proposed method.

7 Related Work

The method proposed in this paper is in line with the work reported in Mohan2015IEEE ; Mohan2015WHR , where ASP is used to find a description of the domain and RL is applied in the search for the optimal solution. Although both proposals combine similar tools, their uses differ. While the present work formalises an MDP from the answer sets, the method proposed in Mohan2015IEEE ; Mohan2015WHR finds only one answer set for the problem, where each atom in this set defines a hierarchical POMDP that has to be solved.

A related approach is the combination of ASP with action costs khandelwal2014 ; yang2014 . Although this method also uses a logic program to describe the domain, it uses a method different from RL to find the action costs. After each action is executed, the agent finds new plans to reach the goal; the update of the state-action pair’s value is not based on the Temporal Difference method of RL.

Another work that also deals with sequential decision making is P-Log p-log ; Baral-PLog , which calculates transition probabilities by sampling the environment, but without considering the cost of performing an action. The present work differs from P-Log in that our goal is to find the optimal solution regarding not only the transition probabilities, but also the action costs.

Also related to our work is the Saturated Path-Constrained MDP (SPC-MDP) Kolobov2014 . In an SPC-MDP, a solution is found by a constraint satisfaction procedure, which closely relates to the results obtained with the use of ASP to define the set of states for an MDP, as proposed in this paper. However, while the approach described in Kolobov2014 uses a Dynamic Programming algorithm to find the solutions, ASP(RL) uses the interaction with the environment in order to approximate the action-value function in non-stationary decision making problems, which (to the best of our knowledge) has never been attempted before.

Works that are somewhat related to our approach, and that can be used when searching for the optimal policy, are the ones that deal with changing reward functions, such as Experts ; oMDP ; ArbRew . Since ASP(RL) uses RL, changes in the reward function are learned by the agent and do not affect the algorithm. Another related approach is hierarchical MDPs (such as the works of RL-TOPS ; AbsBeh ), which can also be incorporated, as in the method proposed by Mohan2015IEEE ; Mohan2015WHR described at the beginning of this section. Although the decomposition proposed by hierarchical MDPs provides more abstraction when searching for the solution, ASP(RL) deals with changes in S and A, such as changes in the number of states and actions available in the environment or in their representation.

To the best of our knowledge, these are the only works related to our method in which the focus is on changes in the sets of states and actions, and not only in the transition and reward functions. Nevertheless, a direct comparison with these methods is not possible, since their goals and results differ from those of ASP(RL).

8 Conclusion

This paper presented a method for efficiently solving non-stationary Markov Decision Processes (MDP). The proposed approach, called ASP(RL), uses a combination of Answer Set Programming (ASP) and Reinforcement Learning (RL) in which ASP provides the sets of states and actions in domains where unforeseen changes may happen, while RL is used to approximate an action-value function by means of interactions with the environment. In ASP(RL), Answer Set Programming is used as a tool for reasoning and knowledge revision, and Reinforcement Learning allows for learning the solution of an MDP without the need for an explicit stationary reward function.

Experiments were performed in a changing grid world, and the results show that the use of ASP to find the sets of states and actions effectively reduces the search space for finding optimal policies of Markov Decision Processes in complex domains, as well as in domains that allow only a few possible policies. Not only did ASP(RL) allow a faster approximation of the action-value function (compared to standard RL algorithms), but the process could continue to interact with a changing environment indefinitely.

ASP(RL) is capable of dealing with unforeseen changes in the domain, thus solving non-stationary decision making problems. To the best of our knowledge, this has never been accomplished before.

Future work shall be directed towards a full integration of RL into the ASP engine, facilitating the use of ASP when new states appear in a non-deterministic environment, with the possibility of reviewing the whole set of states seamlessly.

Acknowledgements.
Leonardo A. Ferreira was partially funded by CAPES. Reinaldo A. C. Bianchi acknowledges the support of FAPESP (grants 2011/19280-8 and 2016/21047-3). Paulo E. Santos acknowledges the support of CNPq (grants 307093/2014-0 and 473989/2013-1). Ramon Lopez de Mantaras acknowledges the support of Generalitat de Catalunya (project 2014-SGR-118) and CSIC (project NASAID 201550E022).

References

  • (1) Babb, J., Lee, J.: Cplus2ASP: Computing action language C+ in answer set programming. In: P. Cabalar, T.C. Son (eds.) Logic Programming and Nonmonotonic Reasoning, vol. 8148, pp. 122–134. Springer Berlin Heidelberg (2013)
  • (2) Babb, J., Lee, J.: Action language BC+. Journal of Logic and Computation (2015)
  • (3) Balduccini, M., Gelfond, M., Nogueira, M.: Answer set based design of knowledge systems. Annals of Mathematics and Artificial Intelligence 47(1-2), 183–219 (2006)
  • (4) Balduccini, M., Gelfond, M., Nogueira, M., Watson, R.: Planning with the USA-Advisor. In: 3rd NASA International workshop on Planning and Scheduling for Space. Houston, Texas (2002)
  • (5) Balduccini, M., Gelfond, M., Watson, R., Nogueira, M.: The USA-Advisor. In: G. Goos, J. Harmanis, J. van Leeuwen, T. Eiter, W. Faber, M. Truszczyński (eds.) Logic Programming and Nonmonotonic Reasoning, vol. 2173, pp. 439–442. Springer Berlin Heidelberg, Berlin, Heidelberg (2001)
  • (6) Baral, C., Gelfond, M., Rushton, N.: Probabilistic reasoning with answer sets. Theory and Practice of Logic Programming 9(1), 57 (2009)
  • (7) Bellman, R.: On the theory of dynamic programming. Proceedings of the National Academy of Sciences 38(8), 716–719 (1952)
  • (8) Bellman, R.: A Markovian decision process. Indiana University Mathematics Journal 6(4), 679–684 (1957)
  • (9) Bellman, R.E., Dreyfus, S.E.: Applied dynamic programming, 4 edn. Princeton Univ. Press (1971)
  • (10) Even-dar, E., Kakade, S.M., Mansour, Y.: Experts in a markov decision process. In: L.K. Saul, Y. Weiss, L. Bottou (eds.) Advances in Neural Information Processing Systems 17, pp. 401–408. MIT Press (2005)
  • (11) Even-Dar, E., Kakade, S.M., Mansour, Y.: Online markov decision processes. Mathematics of Operations Research 34(3), 726–736 (2009)
  • (12) Gebser, M., Kaminski, R., Kaufmann, B.: Answer set solving in practice. Morgan & Claypool Publishers (2013)
  • (13) Gelfond, M., Lifschitz, V.: The stable model semantics for logic programming. In: R. Kowalski, K. Bowen (eds.) Proceedings of International Logic Programming Conference and Symposium, pp. 1070–1080. MIT Press (1988)
  • (14) Gelfond, M., Rushton, N.: Causal and probabilistic reasoning in P-log. Heuristics, Probabilities and Causality. A tribute to Judea Pearl pp. 337–359 (2010)
  • (15) Khandelwal, P., Yang, F., Leonetti, M., Lifschitz, V., Stone, P.: Planning in action language BC while learning action costs for mobile robots. In: Proceedings of the Twenty-Fourth International Conference on Automated Planning and Scheduling, ICAPS 2014, Portsmouth, New Hampshire, USA, June 21-26, 2014 (2014)
  • (16) Lifschitz, V.: Answer set programming and plan generation. Artificial Intelligence 138(1–2), 39 – 54 (2002)
  • (17) McCarthy, J.: Elaboration tolerance. In: Proc. of the Fourth Symposium on Logical Formalizations of Commonsense Reasoning (Common Sense 98), vol. 98. London, UK (1998)
  • (18) Nogueira, M., Balduccini, M., Gelfond, M., Watson, R., Barry, M.: An A-Prolog decision support system for the space shuttle. In: G. Goos, J. Hartmanis, J. van Leeuwen, I.V. Ramakrishnan (eds.) Practical Aspects of Declarative Languages, vol. 1990, pp. 169–183. Springer Berlin Heidelberg, Berlin, Heidelberg (2001)
  • (19) Ryan, M.R., Pendrith, M.D.: RL-TOPs: An architecture for modularity and re-use in reinforcement learning. In: Proceedings of the Fifteenth International Conference on Machine Learning, pp. 481–487. Morgan Kaufmann (1998)

  • (20) Ryan, M.R.K.: Using abstract models of behaviours to automatically generate reinforcement learning hierarchies. In: Proceedings of the 19th International Conference on Machine Learning, pp. 522–529. Morgan Kaufmann (2002)
  • (21) Sprauel, J., Teichteil-Königsbuch, F., Kolobov, A.: Saturated path-constrained MDP: Planning under uncertainty and deterministic model-checking constraints. In: Proc. of 28th AAAI Conf. on Artificial Intelligence (AAAI), pp. 2367–2373 (2014)
  • (22) Sridharan, M., Gelfond, M., Zhang, S., Wyatt, J.: Mixing non-monotonic logical reasoning and probabilistic planning for robots. In: Workshop on Hybrid Reasoning @ IJCAI 2015 (2015)
  • (23) Sutton, R.S., Barto, A.G.: Reinforcement learning: An introduction, 2nd edn. (draft in progress). MIT Press (2015)
  • (24) Watkins, C.J.C.H.: Learning from delayed rewards. PhD thesis, University of Cambridge England (1989)
  • (25) Yang, F., Khandelwal, P., Leonetti, M., Stone, P.: Planning in answer set programming while learning action costs for mobile robots. In: AAAI Spring 2014 Symposium on Knowledge Representation and Reasoning in Robotics (AAAI-SSS) (2014)
  • (26) Yu, J.Y., Mannor, S., Shimkin, N.: Markov decision processes with arbitrary reward processes. Mathematics of Operations Research 34(3), 737–757 (2009)
  • (27) Zhang, S., Sridharan, M., Wyatt, J.L.: Mixed logical inference and probabilistic planning for robots in unreliable worlds. IEEE Transactions on Robotics 31(3), 699–713 (2015)