Learning Non-Markovian Reward Models in MDPs

by   Gavin Rens, et al.
Université Libre de Bruxelles

There are situations in which an agent should receive rewards only after having accomplished a series of previous tasks. In other words, the reward that the agent receives is non-Markovian. One natural and quite general way to represent history-dependent rewards is via a Mealy machine, a finite state automaton that produces output sequences (rewards in our case) from input sequences (state/action observations in our case). In our formal setting, we consider a Markov decision process (MDP) that models the dynamics of the environment in which the agent evolves and a Mealy machine synchronised with this MDP to formalise the non-Markovian reward function. While the MDP is known to the agent, the reward function is unknown to the agent and must be learnt. Learning non-Markovian reward functions is a challenge. Our approach to this problem is a careful combination of Angluin's L* active learning algorithm to learn finite automata, testing techniques for establishing conformance of finite model hypotheses, and optimisation techniques for computing optimal strategies in Markovian (immediate) reward MDPs. We also show how our framework can be combined with classical heuristics such as Monte Carlo Tree Search. We illustrate our algorithms and a preliminary implementation on two typical examples for AI.



1 Introduction

Traditionally, a Markov Decision Process (MDP) models the probability distribution to go to a state $s'$ from the current state $s$ while taking a given action $a$, together with an immediate reward that is received while performing $a$ from $s$. This immediate reward is defined regardless of the history of states traversed in the past. This immediate reward has thus the Markovian property. But many situations require the reward to depend on the history of states visited so far. A reward may depend on the particular sequence of (sub)tasks that has been completed. For instance, when a nuclear power plant is shut down in an emergency, there is a specific sequence of operations to follow to avoid a disaster; or in legal matters, there are procedures to follow which require documents to be submitted in a particular order, etc.

Learning and maintaining non-Markovian reward functions is useful for several reasons:

  • Many tasks are described intuitively as a sequence of sub-tasks or milestones, each with their own reward (cf. Sec. 2).

  • Possibly, not all relevant features are available in state descriptions, or states are partially observable, making it necessary to visit several states before (more) accurate rewards can be dispensed [2, 17].

  • Automata (reward machines) are useful for modeling desirable and undesirable situations, which facilitates tracking and predicting beneficial and detrimental situations [1, 13, 10].

In practice, it can be argued that non-Markovian tasks are more the norm than Markovian ones.

In this work, we describe an active learning algorithm to automatically learn, through experiments, reward models defined as Mealy machines. A Mealy machine is a deterministic finite automaton (DFA) that produces output sequences, which are rewards in our case, from input sequences, which are state/action observations in our setting. We refer to such finite state reward models as Mealy Reward Machines (MRM). Our algorithm is based on a careful combination of Angluin’s L* active learning algorithm [3] to learn finite automata, testing techniques for establishing the conformance of (hypothesized) finite models, and optimization techniques for computing optimal strategies in classical immediate reward MDPs. By using formal methods for automata inference for learning a reward machine, one can draw on the vast literature on formal methods and model-checking to obtain guarantees under precisely stated conditions.

Our contribution is to show how a non-Markovian reward function for an MDP can be actively learnt and exploited to play optimally according to this non-Markovian reward function in the MDP. We provide a framework for completely and correctly learning the underlying reward function with guarantees under mild assumptions. To the best of our knowledge, this is the first work which employs traditional automata inference for learning a non-Markovian reward model in an MDP setting.

Next, we discuss related work and then cover the necessary formal concepts and notations. In Section 4, we define our Mealy Reward Machine (MRM), which represents an agent’s non-Markovian reward model. Section 5 explains how an agent can infer/learn an underlying MRM and presents one method for exploiting a learnt MRM. Section 6 reports on experiments involving learning and exploiting MRMs; we consider two scenarios. The last section concludes this paper and points to future research directions.

2 Related Work

There has been a surge of interest in non-Markovian reward functions recently, with most papers on the topic having been published since 2017. But unlike our paper, most of those papers are not concerned with learning non-Markovian reward functions.

A closely related and possibly overlapping research area is the use of temporal logic (especially linear temporal logic (LTL)) for specifying tasks in Reinforcement Learning (RL) [1, 11, 12, 9, 10]. Building on recent progress in temporal logics over finite traces (LTL$_f$), the authors of [6] adopt linear dynamic logic on finite traces (LDL$_f$; an extension of LTL$_f$) for specifying non-Markovian rewards, and provide an automaton construction algorithm for building a Markovian model. The approach is claimed to offer strong minimality and compositionality guarantees.

An earlier publication that deserves to be mentioned is [4]. In this paper, the authors propose to encode non-Markovian rewards by assigning values using temporal logic formulae. The reward for being in a state is then the value associated with a formula that is true in that state. As is the case in the present work, they considered systems that can be modeled as MDPs, but the non-Markovian reward function is given and does not need to be learnt. The authors of that paper were the first to abbreviate the class of MDPs with non-Markovian reward as decision processes with non-Markovian reward, or NMRDP for short.

In [16], the authors present the Non-Markovian Reward Decision Process Planner: a software platform for the development and experimentation of methods for decision-theoretic planning with non-Markovian rewards.

In another paper, the authors are concerned with both the specification and effective exploitation of non-Markovian reward in the context of MDPs [8]. They specify non-Markovian reward-worthy behavior in LTL. These behaviors are then translated to corresponding deterministic finite state automata whose accepting conditions signify satisfaction of the reward-worthy behavior. These automata accepting conditions form the basis of Markovian rewards that can be solved by off-the-shelf MDP planners, while preserving policy optimality guarantees. In that sense, it is similar to our framework.

None of the research mentioned above is concerned with learning non-Markovian reward functions. However, in the work of [17], an agent incrementally learns a reward machine (RM) in a partially observable MDP (POMDP). They use a set of traces as data to solve an optimization problem. “If at some point [the RM] is found to make incorrect predictions, additional traces are added to the training set and a new RM is learned.” Each trace is of the form $(o_0, a_0, r_0, \ldots, o_k, a_k, r_k)$, where the $o_i$ are observations (as required for POMDP models). An RM is specified as the constraints of the optimization model. They use Tabu search to solve the optimization problem. They have a theorem stating “When the set of training traces (and their lengths) tends to infinity […], any perfect RM with respect to $L$ and at most $u_{max}$ states will be an optimal solution to the formulation LRM.” Here $L$ is a labeling function which assigns truth values to propositional symbols given an observation, action and next observation, and $u_{max}$ is an upper bound on the number of states in an RM. A perfect RM is simply an RM that makes the correct predictions for a given POMDP. Their approach is also active learning: if on any step there is evidence that the current RM might not be the best one, their approach will attempt to find a new one. One strength of their method is that the reward machine is constructed over a set of propositions, where propositions can be combined to represent transition/rewarding conditions in the machine. Currently, our approach can take only single observations as transition/rewarding conditions. However, they do not consider the possibility of computing optimal strategies using model-checking techniques.

Moreover, our approach is different to that of Toro Icarte et al. [17] in that ours is an active learning approach, where the agent is guided by the L* algorithm to answer exactly the queries required to find the underlying reward machine. It seems that the approach of Toro Icarte et al. does not have this guidance and interaction with the learning algorithm.

3 Formal Preliminaries

We review Markov Decision Processes (MDPs) and Angluin-style learning of Mealy machines.

An (immediate-reward) MDP is a tuple $\langle S, A, T, R, s_0 \rangle$, where

  • $S$ is a finite set of states,

  • $A$ is a finite set of actions,

  • $T: S \times A \times S \to [0, 1]$ is the state transition function such that $T(s, a, s')$ is the probability that action $a$ causes a system transition from state $s$ to state $s'$,

  • $R: S \times A \to \mathbb{R}$ is the reward function such that $R(s, a)$ is the immediate reward for performing action $a$ in state $s$, and

  • $s_0 \in S$ is the initial state the system is in.

A non-rewarding MDP (nrMDP) is a tuple $\langle S, A, T, s_0 \rangle$ without a reward function.
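As a minimal sketch of how the two tuples above could be held in memory, the following Python fragment stores the transition function as nested dictionaries. The class names `NrMDP` and `MDP`, the field names and the two-state toy model are assumptions made for illustration only; they do not come from the paper or its implementation.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

State = str
Action = str


@dataclass(frozen=True)
class NrMDP:
    """Non-rewarding MDP <S, A, T, s0> (hypothetical container)."""
    states: FrozenSet[State]
    actions: FrozenSet[Action]
    # transitions[(s, a)][s2] holds T(s, a, s2); absent successors have probability 0.
    transitions: Dict[Tuple[State, Action], Dict[State, float]]
    initial_state: State

    def validate(self) -> None:
        """Raise ValueError unless every T(s, a, .) is a probability distribution."""
        if self.initial_state not in self.states:
            raise ValueError("initial state is not in S")
        for (s, a), dist in self.transitions.items():
            if s not in self.states or a not in self.actions:
                raise ValueError(f"unknown state/action pair {(s, a)}")
            if not set(dist) <= self.states:
                raise ValueError(f"unknown successor in T({s}, {a}, .)")
            if abs(sum(dist.values()) - 1.0) > 1e-9:
                raise ValueError(f"T({s}, {a}, .) does not sum to 1")


@dataclass(frozen=True)
class MDP(NrMDP):
    """Immediate-reward MDP <S, A, T, R, s0>: an nrMDP plus R(s, a)."""
    rewards: Dict[Tuple[State, Action], float]


# Toy model with made-up numbers.
toy = MDP(
    states=frozenset({"s0", "s1"}),
    actions=frozenset({"go", "stay"}),
    transitions={
        ("s0", "go"): {"s1": 0.9, "s0": 0.1},
        ("s0", "stay"): {"s0": 1.0},
        ("s1", "go"): {"s1": 1.0},
        ("s1", "stay"): {"s1": 1.0},
    },
    initial_state="s0",
    rewards={("s0", "go"): 1.0, ("s0", "stay"): 0.0,
             ("s1", "go"): 0.0, ("s1", "stay"): 0.0},
)
toy.validate()
```

Dropping the `rewards` field gives the nrMDP used in the rest of the paper, where rewards come from a separate machine instead.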

The following description is from [18]. Angluin [3] showed that finite automata can be learned using the so-called membership and equivalence queries. Her approach views the learning process as a game between a minimally adequate teacher (MAT) and a learner who wants to discover the automaton. In this framework, the learner has to infer the input/output behavior of an unknown Mealy machine by asking queries to a teacher. The MAT knows the Mealy machine $\mathcal{M}$. Initially, the learner only knows the inputs $I$ and outputs $O$ of $\mathcal{M}$. The task of the learner is to learn $\mathcal{M}$ through two types of queries:

  • With a membership query (MQ), the learner asks what the output is in response to an input sequence $\sigma \in I^*$. The teacher answers with output sequence $A_{\mathcal{M}}(\sigma)$.

  • With an equivalence query (EQ), the learner asks if a hypothesized Mealy machine $\mathcal{H}$ with inputs $I$ and outputs $O$ is correct, that is, whether $\mathcal{H}$ and $\mathcal{M}$ are equivalent ($\mathcal{H} \approx \mathcal{M}$). The teacher answers yes if this is the case. Otherwise she answers no and supplies a counter-example $\sigma \in I^*$ that distinguishes $\mathcal{H}$ and $\mathcal{M}$ (i.e., such that $A_{\mathcal{H}}(\sigma) \neq A_{\mathcal{M}}(\sigma)$).
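To make the membership query concrete, here is a small sketch of a Mealy machine and of a teacher that answers a membership query by running the machine on the input sequence. The class `MealyMachine`, the function `membership_query` and the two-state example machine are illustrative assumptions, not part of any library or of the paper's code.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

State = Hashable
Symbol = Hashable


class MealyMachine:
    """Deterministic Mealy machine given by lookup tables (hypothetical helper)."""

    def __init__(self,
                 initial: State,
                 delta: Dict[Tuple[State, Symbol], State],
                 out: Dict[Tuple[State, Symbol], Symbol]) -> None:
        self.initial = initial
        self.delta = delta  # (state, input) -> next state
        self.out = out      # (state, input) -> output

    def run(self, inputs: Sequence[Symbol]) -> List[Symbol]:
        """Return the output sequence produced for the given input sequence."""
        state, outputs = self.initial, []
        for symbol in inputs:
            outputs.append(self.out[(state, symbol)])
            state = self.delta[(state, symbol)]
        return outputs


def membership_query(teacher: MealyMachine, sigma: Sequence[Symbol]) -> List[Symbol]:
    """The teacher knows the machine and simply reports its outputs for sigma."""
    return teacher.run(sigma)


# Example machine: outputs 1 exactly when "b" is read directly after "a".
target = MealyMachine(
    initial="q0",
    delta={("q0", "a"): "q1", ("q0", "b"): "q0",
           ("q1", "a"): "q1", ("q1", "b"): "q0"},
    out={("q0", "a"): 0, ("q0", "b"): 0,
         ("q1", "a"): 0, ("q1", "b"): 1},
)
answer = membership_query(target, ["a", "b", "b"])  # [0, 1, 0]
```

The second "b" yields 0 because the machine has already returned to its initial state, which is the kind of history dependence a reward machine is meant to capture.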

The L* algorithm incrementally constructs an observation table with entries being elements from $O$.

Two crucial properties of the observation table allow for the construction of a Mealy machine [18]: closedness and consistency. We call a closed and consistent observation table complete.
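The two properties can be illustrated with a short sketch. It assumes the usual Mealy-machine formulation of the observation table: a set `S` of access words, a list `E` of non-empty suffixes, and a table entry for `(s, e)` equal to the last `|e|` outputs produced on `s` followed by `e`. The function names and the toy `target` oracle are hypothetical, and the definitions used are the standard ones rather than a quotation from [18].

```python
from typing import Callable, List, Sequence, Tuple

Word = Tuple[str, ...]
Oracle = Callable[[Word], List[int]]


def row(prefix: Word, suffixes: Sequence[Word], query: Oracle) -> tuple:
    """Table row of `prefix`: for each suffix e, the last |e| outputs of prefix + e."""
    return tuple(tuple(query(prefix + e)[-len(e):]) for e in suffixes)


def is_closed(S: Sequence[Word], E: Sequence[Word],
              alphabet: Sequence[str], query: Oracle) -> bool:
    """Closed: every one-symbol extension of a word in S has a row already present in S."""
    known = {row(s, E, query) for s in S}
    return all(row(s + (i,), E, query) in known for s in S for i in alphabet)


def is_consistent(S: Sequence[Word], E: Sequence[Word],
                  alphabet: Sequence[str], query: Oracle) -> bool:
    """Consistent: words in S with equal rows keep equal rows after any extension."""
    for s1 in S:
        for s2 in S:
            if s1 != s2 and row(s1, E, query) == row(s2, E, query):
                if any(row(s1 + (i,), E, query) != row(s2 + (i,), E, query)
                       for i in alphabet):
                    return False
    return True


def target(word: Word) -> List[int]:
    """Toy membership oracle: output 1 exactly when "b" directly follows "a"."""
    outputs, last_was_a = [], False
    for symbol in word:
        outputs.append(1 if (last_was_a and symbol == "b") else 0)
        last_was_a = symbol == "a"
    return outputs


ALPHABET = ("a", "b")
E = [("a",), ("b",)]
closed_before = is_closed([()], E, ALPHABET, target)            # row of ("a",) is new
closed_after = is_closed([(), ("a",)], E, ALPHABET, target)     # both rows now covered
consistent_after = is_consistent([(), ("a",)], E, ALPHABET, target)
```

In this toy run the table over `S = [()]` is not closed because the word `("a",)` leads to a row that no word in `S` has; adding it to `S` yields a complete table with two distinct rows, one per state of the target machine.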

Angluin [3] proved that her L* algorithm is able to learn a finite state machine (incl. a Mealy machine) by asking a polynomial number of membership and equivalence queries (polynomial in the size of the corresponding minimal Mealy machine equivalent to $\mathcal{M}$). Let $|I|$ be the size of the input alphabet (observations), $n$ be the total number of states in the target Mealy machine, and $m$ be the maximum length of the counter-example provided for learning the machine. Then the correct machine can be produced by asking a number of queries that is polynomial in $|I|$, $n$ and $m$ [15].

In an ideal (theoretical) setting, the agent (learner) would ask a teacher whether $\mathcal{H}$ is correct (equivalence query), but in practice, this is typically not possible [18]. “Equivalence query can be approximated using a conformance testing (CT) tool [14] via a finite number of test queries (TQs). A test query asks for the response of the [system under learning] to an input sequence, similar to a membership query. If one of the test queries exhibits a counterexample then the answer to the equivalence query is no, otherwise the answer is yes” [18]. Vaandrager [18] cites Lee and Yannakakis [14], saying that a finite and complete conformance test suite does exist if we assume a bound on the number of states of a Mealy machine.
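The idea of replacing an equivalence query by finitely many test queries can be sketched as follows. The sketch uses the simplest possible test suite, namely every input word up to a fixed length, which is an assumption chosen for brevity and not the conformance testing method of [14]. Machines are represented as hypothetical triples `(initial, delta, out)`.

```python
from itertools import product
from typing import Dict, Hashable, List, Optional, Sequence, Tuple

Machine = Tuple[Hashable, Dict[tuple, Hashable], Dict[tuple, Hashable]]


def run(machine: Machine, inputs: Sequence[Hashable]) -> List[Hashable]:
    """Output sequence of a Mealy machine (initial, delta, out) on `inputs`."""
    state, delta, out = machine
    outputs = []
    for symbol in inputs:
        outputs.append(out[(state, symbol)])
        state = delta[(state, symbol)]
    return outputs


def approximate_equivalence_query(system: Machine, hypothesis: Machine,
                                  alphabet: Sequence[Hashable],
                                  max_len: int) -> Optional[tuple]:
    """Return a shortest distinguishing word of length <= max_len, or None.

    Each comparison is one test query. None means "no counter-example found
    within the bound", which is weaker than a proof of equivalence.
    """
    for length in range(1, max_len + 1):
        for word in product(alphabet, repeat=length):
            if run(system, word) != run(hypothesis, word):
                return word
    return None


# System under learning: outputs 1 exactly when "b" directly follows "a".
system = ("q0",
          {("q0", "a"): "q1", ("q0", "b"): "q0", ("q1", "a"): "q1", ("q1", "b"): "q0"},
          {("q0", "a"): 0, ("q0", "b"): 0, ("q1", "a"): 0, ("q1", "b"): 1})
# A one-state hypothesis that always outputs 0.
hypothesis = ("h0",
              {("h0", "a"): "h0", ("h0", "b"): "h0"},
              {("h0", "a"): 0, ("h0", "b"): 0})

counter_example = approximate_equivalence_query(system, hypothesis, ("a", "b"), 3)
```

The number of test queries grows exponentially with `max_len`, so this brute-force suite is only practical for small alphabets and short words.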

4 Modeling Non-Markovian Rewards

Figure 1: The agent has a task of finding the person with the treasure-map at m, then either (i) selling the map to a jeweller at j1 or j2 or (ii) collecting equipment at e or hiring a guide at g, then finding the treasure at t, and finally selling the treasure to one of the jewelers. The agent can continue process (ii) once it has the map. Default reward/cost . Blank cells contain observation by default.
Figure 2: A Mealy reward machine for the treasure map scenario. In the case that is in the input alphabet, every node has a self-loop with label ; not shown.

Running Example.

Consider, as a running example, an agent who might stumble upon a person with a map for a hidden treasure and some instructions on how to retrieve the treasure. The instructions imply that the agent purchase some specialized equipment before going to the cave marked on the treasure map. Alternatively, the agent may hire a guide who already has the necessary equipment. If the agent is lucky enough to find some treasure, the agent must sell it at the local jewelry traders. There are two jewelers. The agent can then restock its equipment or re-hire a guide, get some more treasure, sell it and so on. The agent could also sell the map to one of the jewelers without looking for the treasure. Unfortunately, the instructions are written in a coded language which the agent cannot read. However, the agent recognizes that the map is about a hidden treasure, which spurs it on to start treasure hunting in order to experience rewards and learn the missing information.

The reward behavior in this scenario is naturally modeled as an automaton where the agent receives a particular reward given a particular sequence of observations. There is presumably higher value in purchasing equipment for treasure hunting only after coming across a treasure map and thus deciding to look for a treasure. There is more value in being at the treasure with the right equipment than with no equipment, etc.

We shall interpret features of interest as observations: (obtaining) a map, (purchasing) equipment, (being at) a treasure, (being at) a jewelry trader, for example. Hence, for a particular sequence of input observations, the Mealy machine outputs a corresponding sequence of rewards. If the agent tracks its observation history, and if the agent has a correct automaton model, then it knows which reward to expect (the last symbol in the output) for its observation history as input. Figure 1 depicts the scenario in two dimensions. The underlying Mealy machine could be represented graphically as in Figure 2. It takes observations as inputs and supplies the relevant outputs as rewards. For instance, if the agent sees , (map), then , then , then the agent will receive rewards 10, then 60 and then 0. And if it sees the sequence , then it will receive the reward sequence .

We define intermediate states as states that do not signify a significant progress towards completion of a task. In Figure 1, all lighter-colored blank cells represent intermediate states. We assume that there is a default reward/cost an agent gets for entering intermediate states. This default reward/cost is fixed and denoted in general discussions. Similarly, the special null observation () is observed in intermediate states. An agent might or might not be designed to recognize intermediate states. If the agent cannot distinguish intermediate states from ‘significant’ states, then will be treated exactly like all other observations and it will have to learn transitions for observations (or worse, for all observations associated with intermediate states). If the agent can distinguish intermediate states from ‘significant’ states, then we can design methods to take advantage of this background knowledge. Our approximate active learning algorithm (Alg. 2) is an example of how the ability to recognize intermediate states can be taken advantage of.

Mealy Reward Machines

We introduce the Mealy Reward Machine to model non-Markovian rewards. These machines take a set $Z$ of observations representing high-level features that the agent can detect (equivalent to the set of input symbols for L*). A labeling function $\lambda: A \times S \to Z$ maps action-state pairs onto observations; $S$ is the set of nrMDP states and $Z$ is a set of observations. The meaning of $\lambda(a, s) = z$ is that $z$ is observed in state $s$ reached via action $a$. For Mealy Reward Machines, the output symbols for L* are equated with rewards in $\mathbb{R}$.

Definition 4.1 (Mealy Reward Machine)

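Given a set of states $S$, a set of actions $A$ and a labeling function $\lambda: A \times S \to Z$, a Mealy Reward Machine (MRM) is a tuple $\mathcal{R} = \langle U, u_0, Z, \delta_u, \delta_r \rangle$, where

  • $U$ is a finite set of MRM nodes,

  • $u_0 \in U$ is the start node,

  • $Z$ is a set of observations,

  • $\delta_u: U \times Z \to U$ is the transition function, such that $\delta_u(u_i, z) = u_j$ for $u_i, u_j \in U$ and $z \in Z$,

  • $\delta_r: U \times Z \to \mathbb{R}$ is the reward-output function, such that $\delta_r(u_i, z) = r$ for $u_i \in U$, $z \in Z$ and $r \in \mathbb{R}$.

We may write $\delta_u^{\mathcal{R}}$ and $\delta_r^{\mathcal{R}}$ to emphasize that the functions are associated with MRM $\mathcal{R}$.

A Markov Decision Process with a Mealy reward machine is thus an NMRDP. In Figure 2, an edge from node $u_i$ to node $u_j$ labeled with an observation $z$ and a reward $r$ denotes that $\delta_u(u_i, z) = u_j$ and $\delta_r(u_i, z) = r$.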
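The interplay of labeling function and reward machine can be sketched in a few lines. The fragment below feeds a sequence of (action, state) pairs through a labeling function and a two-node MRM. The machine, the cell names, the observation symbols `m`, `j`, `null` and all reward values are hypothetical and much smaller than the machine of Figure 2.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Node = str
Obs = str


class MealyRewardMachine:
    """MRM with start node u0 and lookup tables for delta_u and delta_r."""

    def __init__(self, start: Node,
                 delta_u: Dict[Tuple[Node, Obs], Node],
                 delta_r: Dict[Tuple[Node, Obs], float]) -> None:
        self.start = start
        self.delta_u = delta_u
        self.delta_r = delta_r

    def step(self, node: Node, obs: Obs) -> Tuple[Node, float]:
        """Return the next node and the reward emitted for `obs` in `node`."""
        return self.delta_u[(node, obs)], self.delta_r[(node, obs)]


def reward_trace(mrm: MealyRewardMachine,
                 labeling: Callable[[str, str], Obs],
                 steps: Sequence[Tuple[str, str]]) -> List[float]:
    """Rewards for a sequence of (action, reached state) pairs, starting at u0."""
    node, rewards = mrm.start, []
    for action, state in steps:
        obs = labeling(action, state)       # observation made on entering `state`
        node, reward = mrm.step(node, obs)
        rewards.append(reward)
    return rewards


def labeling(action: str, state: str) -> Obs:
    """Hypothetical labeling: two special cells, everything else is 'null'."""
    return {"cell_m": "m", "cell_j": "j"}.get(state, "null")


# u0: no map yet; u1: map in hand. Selling the map at the jeweler pays once.
mrm = MealyRewardMachine(
    start="u0",
    delta_u={("u0", "m"): "u1", ("u0", "j"): "u0", ("u0", "null"): "u0",
             ("u1", "m"): "u1", ("u1", "j"): "u0", ("u1", "null"): "u1"},
    delta_r={("u0", "m"): 10.0, ("u0", "j"): 0.0, ("u0", "null"): -1.0,
             ("u1", "m"): 0.0, ("u1", "j"): 50.0, ("u1", "null"): -1.0},
)
steps = [("east", "cell_m"), ("east", "corridor"),
         ("north", "cell_j"), ("south", "cell_j")]
rewards = reward_trace(mrm, labeling, steps)  # [10.0, -1.0, 50.0, 0.0]
```

The last two steps both enter `cell_j`, yet they yield different rewards, because the machine's node records whether the map is still in hand. A reward function of the MDP state alone could not express this.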

In the following definitions, let , , , and . An interaction trace of length in an MDP represents an agent’s (recent) interactions with the environment. It has the form and is denoted . That is, if an agent performs an action at time in a state at time , then it reaches the next state at time where/when it receives a reward. An observation trace is extracted from an interaction trace (employing a labeling function) and is taken as input to an MRM. It has the form and is denoted . A reward trace is either extracted from an interaction trace or is given as output from an MRM. It has the form and is denoted . A history has the form and is denoted . We extend to take histories by defining inductively as

explains how an MRM assigns rewards to a history in an MDP.

A (deterministic) strategy to play in an nrMDP is a function that associates with every finite sequence of states of the nrMDP the action to play.

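Given an infinite reward trace $\sigma_r = r_1 r_2 \cdots$ and a discount factor $\gamma \in [0, 1)$, discounted sum and mean payoff are defined as

$$DS(\sigma_r) = \sum_{i=1}^{\infty} \gamma^{i-1} r_i \qquad \text{and} \qquad MP(\sigma_r) = \liminf_{k \to \infty} \frac{1}{k} \sum_{i=1}^{k} r_i.$$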

Let be the discounted sum or mean payoff of an infinite reward trace generated by reward model . Then the expected discounted sum and mean payoff under strategy played in MDP from state is denoted as .
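As a small worked example, the two payoff measures can be evaluated on a finite prefix of a reward trace. The sketch assumes the standard definitions: the discounted sum weights the $i$-th reward by $\gamma^{i-1}$, and the mean payoff is the long-run average reward, approximated here by the average over the prefix. The function names and the discount factor 0.9 are assumptions; the prefix 10, 60, 0 reuses the reward values mentioned for the running example.

```python
from typing import Sequence


def discounted_sum(rewards: Sequence[float], gamma: float) -> float:
    """Discounted sum of a finite reward prefix r_1 ... r_k with factor gamma."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return sum((gamma ** i) * r for i, r in enumerate(rewards))


def mean_payoff_prefix(rewards: Sequence[float]) -> float:
    """Average reward over a finite, non-empty prefix."""
    if not rewards:
        raise ValueError("prefix must be non-empty")
    return sum(rewards) / len(rewards)


prefix = [10.0, 60.0, 0.0]
ds = discounted_sum(prefix, gamma=0.9)   # 10 + 0.9 * 60 + 0.81 * 0
mp = mean_payoff_prefix(prefix)          # (10 + 60 + 0) / 3
```

For an infinite trace the prefix average only approximates the mean payoff, whereas the discounted sum of a prefix differs from the infinite sum by a tail that shrinks geometrically when rewards are bounded.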

Being able to produce a traditional, immediate reward MDP from a non-Markovian reward decision process is clearly beneficial: one can then apply all the methods developed for MDPs to the underlying NMRDP, whether to find optimal or approximate solutions. We produce an MDP from a non-reward MDP (nrMDP) and a Mealy reward machine by taking their product as defined next.

Definition 4.2

Given an nrMDP , a labeling function and an MRM , we define the synchronized product of and under as an (immediate reward) MDP , where , , if , else , , and .
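The construction can be sketched as follows, under one explicit reading of the definition that is an assumption of this sketch: the observation is made in the state that is entered, the machine node is updated with that observation, and the immediate reward of a product state-action pair is the expected machine reward over the possible successors. The function name `synchronized_product`, the dictionary encodings and the toy numbers are all hypothetical.

```python
from typing import Callable, Dict, Tuple

State = str
Node = str
Action = str
ProdState = Tuple[State, Node]


def synchronized_product(
    T: Dict[Tuple[State, Action], Dict[State, float]],
    s0: State,
    labeling: Callable[[Action, State], str],
    u0: Node,
    delta_u: Dict[Tuple[Node, str], Node],
    delta_r: Dict[Tuple[Node, str], float],
):
    """Return (T_prod, R_prod, initial) of the product of an nrMDP and an MRM."""
    nodes = {u for (u, _obs) in delta_u}
    T_prod: Dict[Tuple[ProdState, Action], Dict[ProdState, float]] = {}
    R_prod: Dict[Tuple[ProdState, Action], float] = {}
    for (s, a), dist in T.items():
        for u in nodes:
            successors: Dict[ProdState, float] = {}
            expected_reward = 0.0
            for s2, p in dist.items():
                obs = labeling(a, s2)          # observation on entering s2 via a
                u2 = delta_u[(u, obs)]         # the machine moves in lockstep
                successors[(s2, u2)] = successors.get((s2, u2), 0.0) + p
                expected_reward += p * delta_r[(u, obs)]
            T_prod[((s, u), a)] = successors
            R_prod[((s, u), a)] = expected_reward
    return T_prod, R_prod, (s0, u0)


# Toy nrMDP: "go" reaches the shop with probability 0.8.
T = {("start", "go"): {"shop": 0.8, "start": 0.2},
     ("shop", "go"): {"shop": 1.0}}
# Toy MRM: the first visit to the shop (observation "j") pays 5, later visits pay 0.
delta_u = {("u0", "j"): "u1", ("u0", "null"): "u0",
           ("u1", "j"): "u1", ("u1", "null"): "u1"}
delta_r = {("u0", "j"): 5.0, ("u0", "null"): 0.0,
           ("u1", "j"): 0.0, ("u1", "null"): 0.0}


def shop_labeling(action: Action, state: State) -> str:
    return "j" if state == "shop" else "null"


T_prod, R_prod, init = synchronized_product(T, "start", shop_labeling,
                                            "u0", delta_u, delta_r)
```

Because the machine node is part of the product state, the resulting reward table depends only on the current product state and action, so standard MDP solvers apply. The sketch enumerates all state-node pairs, including unreachable ones, which keeps the code short at the cost of some wasted entries.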

The strategies in and are in bijection. The following proposition states that the expected value of an nrMDP with an MRM is equal to the expected value of their product under the same strategy.

Proposition 4.1

Given an nrMDP , a labeling function and an MRM , for all strategies for , we have that

Because memoryless strategies are sufficient to play optimally in order to maximize in immediate reward MDPs, and together with Proposition 4.1, we conclude that we can play optimally in under in a finite memory strategy (the memory required for ; but memoryless if viewed as ).

5 Learning Mealy Reward Machines

We make the important assumption that the environment and the agent’s state can be reset. Recall the Treasure-Map world. After receiving the treasure map, the agent might not find a guide or equipment. In general, an agent might not finish a task, or might finish only one version of a task or subtask. Our reset assumption allows the agent to receive the map again and explore various trajectories in order to learn the entire task with all its variations. Resetting also sets the underlying and hypothesized MRMs to their start nodes. Of course, the agent retains the hypothesized MRM learnt so far. Resetting a system is not always feasible; however, we are interested in domains where an agent can be placed in a position to continue learning or repeat its task.

5.1 Learning and Exploitation

Problem Statement.

Given an nrMDP , a labeling function , an unknown MRM which needs to be learned, and a threshold , learn whenever possible, a finite memory strategy such that .

There are several ways to determine . One straightforward way is to deploy the agent in the environment in a pre-trial phase and set to the highest return achieved. Another way is to set dynamically: let be the highest return observed so far; set , where is the a measure of the change in and is a weighting factor related to the confidence that will keep changing by .

We observe that although an agent may find an optimal strategy for which , if the MDP is not strongly connected, the agent could end up in a region of the state space where it is impossible to gain rewards greater than . For this reason, we shall make the agent reset if it has not yet received

or more rewards after a fixed number of actions executed in an epoch. A high-level description of the process for solving the problem follows. For learning

, we consider an active learning scenario in which the agent can

  1. play a finite number of episodes to answer membership queries until the observation tree (OT) is complete by extracting reward traces from the appropriate interaction traces;

  2. construct new hypothesized MRM from OT and compute the optimal strategy for , as soon as OT becomes complete;

  3. start performing actions to complete its task, using until a counter-example to is discovered (experienced) if the expected value of is greater than , in which case, stops exploitation and go to step 1 (the learning phase); else use conformance testing techniques to refute if the expected value of is less than , and go to step 1;

  4. restart its task if a given number of actions have been played and its total rewards (w.r.t. ) is still less than .

Algorithm 1 is the general, high-level algorithm for an agent actively learning an MRM in an MDP, and acting optimally with respect to the currently hypothesized MRM. In the algorithm, “alive” implies that there is no specific stopping criterion. In practice, the user of the system/agent will specify when the system/agent should stop collecting rewards. Procedures getMQ, resolveMQ, buildRewardMachine and addCounterExample are part of the Mealy machine inference algorithm, which we assume to be predefined.

The getExperience procedure hides much important detail. In this procedure, the agent performs actions in order to answer the membership query, starting from its current state. There are potentially many ways for the agent to behave, from performing random exploration to seeking an optimal strategy to encounter the sequence of observations represented by the query. We leave the investigation of guarantees for strategies for encountering a given observation sequence in an MDP from a given state for the future. In the next section, we propose one reasonable, approximate strategy.

  Initialize observation table
  while alive do
     if  is not complete then
        if  then
              Increment actionsExecuted by 1 
              Update with
              if  and actsExtd ActsToExt then
              end if
           until  is a counter example to
        end if
     end if
  end while
Algorithm 1 Optimal Active Learning

Procedure conformanceTesting uses conformance testing techniques to obtain an interaction trace that shows a discrepancy between the underlying and , and use to resolve

On guarantees offered by our learning setting

The main guarantees offered by the building blocks of our algorithm are as follows.

First, let us comment on the guarantees offered by the L* algorithm in our particular context. As we already mentioned, the realization of membership queries through the execution of possibly multiple episodes in the MDP is already a non-trivial task that cannot be realized with absolute certainty. Indeed, if L* asks, as a membership query, for the reward obtained after a given sequence of observations, we need to consider two cases. This sequence may not exist in the MDP; in that case, we can provide L* with an arbitrary reward because this reward will never be used when considering how to play in the MDP. If such a sequence does exist, consider a sequence of actions that maximizes the probability of observing it from the initial state of the MDP. If this probability is less than one, repeating the action sequence (with resets) does guarantee that we will eventually observe the observation sequence and thus its associated reward, but only with probability one and not absolute certainty. Also, while in principle conformance testing is complete when a bound on the number of states is known for the underlying MRM, in our setting the completeness can only be ensured with probability one and not absolute certainty as in the classical setting.
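The "probability one but not certainty" point can be illustrated with a sketch that answers a membership query by repeating a fixed action sequence, resetting after every failed attempt. The environment interface `env_step`, the function `answer_membership_query`, the episode cap and the toy environment with success probability 0.3 are assumptions made for illustration; they are not the paper's getExperience procedure.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

StepFn = Callable[[str, str, random.Random], Tuple[str, str, float]]


def answer_membership_query(env_step: StepFn, initial_state: str,
                            actions: Sequence[str], obs_seq: Sequence[str],
                            rng: random.Random,
                            max_episodes: int = 10_000
                            ) -> Tuple[Optional[List[float]], int]:
    """Repeat `actions` from the initial state until `obs_seq` is observed.

    Returns (reward sequence, episodes used), or (None, max_episodes) if the
    cap is hit. The cap exists because success is only almost sure.
    """
    if len(actions) != len(obs_seq):
        raise ValueError("need exactly one action per queried observation")
    for episode in range(1, max_episodes + 1):
        state, rewards = initial_state, []      # reset the environment
        for action, wanted in zip(actions, obs_seq):
            state, obs, reward = env_step(state, action, rng)
            if obs != wanted:
                break                           # wrong observation: give up on this episode
            rewards.append(reward)
        else:
            return rewards, episode             # every observation matched
    return None, max_episodes


def toy_step(state: str, action: str, rng: random.Random) -> Tuple[str, str, float]:
    """Hypothetical environment: searching finds the map 30% of the time."""
    if state == "start" and action == "search":
        if rng.random() < 0.3:
            return "map_cell", "m", 10.0
        return "start", "null", -1.0
    return state, "null", -1.0


rewards, episodes = answer_membership_query(
    toy_step, "start", actions=["search"], obs_seq=["m"], rng=random.Random(0))
```

With a per-episode success probability $p > 0$, the chance of still having no answer after $n$ episodes is $(1 - p)^n$, which tends to zero but is never exactly zero for $p < 1$. That is why any implementation needs either a cap, as here, or an open-ended loop.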

Second, the computation of optimal strategies for the product of the MDP with the hypothesized MRM, once a hypothesis has been formulated by the learning algorithm, can be solved exactly, both for the discounted sum and mean reward functions (using, e.g., model-checking algorithms [5]). As a consequence, if the learning part of our algorithm returns a hypothesis that is correct for every sequence of observations that can be executed in the MDP, then our algorithm computes an optimal strategy with certainty.

5.2 Approximate Learning and Exploitation

In this section, we propose an active learning process for an NMRDP agent with a Mealy reward machine, where the strategies played are likely suboptimal. This approach is more suited to domains where optimality (e.g. a guarantee that the expected value reaches the threshold) is not important, but where reactiveness is more important. It is also the algorithm (Alg. 2) we use for evaluation in Section 6.

Action Planning for Answering Membership Queries

Something we have not yet discussed in detail is what strategy an agent follows to answer a membership query, that is, how procedure getExperience in the algorithms is implemented. We now describe how the procedure is implemented for Algorithm 2. Whenever the agent receives a membership query $\sigma_z = z_1 z_2 \cdots z_k$, it starts planning and executing actions to reach a state $s_1$ where the first observation $z_1$ in $\sigma_z$ is made. The reward $r_1$ received for entering $s_1$ is recorded. Immediately, the agent plans and executes actions to reach a state $s_2$ where the second observation $z_2$ is made. The reward $r_2$ received for entering $s_2$ is recorded. This process continues until $z_k$ has been observed and $r_k$ has been received (and recorded). The reward sequence $r_1 r_2 \cdots r_k$ is then given to L* (via resolveMQ) as the answer to the current membership query.

in Algorithm 2 is instantiated as Monte Carlo Tree Search [7] planning. The planner employs reward function

where is the observation current being pursued by the agent and .

Action Planning for Exploitation

Once a hypothesis reward model is available, the agent can stop learning and start exploiting the model.

Assume that all the agent’s tasks can be accomplished within non-null observations. Let be all non-null observation sequences of length and be all reward sequences of length . We seek the sequence of observations that will maximize the sum of rewards it induces in . Let be all histories with action-state pairs. Let . We mean by the observation sequence induced by history . That is, . Then, we seek where This is the sequence returned by getGoodObsSeq in Algorithm 2.
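The search for a good observation sequence can be sketched by brute force: enumerate every sequence of a fixed length, replay it on the hypothesized machine, and keep the one with the largest total reward. The sketch simplifies the description above in one respect, stated here as an assumption: it does not check whether a sequence is induced by some history of the MDP. The function name `get_good_obs_seq`, the default treatment of missing table entries and the toy machine with its rewards are hypothetical.

```python
from itertools import product
from typing import Dict, Optional, Sequence, Tuple

Node = str
Obs = str


def get_good_obs_seq(u0: Node,
                     delta_u: Dict[Tuple[Node, Obs], Node],
                     delta_r: Dict[Tuple[Node, Obs], float],
                     observations: Sequence[Obs],
                     k: int) -> Tuple[Optional[Tuple[Obs, ...]], float]:
    """Return the length-k observation sequence with maximal summed reward.

    Missing table entries are treated as zero-reward self-loops (assumption).
    """
    best_seq, best_total = None, float("-inf")
    for seq in product(observations, repeat=k):
        node, total = u0, 0.0
        for obs in seq:
            total += delta_r.get((node, obs), 0.0)
            node = delta_u.get((node, obs), node)
        if total > best_total:
            best_seq, best_total = seq, total
    return best_seq, best_total


# Hypothetical machine: map (m), equipment (e), treasure (t), jeweler (j).
delta_u = {("u0", "m"): "u1", ("u1", "j"): "u0", ("u1", "e"): "u2",
           ("u2", "t"): "u3", ("u3", "j"): "u1"}
delta_r = {("u0", "m"): 10.0, ("u1", "j"): 20.0, ("u1", "e"): 0.0,
           ("u2", "t"): 60.0, ("u3", "j"): 100.0}

best = get_good_obs_seq("u0", delta_u, delta_r, ("m", "e", "t", "j"), k=4)
```

The enumeration visits $|Z|^k$ sequences, so it is only workable for the short horizons and small observation sets of the toy domains. Like the sequence described above, the result ignores transition probabilities and is therefore not guaranteed to be optimal in the MDP.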

With in hand, the agent simply plans (using ) and acts in order to experience . The sequence is not guaranteed to be optimal because it does not consider state transition probabilities. But it is sufficient for our implementation for our proof of concept evaluation. If, while exploiting, the agent experiences a trace that contradicts the hypothesized reward model, it should go back to the learning phase.

  Initialize observation table
  while alive do
     if  is not complete then
           for  do
                 Increment actsExtd by 1 
                 Update with
              until  or actsExtd actsToExt
           end for
        until  is a counter example to and actsExtd actsToExt
        if actsExtd actsToExt then
        end if
     end if
  end while
Algorithm 2 Approximate Active Learning

6 Experimental Evaluation

Approximate active learning (Alg. 2) was implemented.

6.1 Experiments with the Treasure-Map world

For learning and planning in the Treasure-Map world, the agent maintains an interaction-trace of length 6 (i.e., ). When Monte Carlo Tree Search is used, trajectories are simulated for 30 actions, and there are 100 trajectories per action being planned for. Every trial allows an agent 2000 exploitation actions; actions required for answering membership queries are free.

The agent can move north, east, west and south, one cell per action. To add stochasticity to the transition function, the agent is made to get stuck a percentage of the time when executing an action. We abbreviate the precision factor of actions as APF; for instance, if APF = , then the agent gets stuck of the time. The default reward is set to (the agent gets when performing an action and seeing ). This is also the output/reward for all self-looping transitions in the hypothesis MRM.

We measure the total rewards gained per trial (Return), number of attempts to membership queries (MQAs), number of counter-examples encountered (CEs), total amount of time used for learning (LT; in seconds), number of exploration epochs/resets per trial (Epochs) and total amount of time used for exploitation (ET). Tables 1 and 2 show the average results over ten trials of each of three choices for the action precision factor.

The results in Table 1 focus on learning. The exploitation strategy is to perform random actions. The return is thus not important for this experiment but is used as baseline for the experiments focusing on exploitation. In every trial, the MRM in Figure 2 is correctly learnt.

The results in Table 2 focus on exploitation. The exploitation strategy is to select actions with Monte Carlo Tree Search (MCTS) planning as described in Section 5.2. The return is thus important. Although the return varies with the action precision factor, the MRM in Figure 2 is correctly learnt in every trial. This is because the learning process depends mostly on answering membership queries; these queries are always answered, even if it takes longer to answer a query due to low APF.

Table 1: Results focusing on the learning phase. Columns: APF, MQAs, LT (s), CEs, Return.
Table 2: Results focusing on the exploitation phase. Columns: APF, Return, ET (s), Epochs, CEs.

6.2 Learning in the Cookie Domain

Figure 3: The Cookie Domain. The button in yellow room causes a cookie to appear randomly either in the blue or red room. Agent: purple triangle.
Figure 4: The Cookie Domain (Mealy) reward machine.

The Cookie Domain of [17] is depicted in Figure 3. The agent can press a button in the yellow room, which causes a cookie to appear in either the blue or the red room, chosen at random. Whenever the button is pressed, a fresh cookie appears at random. The agent receives a reward of 1 for eating the cookie. The problem is a partially observable MDP (POMDP) because the agent cannot see where the cookie is (if there is one) until it is in the same room as the cookie.

In the original problem [17], the agent can move in the four cardinal directions and push the button. When the agent enters a room containing a cookie, it automatically eats it. We added an explicit eat action, which leaves crumbs in the room if the agent is in the same room as a cookie. All actions and observations are deterministic. The authors of [17] use a set of propositions, which we interpret as observations. However, our particular observations are , meaning the agent is in the empty blue room, the blue room with a cookie, the blue room with crumbs, the empty red room, the red room with a cookie, the red room with crumbs, the yellow room with the button not pressed down and the yellow room with the button depressed.

We defined our labeling function to simply report what the case is in state (room color, cookie, crumbs), and if the agent is in the yellow room and is , then is returned, else is returned. The function returns if the agent is in the passage between the rooms.
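A labeling function of this kind might look like the sketch below. The state representation and the observation names (`blue_empty`, `yellow_pressed`, and so on) are hypothetical stand-ins for the symbols used in the paper. The sketch further assumes that a cookie takes precedence over crumbs and that the passage yields no observation, represented here as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CookieState:
    """Hypothetical state of the Cookie Domain."""
    room: str                           # "blue", "red", "yellow" or "passage"
    cookie_room: Optional[str] = None   # room currently holding a cookie, if any
    crumbs: frozenset = frozenset()     # rooms that contain crumbs
    button_pressed: bool = False


def label(s: CookieState) -> Optional[str]:
    """Report what is the case in state s as a single observation."""
    if s.room == "passage":
        return None  # assumption: no observation between the rooms
    if s.room == "yellow":
        return "yellow_pressed" if s.button_pressed else "yellow_unpressed"
    if s.cookie_room == s.room:
        return s.room + "_cookie"  # assumption: a cookie takes precedence over crumbs
    if s.room in s.crumbs:
        return s.room + "_crumbs"
    return s.room + "_empty"
```

The function is deterministic and yields exactly eight distinct non-null observations, matching the eight cases listed for the rooms.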

The authors of [17] provide the reward machine depicted in Figure 4 as a “perfect RM”. We used it as the underlying MRM for our experiment. For this experiment, we implemented getExperience simply as random exploration. We performed ten trials. Our agent learns the “perfect (M)RM” correctly every time, requiring approximately one second per trial.

7 Conclusion

We proposed two frameworks for learning and exploiting non-Markovian reward models. The reward models are represented as Mealy machines. Angluin’s L* algorithm was employed to learn Mealy Reward Machines within a Markov Decision Process setting. One framework was justified theoretically, while the other was based on an actual implementation that was used for evaluation and as a proof of concept. We found that the latter framework always learns the toy examples correctly and within a reasonable time, after answering a finite number of membership queries posed by the algorithm.

In a (fully observable) MDP, an MRM can be used to avoid detrimental situations that cannot be avoided in an MDP with a traditional immediate reward function. Observe that, in a sense, an MRM maps a sequence of actions and visited states to a particular reward; this cannot be done with a traditional immediate reward function. Consider, for instance, the state in which the battery pack of an electric vehicle (EV) has just been assembled. Imagine that there is a crucial process that must be avoided, or else the battery pack could explode during use in the EV. Assembly sequences that include this dangerous process could be assigned a very low reward. If it can be shown that every sequence involving the bad process results in a reward less than some threshold, then any sequence resulting in a reward greater than the threshold is guaranteed to be safe.
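The battery-pack argument can be made concrete with a small Mealy reward machine. The sketch below is purely illustrative: the node names, observation names, reward values and threshold are all hypothetical, and the default reward of 0 on unlisted (self-looping) transitions is an assumption.

```python
class MealyRewardMachine:
    """Minimal Mealy machine mapping observation sequences to reward sequences."""

    def __init__(self, initial, transitions, default_reward=0.0):
        self.initial = initial
        # transitions: (node, observation) -> (next_node, reward)
        self.transitions = transitions
        self.default_reward = default_reward

    def run(self, observations):
        """Return the list of rewards emitted while reading the observations."""
        node, rewards = self.initial, []
        for obs in observations:
            # Unlisted pairs self-loop and emit the default reward.
            node, reward = self.transitions.get(
                (node, obs), (node, self.default_reward)
            )
            rewards.append(reward)
        return rewards


# Hypothetical assembly MRM: once the bad process is seen, the machine
# remembers it and penalises the finished battery pack.
mrm = MealyRewardMachine(
    initial="clean",
    transitions={
        ("clean", "bad_process"): ("tainted", 0.0),
        ("clean", "assembled"): ("clean", 10.0),
        ("tainted", "assembled"): ("tainted", -100.0),
    },
)

THRESHOLD = 0.0  # hypothetical safety threshold
safe_run = ["weld", "seal", "assembled"]
unsafe_run = ["weld", "bad_process", "seal", "assembled"]

print(sum(mrm.run(safe_run)) > THRESHOLD)    # True
print(sum(mrm.run(unsafe_run)) > THRESHOLD)  # False
```

The final observation `assembled` is identical in both runs, so an immediate reward function that sees only the current state could not tell them apart; the machine's node carries the relevant history.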

In future, we would like to implement the framework in a more principled way, using the optimal approach as a starting point. We expect that larger underlying reward machines will require intelligent exploration strategies. It will be interesting to investigate the exploration-exploitation trade-off in the setting of non-Markovian rewards. We would also like to compare our work more closely to that of [17], but at the time of this research, their implementation was not available.


  • [1] M. Alshiekh, R. Bloem, R. Ehlers, B. Könighofer, S. Niekum, and U. Topcu, ‘Safe reinforcement learning via shielding’, in Proceedings of the Thirty-Second AAAI Conf. on Artif. Intell. (AAAI-18), pp. 2669–2678. AAAI Press, (2018).
  • [2] C. Amato, B. Bonet, and S. Zilberstein, ‘Finite-state controllers based on Mealy machines for centralized and decentralized POMDPs’, in Proceedings of the Twenty-Fourth AAAI Conf. on Artif. Intell. (AAAI-10), pp. 1052–1058. AAAI Press, (2010).
  • [3] D. Angluin, ‘Learning regular sets from queries and counterexamples’, Information and Computation, 75(2), 87–106, (1987).
  • [4] F. Bacchus, C. Boutilier, and A. Grove, ‘Rewarding behaviors’, in Proceedings of the Thirteenth Natl. Conf. on Artif. Intell., pp. 1160–1167. AAAI Press, (1996).
  • [5] C. Baier and J.-P. Katoen, Principles of Model Checking, MIT Press, 2008.
  • [6] R. Brafman, G. De Giacomo, and F. Patrizi, ‘LTL/LDL non-Markovian rewards’, in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pp. 1771–1778. AAAI Press, (2018).
  • [7] C. Browne, E. Powley, D. Whitehouse, S. Lucas, P. Cowling, S. Tavener, D. Perez, S. Samothrakis, and S. Colton, ‘A survey of Monte Carlo tree search methods’, IEEE Transactions on Computational Intelligence and AI, (2012).
  • [8] A. Camacho, O. Chen, S. Sanner, and S. McIlraith, ‘Non-Markovian rewards expressed in LTL: Guiding search via reward shaping (extended version)’, in Proceedings of the First Workshop on Goal Specifications for Reinforcement Learning, FAIM 2018, (2018).
  • [9] A. Camacho, R. Toro Icarte, T. Klassen, R. Valenzano, and S. McIlraith, ‘LTL and beyond: Formal languages for reward function specification in reinforcement learning’, in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 6065–6073, (2019).
  • [10] G. De Giacomo, M. Favorito, L. Iocchi, and F. Patrizi, ‘Foundations for restraining bolts: Reinforcement learning with LTL/LDL restraining specifications’, in Proceedings of the Twenty-Ninth International Conference on Automated Planning and Scheduling (ICAPS-19), pp. 128–136. AAAI Press, (2019).
  • [11] R. Toro Icarte, T. Klassen, R. Valenzano, and S. McIlraith, ‘Teaching multiple tasks to an RL agent using LTL’, in Proceedings of the Seventeenth Intl. Conf. on Autonomous Agents and Multiagent Systems, AAMAS-18, pp. 452–461. International Foundation for Autonomous Agents and Multiagent Systems, (2018).
  • [12] R. Toro Icarte, T. Klassen, R. Valenzano, and S. McIlraith, ‘Using reward machines for high-level task specification and decomposition in reinforcement learning’, in Proceedings of the Thirty-Fifth Intl. Conf. on Machine Learning, ICML-18, pp. 452–461, (2018).
  • [13] J. Křetínský, G. Pérez, and J.-F. Raskin, ‘Learning-based mean-payoff optimization in an unknown MDP under omega-regular constraints’, in Proceedings of the Twenty-Ninth Intl. Conf. on Concurrency Theory (CONCUR-18), pp. 1–8, Schloss Dagstuhl, Germany, (2018). Dagstuhl.
  • [14] D. Lee and M. Yannakakis, ‘Principles and methods of testing finite state machines - a survey’, Proceedings of the IEEE, 84(8), 1090–1123, (Aug 1996).
  • [15] M. Shahbaz and R. Groz, ‘Inferring Mealy machines’, in Proceedings of the International Symposium on Formal Methods (FM-09), eds., A. Cavalcanti and D. Dams, number 5850 in LNCS, 207–222, Springer-Verlag, Berlin Heidelberg, (2009).
  • [16] S. Thiébaux, C. Gretton, J. Slaney, D. Price, and F. Kabanza, ‘Decision-theoretic planning with non-Markovian rewards’, Artif. Intell. Research, 25, 17–74, (2006).
  • [17] R. Toro Icarte, E. Waldie, T. Klassen, R. Valenzano, M. Castro, and S. McIlraith, ‘Searching for Markovian subproblems to address partially observable reinforcement learning’, in Proceedings of the Fourth Multi-disciplinary Conference on Reinforcement Learning and Decision Making (RLDM-19). Proceedings of RLDM, (2019).
  • [18] F. Vaandrager, ‘Model Learning’, Communications of the ACM, 60(2), 86–96, (2017).