Learning Factored Markov Decision Processes with Unawareness

by Craig Innes, et al.

Methods for learning and planning in sequential decision problems often assume the learner is aware of all possible states and actions in advance. This assumption is sometimes untenable. In this paper, we give a method to learn factored Markov decision processes from both domain exploration and expert assistance, which guarantees convergence to near-optimal behaviour, even when the agent begins unaware of factors critical to success. Our experiments show our agent learns optimal behaviour on small and large problems, and that conserving information on discovering new possibilities results in faster convergence.






1 Introduction

Factored Markov decision processes (FMDPs) are a fundamental tool for modelling complex sequential problems. When the transition and reward functions of an FMDP are known in advance, there are tractable methods to learn its optimal policy via dynamic programming [Guestrin et al., 2003]. When these components are unknown, methods exist to jointly learn a structured model of the transition and reward functions [Degris et al., 2006, Araya-López et al., 2011]. Yet all such methods (with the exception of Rong [2016]) assume that the way the domain of the problem is conceptualized—the possible actions available to the agent and the belief variables that describe the state space—is completely known in advance of learning. In many scenarios, this assumption does not hold.

For example in medicine, suppose an agent prescribes a particular drug, but later a senior pharmacologist objects to the prescription based on a reason unforeseen by the agent—the patient carries a newly discovered genetic trait, and the drug produces harmful side effects in its carriers. Further, this discovery may occur after the agent has already learned a lot about how other (foreseen) factors impact the drug’s effectiveness. As Coenen et al. [2017] point out, such scenarios are common in human discussion—the answer to a person’s inquiry may not only provide information about which of the questioner’s existing hypotheses are likely, but may also reveal entirely new hypotheses not yet considered. This example also shows that while it may be infeasible for an agent to gather all relevant factors of a problem before learning, it may be easy for an expert to offer contextually relevant corrective advice during learning. Another example is in robotic skill learning. Methods such as Cakmak and Thomaz [2012] enable an expert to teach a robot how to perform a new action, but don’t teach when it’s optimal to use it. In lifelong-learning scenarios, we want to integrate new skills into existing decision tasks without forcing the robot to restart learning each time.

Current models of learning and decision making don’t address these issues; they assume the task is to use data to refine a distribution over a fixed hypothesis space. Under this framework, any change to the set of possible hypotheses constitutes an unrelated problem. The above examples, however, illustrate a sort of reverse Bayesianism [Karni and Viero, 2013], where the hypothesis space itself expands over time.

Instead of overcoming unawareness of states and actions, we could just represent unawareness as an infinite number of hidden states by modelling the problem as an infinite partially observable Markov decision process (iPOMDP) [Doshi-Velez, 2009]. This approach has several drawbacks. First, iPOMDPs don’t currently address what to do when an unforeseen action is discovered. More importantly, since the hidden variables are not tied to grounded concepts with explicit meaning, it is difficult for an agent to justify its decisions to a user, or to articulate queries about its current understanding of the world so as to solicit help from an expert.

We instead propose a system where an agent makes explicit attempts to overcome its unawareness while constructing an interpretable model of its environment. This paper makes three contributions: First, an algorithm which incrementally learns all components of an FMDP. This includes the transition, reward, and value functions, but also the set of actions and belief variables themselves (Section 3). Second, an expert-agent communication protocol (Section 3.1) which interleaves contextual advice with learning, and guarantees our agent converges to near-optimal behaviour, despite beginning unaware of factors critical to success. Third, experiments on small and large sequential decision problems showing our agent successfully learns optimal behaviour in practice (Section 4).

2 The Learning Task

We focus on learning episodic, finite-state FMDPs with discrete states and actions. We begin with the formalisms for learning optimal behaviour in FMDPs where the agent is fully aware of all possible states and actions. We then extend the task to one where the agent starts unaware of relevant variables and actions, and show how the agent overcomes this unawareness with expert aid.

2.1 Episodic Markov Decision Processes

An MDP is a tuple $(S, A, I, G, T, R)$, where $S$ and $A$ are the set of states and actions; $I \subseteq S$ and $G \subseteq S$ are the possible start and end (terminal) states of an episode; $T(s, a, s') = P(s' \mid s, a)$ is the Markovian transition function, and $R : S \to \mathbb{R}$ is the immediate reward function. (In this paper, we assume the agent’s preferences depend only on the current state, and are both deterministic and stationary; other works allow $R$ to depend on the action and/or resulting state, i.e., $R(s, a, s')$.) A policy $\pi(s, a)$ gives the probability that an agent will take action $a$ in state $s$. When referring to the local time $t$ in episode $e$, we denote the current state and reward by $s^e_t$ and $r^e_t$. When referring to the global time $t$ across episodes, we denote them by $s_t$ and $r_t$.

The discounted return for episode $e$ is $g^e = \sum_t \gamma^t r^e_t$, where $\gamma \in [0, 1)$ is the discount factor governing how strongly the agent prefers immediate rewards. The agent’s goal is to learn the optimal policy $\pi^*$, which maximizes the expected discounted return in all states. The value function $V^\pi(s)$ defines the expected return when following a given policy $\pi$, while the related action-value function $Q^\pi(s, a)$ gives the expected return of taking action $a$ in state $s$, and thereafter following $\pi$.
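To make these definitions concrete, here is a minimal tabular value-iteration sketch. The two-state MDP, its transitions, and its rewards are invented for illustration; the paper itself works with tree-structured representations rather than tables.

```python
# Sketch: tabular value iteration for a tiny episodic MDP.
# States, transitions and rewards here are illustrative, not from the paper.
GAMMA = 0.9

states = ["s0", "s1", "goal"]          # "goal" is terminal
actions = ["left", "right"]

# T[s][a] = list of (next_state, probability)
T = {
    "s0": {"left": [("s0", 1.0)], "right": [("s1", 1.0)]},
    "s1": {"left": [("s0", 1.0)], "right": [("goal", 0.9), ("s1", 0.1)]},
}
R = {"s0": 0.0, "s1": 0.0, "goal": 1.0}

V = {s: 0.0 for s in states}
for _ in range(200):  # iterate Bellman backups to (near) convergence
    for s in states:
        if s == "goal":
            V[s] = R[s]
            continue
        # V(s) = R(s) + gamma * max_a sum_s' T(s,a,s') V(s')
        V[s] = R[s] + GAMMA * max(
            sum(p * V[s2] for s2, p in T[s][a]) for a in actions
        )

# The greedy policy with respect to V
greedy = {
    s: max(actions, key=lambda a: sum(p * V[s2] for s2, p in T[s][a]))
    for s in T
}
```

An $\epsilon$-greedy agent would follow `greedy` with probability $1 - \epsilon$ and otherwise pick an action at random.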


If $T$ and $R$ are known, we can compute $\pi^*$ via value iteration [Sutton and Barto, 1998]. Further, we can measure the expected loss in discounted return of following a policy $\pi$ versus following $\pi^*$ using (3), which we refer to as the policy error. If the agent’s policy is unknown, we can approximate the policy error using (4):


If all episodes eventually terminate, then (4) will converge to (3). If our agent is $\epsilon$-greedy (that is, in all states, it has probability $\epsilon$ of executing an action chosen from $A$ at random), then termination in most MDPs is guaranteed:

Definition 1 (Proper Policy).

A policy $\pi$ is proper if, from all states $s \in S$, acting according to $\pi$ guarantees one eventually reaches some terminal state $s_g \in G$.

Lemma 1.

If an MDP has a proper policy $\pi$, then any policy $\pi'$ which is $\epsilon$-greedy with respect to $\pi$ is also proper.

2.2 Learning FMDPs when Fully Aware

If $T$ or $R$ are unknown, the agent must learn them using the data gathered from domain interactions. At time $t$, the sequential trial $d_t$ gives the current state $s_t$, action $a_t$, resulting state $s_{t+1}$ and the reward $r_{t+1}$ given on entering $s_{t+1}$. FMDPs allow one to learn $\pi^*$ for large MDPs by representing states as a joint assignment to a set of variables $\mathcal{X} = \{X_1, \ldots, X_n\}$ (written $s = \langle x_1, \ldots, x_n \rangle$). Similarly, the reward function is defined as a function of $\mathcal{X}_R \subseteq \mathcal{X}$, the variables which determine the reward received in each state. To exploit conditional independence, $T$ is then represented by a Dynamic Bayesian Network (DBN) [Dean and Kanazawa, 1989] for each action. That is, $T = \{D_a \mid a \in A\}$, where $D_a = (G_a, \Theta_a)$. Here, $G_a$ is a directed acyclic graph with nodes $\{X_1, \ldots, X_n, X'_1, \ldots, X'_n\}$ where, as is standard, node $X_i$ denotes the value of variable $i$ at the current time, while $X'_i$ denotes the same variable in the next time step. For each $X'_i$, $\mathrm{Par}(X'_i)$ defines the parents of $X'_i$. These are the only variables on which the value of $X'_i$ depends. We also make the common assumption that our DBNs contain no synchronic arcs [Degris and Sigaud, 2010], meaning $\mathrm{Par}(X'_i) \subseteq \{X_1, \ldots, X_n\}$.

This structure, along with the associated parameters $\Theta_a$, allows us to write transition probabilities as a product of independent factors: $P(s' \mid s, a) = \prod_i P(x'_i \mid \mathrm{par}_a(X'_i), a)$. Here, $\mathrm{par}_a(X'_i)$ is the projection of $s$ onto the variables in $\mathrm{Par}(X'_i)$, and each factor denotes the probability of variable $X'_i$ taking on value $x'_i$ given that the agent performs action $a$ when the parent variables have that assignment in the current time step. (If the context is clear, we condense this notation.)
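As an illustration of the factored product above, the following sketch computes $P(s' \mid s, a)$ for one action from per-variable conditional probability tables. All variable names, parent sets, and probabilities are invented (the boolean names merely echo the Coffee-Robot example used later in the paper):

```python
# Sketch: factored transition probability as a product of per-variable factors.
# Parent sets and CPT entries below are invented for illustration.

# Parents of each next-step variable under one fixed action
parents = {"wet'": ["wet", "rain"], "huc'": ["huc", "hrc"]}

# cpt[var][assignment to parents] = P(var = True | parents, action)
cpt = {
    "wet'": {(True, True): 1.0, (True, False): 1.0,
             (False, True): 0.8, (False, False): 0.0},
    "huc'": {(True, True): 1.0, (True, False): 1.0,
             (False, True): 0.7, (False, False): 0.0},
}

def prob_next(state, next_state):
    """P(s' | s, a) = product over variables of P(x'_i | Par(X'_i), a)."""
    p = 1.0
    for var, par in parents.items():
        key = tuple(state[x] for x in par)
        p_true = cpt[var][key]
        p *= p_true if next_state[var] else 1.0 - p_true
    return p

s = {"wet": False, "rain": True, "huc": False, "hrc": True}
p = prob_next(s, {"wet'": True, "huc'": True})  # 0.8 * 0.7
```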

Exploiting independence among belief variables doesn’t guarantee a compact representation of $T$. We must also exploit the context-specific independencies between assignments by representing $T$ and $R$ as decision trees, rather than tables of values.

Figure 1 shows an example decision tree for the reward function and for a conditional probability distribution. The leaves are either rewards, or a distribution over the values of a next-step variable. The non-leaves are test nodes, which perform a binary test of the form $X = x$ to check whether variable $X$ takes on the value $x$ in the current state. Notice that when the test at a node succeeds, the distribution at the resulting leaf is conditionally independent of the variables tested elsewhere in the tree.
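A minimal sketch of such a tree, assuming boolean tests and leaves holding $P(X' = \text{True})$; the structure and numbers are invented, not the trees of Figure 1:

```python
# Sketch: a conditional-probability decision tree with binary test nodes.
# Structure and probabilities are illustrative only.

class Leaf:
    def __init__(self, p_true):
        self.p_true = p_true  # P(X' = True) at this leaf

class Test:
    def __init__(self, var, value, if_true, if_false):
        self.var, self.value = var, value
        self.if_true, self.if_false = if_true, if_false

def lookup(node, state):
    """Walk the tree, applying binary tests X = x until a leaf is reached."""
    while isinstance(node, Test):
        node = node.if_true if state[node.var] == node.value else node.if_false
    return node.p_true

# Context-specific independence: if it is not raining, the outcome is
# independent of whether the agent holds an umbrella.
tree = Test("rain", True,
            if_true=Test("umbrella", True, Leaf(0.0), Leaf(1.0)),
            if_false=Leaf(0.0))

p_wet = lookup(tree, {"rain": True, "umbrella": False})
```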

(a) Reward
(b) Conditional Probability
Figure 1: Example decision trees

Given trials $d_{0:t}$, we can estimate the most likely DBN structure, tree structure and parameters, then subsequently estimate $\pi^*$ via a series of steps. Equation (5) is the Bayesian Dirichlet-equivalent (BDe) score [Heckerman et al., 1995], which estimates the posterior probability that the true parents of $X'_i$ are a candidate set $\mathbf{P}$ by first integrating out all the possible parameters of the local probability distribution:

$P(\mathrm{Par}(X'_i) = \mathbf{P} \mid d_{0:t}) \propto P(\mathbf{P}) \prod_{\mathbf{p}} \frac{\Gamma(\alpha_{\mathbf{p}})}{\Gamma(\alpha_{\mathbf{p}} + N_{\mathbf{p}})} \prod_{x'_i} \frac{\Gamma(\alpha_{x'_i,\mathbf{p}} + N_{x'_i,\mathbf{p}})}{\Gamma(\alpha_{x'_i,\mathbf{p}})} \quad (5)$

Here, $N_{x'_i,\mathbf{p}}$ denotes the number of trials in $d_{0:t}$ in which action $a$ was taken in a state where the joint assignment to $\mathbf{P}$ was $\mathbf{p}$, resulting in a state where $X'_i$ has the value $x'_i$ (with $N_{\mathbf{p}}$ and $\alpha_{\mathbf{p}}$ the corresponding sums over $x'_i$). The $\alpha$ terms are the hyper-parameters from the prior Dirichlet distribution over parameters, and act as “pseudo-counts” when data is sparse. The prior $P(\mathbf{P})$ is typically chosen to favour simple structures:


Equation (6) penalizes DBNs with many dependencies by attaching a cost for each parent in the candidate set. Note, if the space of possible DBNs is too large, we can restrict the parent sets considered reasonable by using common pruning heuristics or, for example, restricting the maximum in-degree.
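A sketch of how a candidate parent set might be scored under a BDe-style marginal likelihood with a per-parent structure penalty. The counts, the penalty constant `KAPPA`, and the uniform pseudo-count `ALPHA` are all invented placeholders, not the paper’s settings:

```python
import math

# Sketch: BDe-style scoring of a candidate parent set, with a structure prior
# that penalizes each extra parent. Counts below are invented.
KAPPA = 0.5    # structure-prior penalty per parent (illustrative)
ALPHA = 1.0    # Dirichlet pseudo-count per (value, parent-assignment) cell

def log_bde(counts, n_parents):
    """counts: {parent_assignment: {value: N}}. Returns log P(data, structure)."""
    score = n_parents * math.log(KAPPA)  # log of the simplicity prior
    for value_counts in counts.values():
        n_vals = len(value_counts)
        total = sum(value_counts.values())
        # Gamma-function ratio for this parent assignment (marginal likelihood)
        score += math.lgamma(ALPHA * n_vals) - math.lgamma(ALPHA * n_vals + total)
        for n in value_counts.values():
            score += math.lgamma(ALPHA + n) - math.lgamma(ALPHA)
    return score

# Counts of wet' under candidate parents {rain} (invented, nearly deterministic)
counts_rain = {(True,): {True: 18, False: 2}, (False,): {True: 1, False: 19}}
# Same data aggregated under the empty parent set
counts_none = {(): {True: 19, False: 21}}

better = log_bde(counts_rain, 1) > log_bde(counts_none, 0)
```

With these counts the informative parent set wins despite its structure penalty, because its marginal likelihood is far higher.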

Given $\mathrm{Par}(X'_i)$, we can then compute each variable’s most likely conditional probability tree structure via (7), restricting node tests to members of $\mathrm{Par}(X'_i)$:


Rather than evaluating the probabilities of all possible DT structures at each step, we can incrementally update the most likely DT as new trials arrive using incremental tree induction (ITI), as described in Utgoff et al. [1997]. While we lack the space to describe ITI in detail here, the algorithm broadly works by maintaining a single most-likely tree structure, with counts for all potential test assignments cached at intermediate nodes. As new trials arrive, the counts at relevant nodes become “stale”, as there might now exist an alternative test which could replace the current one, resulting in a higher value for equation (7). If such a superior test exists, the test at this node is replaced, and the tree structure is transposed to reflect this change. We can also use ITI to learn a tree structure for $R$ based on the trials seen so far. The only difference is that we use an information-gain metric to decide on the best test nodes, rather than (7).

Finally, given the tree structures, we compute the most likely parameters at each leaf via (9):


Once our agent has a transition and reward tree, we can then use structured value iteration (SVI) [Boutilier et al., 2000]—a variant of value iteration which works with decision trees instead of tables—to compute a compact representation of $\pi^*$. Algorithm 1 shows an outline of an incremental version of SVI (iSVI) [Degris et al., 2006], which allows the agent to gradually update its beliefs about the optimal value function in response to incoming trials. The algorithm takes the current estimate of the reward and transition functions, along with the previous estimate of the optimal value function, and combines them to produce a new estimate for each state-action function and value function. For further details about the merge and reduce functions used in SVI, consult Boutilier et al. [2000].

1:function IncSVI(Tree(R), Tree(T), Tree(V))
2:     for all a ∈ A: Tree(Q_a) ← Regress(Tree(V), Tree(T), Tree(R), a)
3:     Tree(V) ← Merge({Tree(Q_a) | a ∈ A}) (using maximization as the combination function)
4:     return Tree(V), {Tree(Q_a) | a ∈ A}
Algorithm 1 Incremental SVI [Degris et al., 2006]
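Ignoring the decision-tree machinery, the backup in Algorithm 1 corresponds to one synchronous Bellman update per call. A tabular sketch, with an invented two-state model and discount:

```python
# Sketch: tabular analogue of one incremental SVI step.
# Q_a <- R + gamma * T V_prev ; V <- max_a Q_a (merge via maximization).
# The states, model, and discount are invented for illustration.
GAMMA = 0.9
states, actions = ["s0", "s1"], ["a0", "a1"]
R = {"s0": 0.0, "s1": 1.0}
T = {("s0", "a0"): {"s0": 1.0}, ("s0", "a1"): {"s1": 1.0},
     ("s1", "a0"): {"s1": 1.0}, ("s1", "a1"): {"s0": 1.0}}

def inc_svi(R, T, V_prev):
    # One backup: build all Q_a from the previous V, then merge by maximization
    Q = {(s, a): R[s] + GAMMA * sum(p * V_prev[s2]
                                    for s2, p in T[(s, a)].items())
         for s in states for a in actions}
    V = {s: max(Q[(s, a)] for a in actions) for s in states}
    return Q, V

V = {s: 0.0 for s in states}
for _ in range(3):  # in the full algorithm, one call per incoming trial
    Q, V = inc_svi(R, T, V)
```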

This section took an encapsulated approach to learning (in contrast to a unified one in, e.g., Degris et al. [2006]). This means we separate the task of finding an optimal DBN structure from the task of learning each local DT structure. Such an approach significantly reduces the space of DTs that must be considered, but more importantly, provides us with posterior distributions over parent structures. We will use these posterior distributions in section 3.2 to conserve information when adapting to new discoveries.

3 Overcoming Unawareness

So far, we’ve assumed our agent was aware of all relevant belief variables in , all actions , and all members of . We now drop this assumption. From here onward we denote the true set of belief variables, actions, and reward scope as , and , and the learner’s awareness of them at as , , and

Suppose , , , . We assume the agent can’t observe the value of variables it is unaware of. In the medical example from before, if corresponds to a particular gene, then we assume the agent cannot detect the presence or absence of that gene if it is unaware that it exists. Similarly, we assume the agent cannot perform an action it is unaware of.333This assumption, while reasonable, may always not hold (E.g., an agent may lean on a button while unaware it is part of the task). As a consequence, at time , the agent does not directly observe the true trial , but rather . The key point here is that awareness of those missing factors may be crucial to successfully learning an optimal policy. For example, the transition between observed states may not obey the markov property unless is observed, the best action may depend upon whether is true, or the optimal policy may sometimes involve performing . The next sections aims to answer two main questions. First, by what mechanisms can an agent discover and overcome its own unawareness by asking for help? Second, when an agent discovers a new belief variable or action, how can they integrate it into their current model while conserving what they have learned from past experience?

3.1 Expert Guidance

Our agent can expand its awareness via advice from an expert. Teacher-apprentice learning is common in the real world, as it allows learners to receive contextually relevant advice which may inform them of new concepts they would not otherwise encounter.

This paper assumes the expert has full knowledge of the true MDP, and is cooperative and infallible. Further, we abstract away the complexity of grounding natural language statements in a formal semantics and instead assume that the agent and expert communicate via a pre-specified formal language (though see e.g., Zettlemoyer and Collins [2007] for work on this problem). We do not, however, assume the expert knows the agent’s current beliefs about the decision problem.

As argued in the introduction, the goal is to provide a minimal set of communicative acts so that interaction between the agent and expert proceeds analogously to human teacher-apprentice interactions. Concretely, this means we want our system to have two properties. First, the expert should, for the most part, allow the agent the opportunity to learn by themselves, interjecting only when the agent is performing sufficiently poorly, or when the agent explicitly asks for advice. Secondly, following the Gricean maxims of conversation [Grice, 1975], the expert should provide non-exhaustive answers to queries, giving just enough information to resolve the agent’s current query. We want this because in real-world tasks with human experts it may be impossible to explain all details of a problem due to the cognitive constraints of the expert or costs associated with communication.

The next sections identify three types of advice whose combination guarantees the agent behaves optimally in the long run, regardless of initial awareness.

3.1.1 Better Action Advice

If the expert sees the agent perform a sub-optimal action, it can tell the agent a better action it could have taken instead. For example: “When it is raining, take your umbrella instead of your sun hat”. Our goal is to avoid incessantly interrupting the agent each time it makes a mistake, so we specify the following conditions for when the agent is performing sufficiently poorly to warrant correction: Let $t$ be the current (global) time step, corresponding to the $i$th step in the current episode, and consider the time at which the expert last uttered advice. When (10-12) hold, the expert utters advice of the form (13):


Equation (10) ensures some minimum time has passed since the expert last gave advice. Equation (11) ensures the expert won’t interrupt unless its estimate of the agent’s policy error is above some threshold, or the agent is unable to reach a terminal state within some reasonable bound on episode length (which is required because the agent’s unawareness of relevant actions may mean its current $\epsilon$-greedy policy is not proper). If episode $e$ is unfinished, the expert estimates the expected return optimistically, assuming the agent will follow $\pi^*$ from now on. Taken together, these two thresholds describe the expert’s tolerance towards the agent’s mistakes. Finally, (12) ensures a better action actually exists at this time step.
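Schematically, the interrupt test combining (10)-(12) might look like the following predicate. All threshold values, and the assumption that the expert’s policy-error estimate is available as a single number, are illustrative:

```python
# Sketch: when should the expert interrupt? (conditions (10)-(12), schematically).
# All thresholds and the error estimate are illustrative placeholders.
MIN_GAP = 50          # (10): minimum steps between utterances
ERR_THRESHOLD = 0.1   # (11): tolerated estimated policy error
MAX_EPISODE = 200     # (11): steps before an episode counts as "stuck"

def should_interrupt(t, t_last_advice, policy_error, episode_len,
                     best_action, agent_action):
    if t - t_last_advice < MIN_GAP:                                   # (10)
        return False
    poor = policy_error > ERR_THRESHOLD or episode_len > MAX_EPISODE  # (11)
    better_exists = best_action != agent_action                       # (12)
    return poor and better_exists

ok = should_interrupt(t=300, t_last_advice=100, policy_error=0.3,
                      episode_len=20, best_action="getu", agent_action="move")
```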

Equation (13) is the expert’s utterance, and the state term in it requires explanation. On first thought, the expert should explicitly state the full description of the state in which the better action applies. However, remember that the agent’s awareness may be a tiny subset of the true set of variables. Uttering such advice may involve enumerating a huge number of variables the agent is currently unaware of. This is exactly the type of exhaustive explanation we wish to avoid, since such an explanation may place a cognitive burden on the expert, or confuse a learner. Conversely, we could instead have our expert project its intended utterance onto only those variables for which the expert has explicit evidence the agent is aware of them. This can be understood by the agent without being made aware of any new variables, but might violate our assumption that the expert is truthful: two distinct true states may have identical projections onto the agent’s known variables.

The solution is to use a sense-ambiguous term whose intended denotation is the true state, but whose default interpretation by the agent is that state’s projection onto the agent’s known variables. In words, it is as if the expert says “In the last step, it would have been better to do the recommended action than the one you took.”

Thus, by introducing ambiguity, the agent can interpret the advice in two ways. The first is as a partial description of the true problem, which is monotonically true regardless of what it learns in future. On hearing (13), the agent adds (14-15) to its knowledge:


Additionally however, the agent can choose to add its current default interpretation of the advice to its accumulated knowledge:


The agent can then act on the expert’s advice directly, choosing the recommended action whenever its default interpretation of the advised state holds, regardless of what seems likely from its learned model. We can see that even with a cooperative and infallible expert, and even abstracting away issues of grounding natural language, misunderstandings can still happen due to differences in agent and expert awareness. As the next section shows, such misunderstandings can reveal gaps in the agent’s awareness and help it articulate queries whose answers guarantee the agent expands its awareness.

Lemma 2 guarantees the expert’s advice strategy reveals unforeseen actions to the agent so long as its performance in trials exceeds the expert’s tolerance. (Proofs of lemmas/theorems are in the technical supplement.)

Lemma 2.

Consider an FMDP where $\pi^*$ is proper, an agent whose known actions are a strict subset of the true actions, and an expert acting with respect to (10-13). Then in the limit, either the agent’s policy error falls within the expert’s tolerance, or the expert utters advice (13) mentioning an action the agent was unaware of.

3.1.2 Resolving Misunderstandings

We noted before that the agent’s defeasible interpretation of expert advice could result in misunderstandings. To illustrate, suppose the agent receives advice (17) and (18) at two different times:


While the intended meaning of each statement is true, the agent’s default interpretations of the two advised states may be identical. From the agent’s perspective, (17) and (18) conflict, and thus give the agent a clue that its current awareness of the belief variables is deficient. To resolve this conflict, the agent asks (19) (in words, “which variable has distinct values in the two states?”) and receives an answer of the form (20):


Notice there may be multiple variables whose assignments differ in the two states. Thus, the expert’s answer can be non-exhaustive, providing the minimum amount of information to resolve the agent’s conflict without necessarily explaining all components of the task. This means the agent must abandon its previous defeasible interpretation of (16), but can keep (14-15), as these are true regardless of known variables. Lemma 3 guarantees the expert will reveal new belief variables, provided such misunderstandings can still arise.

Lemma 3.

Consider an FMDP where $\pi^*$ is proper and an agent whose known belief variables are a strict subset of the true ones. Then in the limit, either no further conflicts of the above form arise, or the expert utters an answer (20) revealing a variable the agent was unaware of.

3.1.3 Unexpected Rewards

In typical FMDPs (where the agent is assumed fully aware of all variables, actions, and the reward scope), we tend only to think of the trials as providing counts, but for an unaware agent, a trial also encodes monotonic information:


This constrains the form of the reward function the agent must learn. Recall that the agent’s known variables may include only a subset of the true reward scope, so it might be impossible to construct a reward function satisfying all descriptions (21) gathered so far. Further, the extra variables in the true reward scope may not yet be in the agent’s awareness. To resolve this, if the agent fails to construct a valid reward function, it asks (22) (in words, “which variable that I don’t already know is in the reward scope?”), receiving an answer (23):


Again, the agent may be unaware of many variables in the true reward scope, so (23) may be non-exhaustive. Even so, we can guarantee that the agent’s learned reward function eventually equals the true one:

Lemma 4.

Consider an FMDP where $\pi^*$ is proper and an agent whose known reward scope is a strict subset of the true one. Then in the limit, there exists a time after which the agent’s learned reward function agrees with the true reward function on all states reachable under the agent’s policy.

3.2 Adapting the Transition Function

Section 3.1 showed three ways the agent could expand its awareness of the belief variables, actions, and reward scope. If we wish to improve on the naive approach of restarting learning when faced with such expansions, we must now specify how the agent adapts its transition model and decision trees to such discoveries.

Adapting upon discovering a new action $a^*$ at time $t$ is simple: since the agent hasn’t performed $a^*$ in any previous trial, it can just create a new DBN for $a^*$ using the priors outlined in section 2.2, and add it to the current set of DBNs.

The more difficult issue is adapting upon discovering a new belief variable $X^*$. The main problem is that the agent’s current distributions over DBNs no longer cover all possible parent sets for each variable, nor all DTs. For example, the current distribution over parent sets does not include the possibility that $X^*$ is a parent of some existing variable. Worse, since we assume in general that the agent cannot observe $X^*$’s past values, it cannot recover the historical counts involving $X^*$. The $\alpha$-parameters involving $X^*$ are also undefined, yet we need them to calculate structure probabilities (5, 7) and parameters via (9).

The problem is that new variables make the size of each (observed) state dynamic, in contrast to standard problems where it is static. We could phrase this as a missing data problem: $X^*$ was hidden in the past but visible in future states, so treat the problem as a POMDP and estimate missing values via, e.g., expectation maximization [Friedman, 1998]. However, such methods commit us to costly passes over the full state-action history, and make it hard to learn DT structures with enough sparseness to ensure a compact value function. Alternatively, we could ignore states with missing information whenever counts involving $X^*$ are required, scoring different candidate structures on different subsets of the data. However, as Friedman and Goldszmidt [1997] point out, most structure scores, including (5), assume we evaluate models with respect to the same data. If two models are compared using different data sets (even if they come from the same underlying distribution), the learner tends to favour the model evaluated with the smaller amount of data. Instead, our method discards the data gathered during the learner’s previous deficient view of the hypothesis space, but conserves the relative posterior probabilities learned from past data to construct new priors for the structures and parameters in the expanded belief space.

3.2.1 Parent Set Priors

On discovering $X^*$, the agent must update the distribution over parent sets for each variable and action to include parent sets containing $X^*$. In (24) we construct a new prior using the old posterior:


This preserves the relative likelihoods among the parent sets that do not include $X^*$. It also maintains our bias towards simpler structures by re-assigning only a portion of the probability mass to parent sets including $X^*$. To define the distribution over parent sets for the newly discovered variable $X^*$ itself, we default to (6), since the agent has no evidence (yet) concerning $X^*$’s parents.
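The idea behind (24) can be sketched as follows: keep the old posterior’s relative weights for parent sets without the new variable, and reserve a fixed fraction of mass for sets that include it. The split parameter `RHO` and the exact redistribution scheme are simplifying assumptions for illustration, not necessarily the paper’s precise formula:

```python
# Sketch of the idea behind (24): build a prior over parent sets after
# discovering a new variable, conserving relative posterior probabilities
# among old sets. RHO and the redistribution scheme are assumptions.
RHO = 0.3  # fraction of mass reassigned to parent sets containing the new var

def expanded_prior(old_posterior, new_var):
    """old_posterior: {frozenset(parents): prob}. Returns the expanded prior."""
    prior = {}
    for parents, p in old_posterior.items():
        prior[parents] = (1 - RHO) * p         # old sets keep their ratios
        prior[parents | {new_var}] = RHO * p   # mass for sets with the new var
    return prior

old = {frozenset(): 0.2, frozenset({"rain"}): 0.8}
new = expanded_prior(old, "gene")
# Ratios among the old sets are preserved: 0.14 / 0.56 == 0.2 / 0.8
```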

3.2.2 Decision Tree and Parameter Priors

We must also update the tree structures and parameters to accommodate $X^*$. Here, we return to the issue of the counts and the associated $\alpha$-parameters. As mentioned earlier, we wish to avoid the complexity of estimating $X^*$’s past values. Instead, we throw away the past counts, but retain the relative likelihoods they gave rise to by packing these into new $\alpha$-parameters, as shown in (25-26):


Equation (25) summarizes the old data via inferences on the old best DBNs, then encodes these inferences in the new $\alpha$-parameters. The revised $\alpha$-parameters ensure the new tree structure prior and expected parameters defined via (26) and (9) are biased towards models the agent previously thought were likely. Indeed, the larger the (user-specified) pseudo-count weight is, the more the distributions learned before discovering $X^*$ influence the agent’s reasoning after discovering $X^*$.

3.3 Adapting Reward and Value Trees

On becoming aware that $X^*$ is part of the reward scope, the agent may wish to restructure its reward tree. This is because awareness of $X^*$ means there are tests of the form $X^* = x$ that the agent has not yet tried, which may produce a more compact tree. In the language of ITI, the current test nodes are “stale”, and must be re-checked to see if a replacement test would yield a tree with better information gain. If the agent was unaware of $X^*$ in past trials, we can still test on assignments to it by following the ITI convention that any state where $X^*$’s value is missing automatically fails any test on $X^*$.

Once we have updated the transition and reward trees, there is no need to make further changes to the value tree in response to a new action $a^*$ or variable $X^*$. In effect, this encodes our conservative intuition that the true value function is more likely to be close to the agent’s current estimate than to some arbitrary alternative. The agent essentially assumes (in the absence of further information) that the value of a state is indifferent to the newly discovered factor. In subsequent trials where the agent performs $a^*$ or observes $X^*$, algorithm 1 ensures information about this new factor is incorporated into the agent’s value function.

1:function LearnFMDPU(, , , , , )
2:     for  do
3:          -greedy(, , )
4:          Update model via (5-9) & ITI
5:         if Update to fails then
6:               Ask expert (19)
7:               Append to each
8:               Update via ITI          
9:         if  (10-12) are true then
10:               Expert advice of form (13)
11:              if  mentions action  then
13:                   made via (6)               
14:              if  conflicts with  then
15:                   Ask expert (19)
17:         if  then
18:               Update via (25, 5, 26, 7, 9)          
19:          IncSVI()      
Algorithm 2 Learning FMDPs with Unawareness

Algorithm 2 outlines how the agent updates its model, awareness, and value estimates in response to new data and expert advice. Given algorithm 2, theorem 1 guarantees our agent behaves indistinguishably from a near-optimal policy in the long run, regardless of initial awareness (provided all unforeseen variables are relevant to expressing the optimal policy).

Theorem 1.

Consider an FMDP where $\pi^*$ is proper and an agent which starts with partial awareness and acts according to algorithm 2. If every unforeseen variable is relevant to the optimal policy (for each such variable there exists a pair of states, differing only on that variable, with different optimal values), then in the limit the agent’s behaviour comes within the expert’s tolerance of $\pi^*$.

4 Experiments and Results

Our experiments show that agents following algorithm 2 converge to near-optimal behaviour in both theory and practice. Further, we show that conserving information gathered before each new discovery allows our agent to learn faster than one which abandons this information. We do not investigate assigning an explicit budget to agent-expert communication, leaving this to future work. However, we do show how varying the expert’s tolerance affects the agent’s performance.

We test agents on two well-known problems: Coffee-Robot and Factory (full specifications at https://cs.uwaterloo.ca/~jhoey/research/spudd/index.php). In each, our agent begins with only partial awareness of the variables, actions, and reward scope. The agent takes actions over a fixed number of time steps, using an $\epsilon$-greedy policy. When the agent enters a terminal state, we reset it to one of the initial states at random. We use the cumulative reward across all trials as our evaluation metric, which acts as a proxy for the quality of the agent’s policy over time. To make the results more readable, we apply a discount at each global step, resulting in a discounted cumulative reward metric.
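The evaluation metric amounts to a running, per-step-discounted sum of rewards; a small sketch, where the discount constant is an invented placeholder:

```python
# Sketch: discounted cumulative reward across all trials.
# The discount constant is an illustrative placeholder, not the paper's value.
C = 0.99999  # per-global-step discount applied for readability

def discounted_cumulative(rewards):
    """Return the running discounted cumulative reward curve."""
    total, curve = 0.0, []
    for t, r in enumerate(rewards):
        total += (C ** t) * r
        curve.append(total)
    return curve

curve = discounted_cumulative([0.0, 1.0, 0.0, 1.0])
```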

We test several variants of our agent to show the effectiveness of our approach. The default agent follows algorithm 2 as is, with fixed settings for the parameters in equations (6), (24), (25), and (10-12). The nonConservative agent does not conserve information about structures or parameters via (24-26) when a new factor is discovered. Instead, it resets them to their initial values. This agent is included to show the value of conserving past information as the agent’s awareness expands. The truePolicy and random agents start with full knowledge of the true FMDP, and execute an $\epsilon$-greedy version of $\pi^*$, or choose a random action, respectively. These agents provide an upper/lower bound on performance. The lowTolerance/highTolerance agents lower or raise the expert’s tolerance thresholds.

4.1 Coffee Robot

Coffee-Robot is a small sequential problem where a robot must purchase coffee from a cafe, then return it to its owner. Also, the robot gets wet if it has no umbrella when it rains. The problem has 6 boolean variables—huc (user has coffee), hrc (robot has coffee), r (raining), w (wet), l (location), u (umbrella)—and 4 actions—move, delc, buyc and getu—making 256 state/action pairs. The terminal states are those where the user has coffee; initial states are all non-terminal ones. Our agent begins aware of only a subset of these variables and actions. (The original discount factor was changed to make $\pi^*$ proper.)

(a) Coffee Robot (average of 50 experiments)
(b) Factory (average of 20 experiments)
Figure 2: Cumulative rewards. Shaded areas represent standard error from the mean.

Figure 2(a) shows each agent’s (discounted) cumulative reward. Despite starting unaware of factors critical to success, the default agent quickly discovers the relevant actions and beliefs with the expert’s aid, and converges on the optimal policy. The non-conservative agent also learns the optimal policy, but takes longer. This shows the value of conserving structure and parameter information on discovering new beliefs. We also see how expert tolerance affects performance. The agent paired with the high-tolerance expert learns a (marginally) worse final policy, but this makes little difference to cumulative reward. Figure 3 shows why: the agent learned a “good enough” policy, so the expert doesn’t reveal the “get umbrella” (getu) action, which yields only a minor increase in reward. Figure 4 supports this explanation, showing that more tolerant experts reveal fewer variables over time.

Figure 3: Typical final policy depending on tolerance. (a) Default tolerance. (b) High tolerance.

Figure 4: Awareness of and over time. (a) Coffee Robot task. (b) Factory task.

4.2 Factory

Factory is a larger problem (, , 774144 state/action pairs), which shows our method works on more realistically sized tasks. Here, an agent must shape, paint and connect two widgets to create products of varying quality. Some actions (like bolting) produce high-quality products, whereas others (like gluing) produce low-quality products. The agent receives a higher reward for producing goods which match the demanded quality. (Footnote 7: rewards were scaled to the range 0.0–1.0 and, to make proper, terminal states which previously gave reward were given a small reward of .) The terminal states are those where ; initial states are non-terminals where it is possible to connect two components. Our agent's initial awareness is , , with . This represents a simplified task where the agent thinks the only goal is connecting the widgets.

Figure 2(b) shows results similar to the previous experiment. The default agent converges on optimal behaviour, and does so more quickly than the non-conservative agent. Varying the expert's tolerance now has a larger effect on the rate at which factors are discovered and on convergence towards the optimal policy, presumably because this larger problem contains many more unforeseen variables and actions for the agent to discover.

5 Related Work

Models of unawareness exist in logic and game theory [Board et al., 2011, Heifetz et al., 2013, Feinberg, 2012], but they interpret (un)awareness from an omniscient view. We instead model awareness from the agent's view and offer methods for an agent to overcome its own unawareness.

Rong [2016] defines unawareness similarly to us, using markov decision processes with unawareness (mdpus) to learn optimal behaviour when an agent starts unaware of some actions. They apply mdpus to a robotic-motion problem with around 1000 discretised atomic states. The agent uses an explore move, which randomly reveals useful motions it was previously unaware of. Our work differs from theirs in several ways. First, we provide a concrete mechanism for discovering unforeseen factors via expert advice, rather than random discovery from the agent's own exploration. Second, we allow the agent to discover explicit belief variables rather than atomic states, and focus more on exploiting the inherent structure in problems with a large number of features. This enables us to scale up to complex decision problems, where the agent converges on an optimal policy in a (true) state space of around a million atomic states, as opposed to around 1000. McCallum and Ballard [1996] also learn an increasingly complex representation of the state space by gradually distinguishing between states which yield different rewards. Rather than dealing with unawareness, however, their approach focuses on refining an existing state space: it does not support introducing unforeseen states or actions that the learner was unaware of before learning.

Several works use expert interventions to improve performance via reward shaping and corrections [Knox and Stone, 2009, Torrey and Taylor, 2013]. Yet all such methods assume the expert's intended meaning can be understood without expanding the agent's current state and action space. Our work allows experts to utter advice whose ambiguity arises from their greater awareness of the problem.

6 Conclusion

We have presented an agent-expert framework for learning optimal behaviour in both small and large fmdps even when one starts unaware of factors critical to success. Further, we’ve shown that conserving one’s beliefs helps improve the effectiveness of learning. In future work, we aim to lift some assumptions imposed on the expert, and expand the expressiveness of its advice. For instance, we could let the expert be fallible, or allow questions on the structure of , as Masegosa and Moral [2013] do for Bayesian Networks.


  • Araya-López et al. [2011] M. Araya-López, O. Buffet, V. Thomas, and F. Charpillet. Active learning of MDP models. In European Workshop on Reinforcement Learning, pages 42–53. Springer, 2011.
  • Board et al. [2011] O. J. Board, K.-S. Chung, and B. C. Schipper. Two models of unawareness: Comparing the object-based and the subjective-state-space approaches. Synthese, 179(1):13–34, 2011.
  • Boutilier et al. [2000] C. Boutilier, R. Dearden, and M. Goldszmidt. Stochastic dynamic programming with factored representations. Artificial Intelligence, 121(1):49–107, Aug. 2000. ISSN 0004-3702. doi: 10.1016/S0004-3702(00)00033-3. URL http://www.sciencedirect.com/science/article/pii/S0004370200000333.
  • Cakmak and Thomaz [2012] M. Cakmak and A. Thomaz. Designing robot learners that ask good questions. Proceedings of the 7th Annual ACM/IEEE International Conference on Human-Robot Interaction, 2012.
  • Coenen et al. [2017] A. Coenen, J. D. Nelson, and T. M. Gureckis. Asking the right questions about human inquiry. PsyArXiv, 2017.
  • Dean and Kanazawa [1989] T. Dean and K. Kanazawa. A model for reasoning about persistence and causation. Computational intelligence, 5(2):142–150, 1989.
  • Degris and Sigaud [2010] T. Degris and O. Sigaud. Factored markov decision processes. Markov Decision Processes in Artificial Intelligence, pages 99–126, 2010.
  • Degris et al. [2006] T. Degris, O. Sigaud, and P.-H. Wuillemin. Learning the structure of factored markov decision processes in reinforcement learning problems. In Proceedings of the 23rd International Conference on Machine Learning, pages 257–264. ACM, 2006.
  • Doshi-Velez [2009] F. Doshi-Velez. The infinite partially observable Markov decision process. In Advances in neural information processing systems, pages 477–485, 2009.
  • Feinberg [2012] Y. Feinberg. Games with unawareness. 2012.
  • Friedman [1998] N. Friedman. The Bayesian structural EM algorithm. In Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence, pages 129–138. Morgan Kaufmann Publishers Inc., 1998. URL http://dl.acm.org/citation.cfm?id=2074110.
  • Friedman and Goldszmidt [1997] N. Friedman and M. Goldszmidt. Sequential update of Bayesian network structure. In Proceedings of the Thirteenth conference on Uncertainty in artificial intelligence, pages 165–174. Morgan Kaufmann Publishers Inc., 1997. URL http://dl.acm.org/citation.cfm?id=2074246.
  • Grice [1975] H. P. Grice. Logic and conversation. In Syntax and Semantics 3: Speech Acts, pages 41–58, 1975.
  • Guestrin et al. [2003] C. Guestrin, D. Koller, R. Parr, and S. Venkataraman. Efficient solution algorithms for factored MDPs. Journal of Artificial Intelligence Research, 19:399–468, 2003.
  • Heckerman et al. [1995] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine learning, 20(3):197–243, 1995.
  • Heifetz et al. [2013] A. Heifetz, M. Meier, and B. C. Schipper. Dynamic unawareness and rationalizable behavior. Games and Economic Behavior, 81:50–68, 2013.
  • Karni and Viero [2013] E. Karni and M.-L. Viero. "Reverse Bayesianism": A choice-based theory of growing awareness. American Economic Review, 103(7):2790–2810, 2013.
  • Masegosa and Moral [2013] A. R. Masegosa and S. Moral. An interactive approach for Bayesian network learning using domain/expert knowledge. International Journal of Approximate Reasoning, 54(8):1168–1181, Oct. 2013. ISSN 0888-613X. doi: 10.1016/j.ijar.2013.03.009. URL http://www.sciencedirect.com/science/article/pii/S0888613X13000698.
  • McCallum and Ballard [1996] A. K. McCallum and D. Ballard. Reinforcement learning with selective perception and hidden state. PhD thesis, University of Rochester. Dept. of Computer Science, 1996.
  • Rong [2016] N. Rong. Learning in the Presence of Unawareness. PhD thesis, Cornell University, 2016.
  • Knox and Stone [2009] W. B. Knox and P. Stone. Interactively shaping agents via human reinforcement: The TAMER framework. In Proceedings of the Fifth International Conference on Knowledge Capture, 2009. URL http://www.cs.utexas.edu/users/ai-lab/?KCAP09-knox.
  • Sutton and Barto [1998] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
  • Torrey and Taylor [2013] L. Torrey and M. Taylor. Teaching on a budget: Agents advising agents in reinforcement learning. In Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems, pages 1053–1060. International Foundation for Autonomous Agents and Multiagent Systems, 2013.
  • Utgoff et al. [1997] P. E. Utgoff, N. C. Berkman, and J. A. Clouse. Decision tree induction based on efficient tree restructuring. Machine Learning, 29(1):5–44, 1997.
  • Zettlemoyer and Collins [2007] L. Zettlemoyer and M. Collins. Online learning of relaxed CCG grammars for parsing to logical form. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007.