Reasoning about Unforeseen Possibilities During Policy Learning

by Craig Innes et al.

Methods for learning optimal policies in autonomous agents often assume that the way the domain is conceptualised---its possible states and actions and their causal structure---is known in advance and does not change during learning. This is an unrealistic assumption in many scenarios, because new evidence can reveal important information about what is possible, possibilities that the agent was not aware existed prior to learning. We present a model of an agent which both discovers and learns to exploit unforeseen possibilities using two sources of evidence: direct interaction with the world and communication with a domain expert. We use a combination of probabilistic and symbolic reasoning to estimate all components of the decision problem, including its set of random variables and their causal dependencies. Agent simulations show that the agent converges on optimal policies even when it starts out unaware of factors that are critical to behaving optimally.





1 Introduction

Consider the following decision problem from the domain of crop farming (inspired by Kristensen and Rasmussen (2002)): Each harvest season, an agent is responsible for deciding how best to grow barley on the land it owns. At the start of the season, the agent makes some decisions about which grain variety to plant and how much fertiliser to use. Come harvest time, those initial decisions affect the yield and quality of the crops harvested. We can think of this problem as a single-stage (or “one-shot”) decision problem, in which the agent chooses one action based on a set of observations, then receives a final reward based on the outcome of its action.

Suppose the agent has experience from several harvests, and believes it has a good idea of the best seeds and fertiliser to use for a given climate. One harvest, something totally unexpected happens: Despite choosing what it thought were the best grains and fertiliser, many of the crops have come out deformed and of poor quality. A neighbouring farmer tells the agent the crops have been infected by a fungus that spreads in high temperatures, and that the best way to protect crops in future is to apply fungicide. The agent must now revise its model of the environment and its reward function to account for fungus (a concept it previously was not aware of), extend its available actions to include fungicide application (an action it previously did not realise existed), reason about the probabilistic dependencies these new concepts have to the ones it was already aware of, and reason about how they affect rewards.

This example illustrates at least three challenges, the combination of which typically is not handled by current methods for learning optimal decision policies (we defer a detailed discussion of related work to Section 6):

  • In addition to starting unaware of the probabilistic dependency structure of the decision problem, the agent starts unaware of even the true hypothesis space of possible problem structures, including the sets of possible actions and environment variables, and their causal relations.

  • Domain exploration alone might not be enough to discover these unknown factors. For instance, it is unlikely the agent would discover the concept of fungus or the action of fungicide application by just continuing to plant crops. External expert instruction is necessary to overcome unawareness.

  • An expert might interject with contextually relevant advice during learning, not just at the beginning of the problem. Further, that advice may refer to concepts which are not a part of the agent’s current model of the domain.

In the face of such strong unawareness, one might be tempted to side-step learning an explicit model of the problem, and instead learn optimal behaviour directly from the data. Deep reinforcement learning (e.g., Mnih et al. (2015)) has proved extremely useful for learning implicit representations of large problems, particularly in domains where the input sensory streams are complex and high-dimensional (e.g., computer vision). However, in many such models, the focus of attention for abstraction and representation learning is on perceptual features rather than causal relations (see, e.g., Pearl (2017)) and other decision-centric attributes. For instance, in works such as Chen et al. (2015), who demonstrate an end-to-end solution to the problem of autonomous driving, the decisions are not elaborated beyond "follow the lane" or "change lanes", although significant perceptual representation learning may need to happen in order to work with the predicate "lane" within the raw sensory streams. For safety-critical decisions, or ones involving significant investment (e.g. driving a car, advising on a medical procedure, deciding on crops to grow for the year), it is important¹ that a system can explain the reasoning behind its decisions so the user can trust its judgement.

¹As evidenced by significant recent interest from agencies interested in the application of AI, e.g., the DARPA Explainable AI programme, DARPA-BAA-16-53 (2016).

We present a learning agent that, in a complementary approach to representation learning in batch mode from large corpora, uses evidence and a reasoning mechanism to incrementally construct an interpretable model of the decision problem, based on which optimal decision policies are computed. The main contributions of this paper are:

  • An agent which, starting unaware of factors on which an optimal policy depends, learns optimal behaviour for single-stage decision problems via direct experience and advice from a domain expert. The learning uses decision networks to represent beliefs about the variables and causal structure of the decision problem, providing a compact and interpretable model. Crucially, the agent can revise all components of this model, including the set of random variables, their causal dependencies, and the domain of the reward function (Section 4).

  • A communication framework via which an expert can offer both solicited and unsolicited advice to the agent during learning: that is, the expert advice is offered piecemeal, in reaction to the agent’s latest attempts to solve the task, rather than all of it being conveyed prior to learning. Messages from the expert can include entirely new concepts which the agent was previously unaware of, and provide important qualitative information to complement the quantitative information conveyed by statistical correlations in the domain trials (Section 3.2).

  • Experiments across a suite of randomly generated decision problems, which demonstrate that our agent can learn the optimal policy from evidence, even when it is initially unaware of variables that are critical to its success (Section 5).

The kinds of applications we ultimately have in mind for this work include tasks in which there is a need for flexible and robust responses to a vast array of contingencies. In particular, we are interested in the paradigm of continual (or life-long) learning (Silver, 2011; Thrun and Pratt, 2012), wherein the agent must continually and incrementally add to its knowledge, and so revise the hypothesis space of possible states and actions within which decisions are made. In this context, there is a need for autonomous model management (Liebman et al., 2017), which calls for reasoning about what the hypothesis space is, in addition to policy learning within those hypothesis spaces.

2 The Learning Task

We consider learning in single-stage decision problems. In these problems, the agent chooses an action based on a set of initial observations, then immediately receives a final reward based on the outcome of its action. Subsequent repetitions of the same decision problem have mutually independent initial observations, and the immediate reward depends only on the current action and its outcome. Solving the problem of unforeseen possibilities for single-stage scenarios is a necessary first step towards the long-term goal of extending this work to multi-stage, sequential decision problems.

To learn an optimal decision policy, the agent must compute which action will maximise its expected reward, given its observations of the state in which the action is to be performed. Formally, the optimal action $\mathbf{a}^*$ given observations $\mathbf{o}$ is the action which maximizes expected utility:

$$\mathbf{a}^* = \operatorname*{arg\,max}_{\mathbf{a} \in Dom(\mathbf{A})} \sum_{\mathbf{s} \in Dom(\mathbf{X})} P(\mathbf{s} \mid \mathbf{a}, \mathbf{o})\, R(\mathbf{s}) \qquad (1)$$

Here, $\mathbf{A}$ is the set of action variables and $\mathbf{X}$ is the set of chance variables (or state variables). An action $\mathbf{a} = (a_1, a_2, \ldots)$ is an assignment to each of the action variables in $\mathbf{A}$, such that $\mathbf{a} \in Dom(\mathbf{A})$ with $a_1 \in Dom(A_1)$, $a_2 \in Dom(A_2)$, etc. (We use $Dom(\cdot)$ applied to a set of variables to denote the Cartesian product of the domains of its members.) Similarly, a state $\mathbf{s}$ is an assignment to each chance variable in $\mathbf{X}$. The reward received in state $\mathbf{s}$ is $R(\mathbf{s})$.
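As an entirely illustrative sketch, the expected-utility maximisation in equation (1) can be computed by brute-force enumeration on a tiny problem. The variable names, probabilities and rewards below are our own stand-ins, not the paper's:

```python
# A brute-force version of the expected-utility maximisation in equation (1),
# on a made-up single-action barley-style problem. `p_outcome` and `reward`
# are illustrative stand-ins for the dn's probabilistic and reward components.

def expected_utility(action, obs, p_outcome, reward):
    """Sum over outcome states s of P(s | action, obs) * R(s)."""
    return sum(p * reward[s] for s, p in p_outcome(action, obs).items())

def optimal_action(actions, obs, p_outcome, reward):
    """Equation (1): the action maximising expected utility given observations."""
    return max(actions, key=lambda a: expected_utility(a, obs, p_outcome, reward))

def p_outcome(action, obs):
    # Made-up conditional distribution over the single outcome variable "yield".
    if action == "fertilise":
        return {"high_yield": 0.7, "low_yield": 0.3}
    return {"high_yield": 0.4, "low_yield": 0.6}

reward = {"high_yield": 10.0, "low_yield": 2.0}
best = optimal_action(["fertilise", "no_fertilise"], {"soil": "dry"}, p_outcome, reward)
print(best)  # fertilise (expected utility 7.6 vs 5.2)
```

Enumeration like this is only feasible for tiny state spaces; it is meant to make the semantics of (1) concrete, not to be an efficient solver.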

In our task, the agent faces the extra difficulty of starting unaware of certain actions and concepts (i.e., chance variables) on which the true optimal policy depends. Consequently, it begins learning with an incomplete hypothesis space. It also has an incorrect model of the domain’s causal structure—i.e., there may be missing or incorrect dependencies. Further, the learning agent will start with an incomplete or incorrect reward function. The agent’s learning task, then, is to use evidence to converge on an optimal policy, despite beginning with an initial model defined over an incomplete set of possible states and actions and incorrect dependencies among them.

Since we are interested in interpretable solutions, our approach is that the learning agent uses evidence to dynamically construct an interpretable model of the decision problem from which an optimal policy can be computed. Formally, we treat this task of learning an optimal policy as one of learning a decision network (dn). dns capture preferences with a numeric reward function and beliefs with a Bayes Net, thereby providing a compact representation of all the components in equation (1).

Definition 1 (Decision Network)

A Decision Network (dn) is a tuple $\langle \mathbf{X}, \mathbf{A}, B, R \rangle$, where $\mathbf{X}$ is a set of chance variables, $\mathbf{A}$ is a set of action variables (the agent controls their values), and $R$ is a reward function whose domain is $Dom(\mathbf{X}_R)$ for some $\mathbf{X}_R \subseteq \mathbf{X}$ and whose range is $\mathbb{R}$. (Agents do not have intrinsic preferences over actions, but rather over their outcomes.) $B$ is a Bayes Net defined over $\mathbf{X} \cup \mathbf{A}$. That is, $B$ includes a directed acyclic graph (dag) defining for each $X \in \mathbf{X}$ its parents $Pa(X) \subseteq \mathbf{X} \cup \mathbf{A}$, such that $X$ is conditionally independent of its non-descendants given $Pa(X)$, and $B$ defines for each variable $X \in \mathbf{X}$ its conditional probability distribution $P(X \mid Pa(X))$.
A policy for a dn is a function from the observed portion $\mathbf{o}$ of the current state to an action $\mathbf{a}$ (i.e., $\mathbf{o}$ is a subvector of $\mathbf{s}$). As is usual with dns, any variable whose value depends on the action performed (in other words, any descendant of an action variable) cannot be observed until after performing an action. More formally, $\mathbf{X} = \mathbf{X}_b \cup \mathbf{X}_o$, where the "before" variables $\mathbf{X}_b$ are non-descendants of $\mathbf{A}$, and the "outcome" variables $\mathbf{X}_o$ are descendants of $\mathbf{A}$ (i.e., where $Pa^*$ is the transitive closure of $Pa$, $X \in \mathbf{X}_o$ iff there is an $A \in \mathbf{A}$ such that $A \in Pa^*(X)$). So a policy for the dn is a function from $Dom(\mathbf{O})$ to $Dom(\mathbf{A})$, where $\mathbf{O} \subseteq \mathbf{X}_b$ is the set of observable variables in $\mathbf{X}_b$.
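The before/outcome partition of the chance variables can be sketched as a reachability computation over the dag. The dict-of-children graph encoding and the variable names below are our own illustrative assumptions:

```python
# Sketch of the "before"/"outcome" split: chance variables that are descendants
# of some action variable are outcome variables; the rest are observable before
# acting. The dict-of-children graph encoding and variable names are ours.

def descendants(graph, roots):
    """All nodes reachable from `roots` in `graph` (node -> list of children)."""
    seen, stack = set(), list(roots)
    while stack:
        for child in graph.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def split_chance_vars(graph, chance_vars, action_vars):
    outcome = descendants(graph, action_vars) & set(chance_vars)
    return set(chance_vars) - outcome, outcome

# Barley-flavoured example: fertiliser influences yield, which influences
# quality; precipitation is exogenous.
graph = {"fertiliser": ["yield"], "precipitation": ["yield"], "yield": ["quality"]}
before, outcome = split_chance_vars(graph, ["precipitation", "yield", "quality"],
                                    ["fertiliser"])
print(sorted(before), sorted(outcome))  # ['precipitation'] ['quality', 'yield']
```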

Figure 1(a) shows a dn representation of the barley example from the introduction. The action variables are drawn as rectangles and the chance variables as ovals; the domain of the reward function is the set of parents of the reward node.

Our dn formulation has some similarities to influence diagrams (Howard and Matheson, 2005). In contrast to DNs, influence diagrams allow action nodes to have parents via “information arcs” from chance nodes. While this feature is useful to model multi-stage decision problems, we do not require it here since all action variables are assigned simultaneously.

Our agent will incrementally learn all components of the dn, including its set of random variables, dependencies and reward function. We denote the true dn by $D^*$, and the agent's current model of the dn at time $t$ by $D_t$. A similar convention is used for each component of the dn (so, for instance, $\mathbf{A}_t$ is the set of action variables the agent is aware of at time $t$). Thus, we compute an update function $D_{t+1} = update(D_t, e_{t+1})$, where $e_{t+1}$ is the latest body of evidence, and the dns $D_t$ and $D_{t+1}$ may differ in any respect. Figure 1(b) gives an example of an agent's possible starting model $D_0$ for the barley example.

Notice that the agent's initial model is missing factors that influence the optimal policy defined by the true dn (e.g. it is unaware of the concept of fungus). It is also missing dependencies that are part of the true dn (e.g. it does not think the choice of grain has any influence on the amount of crops that will grow). One usually assumes that all variables in a dn are connected to the utility node, because a variable that is not so connected has no effect on optimal behaviour. So the agent knows that the true dn is so connected, but its set of variables, causal structure and reward function are all hidden and must be inferred from evidence.

(a) True Decision Problem
(b) Agent’s Initial Model
Figure 1: The graphical component of the dn for the “Barley” problem. Action variables are represented by rectangles, chance variables by ovals, and the reward node by a diamond.

We make four main assumptions to restrict the scope of the learning task:

  1. The agent can observe the values of all the variables it is aware of (so the domain of its policy function at time step is ). However, it cannot observe a chance variable’s values at times before it was aware of it, even after becoming aware of it—i.e. the agent cannot re-perceive past domain trials upon discovering an unforeseen factor.

  2. The agent cannot perform an action it is unaware of. Formally, if but , then the fully aware expert perceives the learning agent’s action as entailing (plus other values of other variables in ). This differs from the learning agent’s own perception of its action: is not a part of the agent’s representation of what it just did because it is not aware of ! Once the agent becomes aware of , then knowing that inadvertent actions are not possible, it can infer all its past actions entail . This contrasts with chance variables, whose values at times when the agent was unaware of them will always be hidden. Clearly, this assumption does not hold across all decision problems (e.g. an agent might inadvertently lean on a button, despite not knowing the button exists). If we wished to lift this assumption, we could simply treat action variable unawareness in the same way as we treat chance variable unawareness (as described in Section 4).

  3. The set of random variables in the agent’s initial dn is incomplete rather than wrong: that is, , and . Further, the initial domain of is a subset of its true domain: i.e., . This constraint together with the dialogue strategies in Section 3.2 simplify reasoning: dn updates may add new random variables but never removes them; and may extend the domain of but never retracts it. However, the causal structure and reward function can be revised, not just refined.

  4. The expert has complete knowledge of the actual decision problem but lacks complete knowledge of —the learning agent’s perception of its decision problem at time . Further, we make the expert cooperative—her advice is always sincere, competent and relevant (see Section 3.2 for details).

Any competent agent attempting this learning task should obey two key principles:


Consistency: At all times, the agent's dn should be consistent. That is, its graphical component should be a dag defined over its current vocabulary, its conditional probability distributions should abide by the basic laws of probability, and its reward function should be a well-defined function that captures an asymmetric and transitive preference relation over states.

Satisfaction: Evidence is informative about what can happen (however rarely), not just informative about likelihood. At all times, the agent's dn should satisfy all the possibilities that are entailed by the observed evidence so far.

Consistency is clearly desirable, because anything can be inferred from an inconsistent dn, making any action optimal. Satisfaction is not an issue in traditional approaches to learning optimal policies, because evidence never reveals a possibility that is not within the hypothesis space of the learning agent's initial dn. But in our task, without Satisfaction, the posterior dn may fail to represent an unforeseen possibility that is monotonically entailed by the observed evidence. It would then fail to capture the unforeseen possibility's effects on expected utilities and the optimal policy. Our experiments in Section 5 show that when starting out unaware of factors on which optimal policies depend, a baseline agent that does not comply with Satisfaction performs worse than an agent that does. We will say a dn is valid if it complies with Consistency and Satisfaction; it is invalid otherwise.

A valid dn is necessary, but not sufficient: the agent's dn should not only be valid; in addition, its probabilistic component should reflect the relative frequencies in the domain trials. Section 4 will define a dn update procedure with all of these properties.

The set of valid dns is always unbounded—it is always possible to add a random variable to a valid dn while preserving its validity. So in addition to the above two monotonic constraints on dn update, we adopt two intuitively compelling defeasible principles that make dn update tractable. Indeed, the agent needn’t enumerate all valid dns; instead, dn update uses the defeasible principles to dynamically construct a single dn from evidence:


Minimality: The dn should have minimal complexity. In other words, its random variables discriminate one possible state from another, two states have distinct payoffs, and/or two factors are probabilistically dependent only when evidence justifies this.

Conservativity: The agent should minimise changes to the dn's hypothesis space when observing new evidence.

Minimality is a form of Occam’s razor: make the model as simple as possible while accounting for evidence. Conservativity captures the compelling intuition that you preserve as much as possible of what you inferred from past evidence even when you have to revise the dn to restore consistency with current evidence. Minimality and Conservativity underlie existing symbolic models of commonsense reasoning (Poole, 1993; Hobbs et al., 1993; Alchourrón et al., 1985), the acquisition of causal dependencies (Bramley et al., 2015; Buntine, 1991) and preference change (Hansson, 1995; Cadilhac et al., 2015).

Our final desirable feature is to support active learning: to give the agent some control over the evidence it learns from next, both from the domain trials (Section 3.1) and the dialogue content (Section 3.2).

3 Evidence

We must use evidence to overcome ignorance about how to conceptualise the domain, not just ignorance about likelihoods and payoffs. We regiment this by associating each piece of evidence $e$ with a formula $\varphi_e$ that expresses (partial) information about the true decision problem, where $\varphi_e$ follows monotonically from $e$, in the sense that it would be impossible to observe $e$ unless $\varphi_e$ is true of the decision problem that generated $e$. Thus $\varphi_e$ is a partial description of a dn that must be satisfied by the actual (complete) decision network.

We illustrate this with three examples. First, given the assumptions we made about the relationship between the agent's initial dn and the true dn (i.e., that its chance variables, action variables, and reward domain are subsets of the true ones), the initial model in Figure 1(b) yields the partial description given in (2):


The second example concerns domain trials: suppose the agent is in a state where it experiences the reward and observes that the variable has value and the variable has value . Equation (3) must be true of , where the domain of quantification is ’s atomic states:


Thirdly, suppose the expert advises the agent to apply pesticide. Then, must be true.

We capture this relationship between a complete dn and formulae like (2) and (3) by defining a syntax and semantics of a language for partially describing dns (details are in the Appendix). Each model for interpreting the formulae in this language corresponds to a unique complete dn, and if and only if (partially) describes . Where represents the properties that a dn generating observed evidence must have, (because generated ), although other dns may satisfy too. This relationship between , and enables the agent to accumulate an increasingly specific partial description of as it observes more and more evidence: where is the sequence of evidence and its associated partial description of , observing the latest evidence yields with an associated partial description that is the conjunction of and . The agent will thus estimate a valid dn from evidence (as defined in Section 2) under the following conditions:

  1. captures the necessary properties of a dn that generates

  2. The agent’s model obeys this partial description. That is,

This section describes how we achieve the first condition; Section 4 defines how we achieve the second, with Section 4.2 also ensuring that the probabilistic component reflects the relative frequencies in the domain trials.
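The accumulation of partial descriptions can be rendered as a toy sketch: each observation contributes a constraint that the true dn must satisfy, and a candidate model is checked against the conjunction of everything observed so far. The dict-based model encoding is purely illustrative, not the paper's formal language:

```python
# A toy rendering of accumulating partial descriptions: each observation adds a
# constraint that the true dn must satisfy, and the agent's model is checked
# against the conjunction of everything observed so far. The dict-based model
# encoding is purely illustrative.

class PartialDescription:
    def __init__(self):
        self.constraints = []          # (label, predicate) pairs

    def add(self, label, predicate):
        self.constraints.append((label, predicate))

    def satisfied_by(self, model):
        """True iff `model` satisfies the conjunction of all constraints."""
        return all(pred(model) for _, pred in self.constraints)

desc = PartialDescription()
desc.add("fungicide is an available action", lambda m: "fungicide" in m["actions"])
desc.add("reward depends on quality", lambda m: "quality" in m["reward_domain"])

model = {"actions": ["fertilise", "fungicide"], "reward_domain": ["yield", "quality"]}
print(desc.satisfied_by(model))                                 # True
print(desc.satisfied_by({"actions": [], "reward_domain": []}))  # False
```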

3.1 Domain Evidence:

Domain evidence consists of a set of domain trials. In a domain trial , the agent observes , performs an action , and observes its outcome and reward

. From now on, for notational convenience, we may omit the vector notation, or freely interchange vectors with conjunctions of their values. Each domain trial therefore consists of a tuple:


The domain trial entails the partial description (5) of : in words, there is an atomic state in that entails the observed values , and and which has the payoff .


Formula (5) follows monotonically from because the agent’s perception of a domain trial is incomplete but never wrong: the agent’s random variables have observable values, and they are always a subset of those in thanks to the agent’s starting point and the dialogue strategies (see Section 3.2). Thus, even if the agent subsequently discovers a new random variable , (5) still follows from , regardless of ’s value at time (which remains hidden to the agent). But on discovering , tuples in (4) get extended—the agent will observe ’s value in subsequent domain trials. Thus in contrast to standard domain evidence, the size of the tuples in (4) is dynamic. We discuss in Section 4 how the agent copes with these dynamics. The expert also keeps a record of the domain trials; these influence her dialogue moves (see Section 3.2). Her representation of each trial is at least as specific as the agent’s because she is aware of all the variables—so the size of the tuples in is static.

Domain trials can reveal to the agent that its conceptualisation of the domain is deficient. If there are two trials in with the same observed value for (i.e., the current estimated domain of the reward function) but the rewards are different, then this entails that is invalid. If contains two domain trials with the same observed values for every chance variable of which the agent is currently aware, but the rewards are different, then the vocabulary is invalid. Section 4 will define how the agent detects and learns from these circumstances.
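Both invalidity tests reduce to searching the trial history for pairs of trials that agree on a chosen set of variables but disagree on reward. A minimal sketch, with an illustrative (state dict, reward) trial encoding of our own:

```python
# The two invalidity tests above reduce to finding trial pairs that agree on a
# chosen set of variables but disagree on reward. Trial encoding (state dict,
# reward) is an illustrative assumption.

def find_conflicts(trials, vars_to_compare):
    """Pairs of trials matching on `vars_to_compare` but with different rewards."""
    conflicts = []
    for i, (state_i, r_i) in enumerate(trials):
        for state_j, r_j in trials[i + 1:]:
            if r_i != r_j and all(state_i[v] == state_j[v] for v in vars_to_compare):
                conflicts.append(((state_i, r_i), (state_j, r_j)))
    return conflicts

trials = [
    ({"yield": "high", "quality": "good"}, 10.0),
    ({"yield": "high", "quality": "bad"}, 3.0),
]
# If reward is thought to depend only on yield, the conflict shows the
# estimated reward domain is too small:
print(len(find_conflicts(trials, ["yield"])))             # 1
# Comparing on all known variables finds no conflict, so the vocabulary is
# not (yet) refuted:
print(len(find_conflicts(trials, ["yield", "quality"])))  # 0
```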

The agent’s strategy for choosing an action mixes exploitation and exploration in an -greedy approach: In a proportion of the trials, the agent chooses what it currently thinks is an optimal action. In the remainder, the agent chooses an action at random.

3.2 Dialogue Evidence:

The interaction between the agent and the expert consists of the agent asking questions that the expert then answers, and unsolicited advice from the expert. All the dialogue moves are about , and the signals are in the formal language for partially describing dns (see the Appendix), but with the addition of a sense ambiguous term, which we motivate and describe shortly. The agent’s and expert’s lexica are different, however: the agent’s vocabulary lacks the random variables in that it is currently unaware of, and so the expert’s utterances may feature neologisms. As we said earlier, we bypass learning how to ground neologisms (but see Larsson (2013); Forbes et al. (2015); Yu et al. (2016)) by assuming that once the agent has heard a neologism, it can observe its denotation in all subsequent domain trials.

The sense ambiguous term is : its intended denotation is what the expert observes about the state at time before the agent acts—i.e., . But this denotation is hidden to the agent, whose default interpretation is projected onto —i.e., it is restricted by the agent’s conceptualisation of the domain at time . The expert uses to advise the agent of a better action than the one it performed at . We will see in Section 3.2.1 that using in the expert’s signal minimises its neologisms, which makes learning more tractable. But hidden messages may create misunderstandings; Section 3.2.2 describes how the agent detects and learns from them.

The agent and expert both keep a dialogue history: one for the agent and one for the expert. Each utterance in a dialogue history is a tuple consisting of the speaker (i.e., either the expert or the learning agent), the signal, and its (default) interpretation. Since a signal may be misinterpreted, the agent's and the expert's interpretations may differ, and so the agent's record is not equivalent to the expert's. What follows monotonically is the information about the true dn that the signal entails whatever the denotation of the ambiguous term might be; we will specify this for each signal in Section 3.2.1.

3.2.1 The expert’s dialogue strategy

As noted earlier, the expert’s dialogue strategy is Cooperative: the message that she intends to convey with her signal is satisfied by the actual decision problem , so that . This makes her sincere, competent and relevant. In many realistic scenarios, this assumption may be untrue (a human teacher, for example, might occasionally make mistakes). We intend to explore relaxations of this assumption in future work.

Further, the expert’s dialogue strategy limits the amount of information she is allowed to send in each signal. There are two motivations for limiting the amount of information. The first stems from the definition itself of our learning task; and the second is practical. First, recall from Section 1 that we allow the expert advice to occur piecemeal, with signals being interleaved among the learner’s domain trials. This is because a major aim of our learning task is to reflect the kind of teacher-apprentice learning seen between humans, where the teacher only occasionally interjects to say things that relate to the learner’s latest attempts to solve the task. There are many tasks where an expert may be incapable of exhaustively expressing everything they know about the problem domain, but rather can only express relevant information in reaction to specific contingencies that they experience.

The second, more practical motivation is to make learning tractable. The number of possible causal structures is hyperexponential in the number of random variables (Buntine, 1991). Prior work utilises defeasible principles such as Conservativity (Bramley et al., 2015) and Minimality (Buntine, 1991) to make inferring causal structure tractable. However, if the expert's signal features a set of variables that the agent was unaware of, then the agent must add each of those variables to its causal structure, and moreover, by Consistency and Satisfaction, each of them must be connected to the dn's utility node. But the prior model did not feature these variables at all, and so the number of maximally conservative and minimal updates that satisfy this connectedness is hyperexponential in the number of newly discovered variables. Thus an expert utterance with many neologisms undermines the efficiency and incrementality of learning.
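The hyperexponential growth can be made concrete by counting labelled dags with Robinson's recurrence (a standard combinatorial result, quoted here only for illustration, not taken from the paper):

```python
from math import comb

# Robinson's recurrence for the number of labelled dags on n nodes (a standard
# combinatorial result, quoted here only to make "hyperexponential" concrete):
# a(n) = sum_{k=1..n} (-1)^(k+1) * C(n,k) * 2^(k(n-k)) * a(n-k), with a(0) = 1.

def count_dags(n, _cache={0: 1}):
    if n not in _cache:
        _cache[n] = sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k))
                        * count_dags(n - k) for k in range(1, n + 1))
    return _cache[n]

print([count_dags(n) for n in range(1, 6)])  # [1, 3, 25, 543, 29281]
```

Already at ten nodes the count exceeds 4 × 10¹⁸, which is why unconstrained structure search over a growing vocabulary is infeasible.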

We avoid this complexity by restricting every expert signal to contain at most one neologism. Specifically, the expert must have conclusive evidence that the agent is aware of all but (at most) one of the random variables that feature in her signal. The expert knows the agent is aware of a variable if: (a) it has already been mentioned in the dialogue, either by her or by the agent; or (b) it is an action variable and the agent has performed its positive value (recall that inadvertent action is not possible):


Here, and respectively are the set of chance and action variables that the expert knows the agent is aware of. The following principle, which we refer to as 1N, applies to all the expert’s signals:

At Most One Neologism (1N):

Each expert signal features at most one variable from (i.e., at most one neologism). Furthermore, if features such a variable , then declares its type: i.e., includes the conjunct , or , as appropriate.

The expert uses an ambiguous term to comply with Cooperativity and 1N in contexts where, without , she would be unable express any advice. For instance, assume that the expert’s knowledge of the agent’s “before” vocabulary is given by . The expert would typically be unable to express that, given , the agent should have performed an alternative action to the action that the agent actually performed—by 1N, she cannot use the vector in her signal if . If instead she were to use the vector , where (i.e., projected onto ), then the resulting statement (7) may be false (and so violate Cooperativity), because these expected utilities marginalise over all possible values of , rather than using their actual values:


Alternatively, by replacing in the signal (7) with the ambiguous term , her intended message becomes hidden but it abides by both Cooperativity and 1N.

The expert’s dialogue policy is to answer all the agent’s queries as and when they arise, and to occasionally offer unsolicited advice about a better action. She does the latter when two conditions hold:

  1. The agent has been behaving sufficiently poorly to justify the need for advice

  2. The current context is one where she can express a better option while abiding by Cooperativity and 1N.

Condition (i) is defined via two parameters and : in words, the last piece of advice was offered greater than time steps ago, and from then until now, the fraction of suboptimal actions taken by the agent is greater than . This is formalised in equation (8), where is the time of the last advice and is the optimal action given and . In the experiments in Section 5, we vary and to test how changing the expert’s penchant for offering unsolicited advice affects the learning agent’s convergence on optimal policies. Condition (ii) for giving advice is satisfied when the observed reward is no higher than the expected payoff from the agent’s action (equation (9)), and there is an alternative action with a higher expected payoff, which can be expressed while complying with Cooperativity and 1N (equation (10)).
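Condition (i) can be sketched as a simple check; the parameter names `min_gap` and `threshold` below are our placeholders for the paper's two parameters:

```python
# Sketch of the expert's trigger for unsolicited advice (condition (i)): the
# last advice was more than `min_gap` steps ago and the fraction of suboptimal
# actions since then exceeds `threshold`. Both parameter names are placeholders
# for the paper's two parameters.

def should_advise(t, last_advice_t, suboptimal_flags, min_gap, threshold):
    """`suboptimal_flags` holds one bool per step since the last advice."""
    if t - last_advice_t <= min_gap or not suboptimal_flags:
        return False
    return sum(suboptimal_flags) / len(suboptimal_flags) > threshold

# 8 steps since the last advice, 6 of them suboptimal:
flags = [True, True, False, True, True, True, False, True]
print(should_advise(t=20, last_advice_t=12, suboptimal_flags=flags,
                    min_gap=5, threshold=0.5))  # True
```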


When the context satisfies these conditions, then there are witness constants and that satisfy (10). These constants are used to articulate the advice: the expert utters (11), where is the expression formed by projecting the vector onto .


In our Barley example, the message (11) might be paraphrased as: in the current circumstances, it would have been better to apply pesticide and not use fertiliser than to not apply pesticide and use fertiliser.

The agent’s default interpretation of (11) is (12), where :


So is added to . While the intended message of (11) is true (because (10) is true), (12) may be false, and so no monotonic entailments about probabilities can be drawn from it. For example, in our Barley example, suppose that the agent’s current model of the domain is Figure 1b, and the agent has observed soil type and precipitation . Then the agent’s defeasible interpretation of the expert’s advice is that it would have been better to apply pesticide () and not use fertiliser () (than to not apply pesticide and use fertiliser) in any state where is true (note that relative to the dn shown in Figure 1b, the expert mentioning applying pesticide leads it to discover this entirely new action):


But this (defeasible) interpretation could be false: the expert’s (true) intended message may have been that is better in a much more specific situation: one where not only is true, but also the local concern is low and the insect prevalence is high (in other words, the probabilities in (13) should have been conditioned on and as well). However, (11) and the mutually known dialogue policy monotonically entails (14), whatever the true referent for might be.


So (14) is added to , and for to be valid it must satisfy it; similarly for . (Footnote 3: our learning algorithms in Section 4 do not exploit the (defeasible) information about likelihood that is expressed in (12); that is a matter for future work.) In our example, the agent adds (15) to :


Thus the expert’s unsolicited advice can result in the agent discovering an unforeseen action term (in this case, adding pesticide) and/or prompt a revision to the reward function , which in turn may reveal to the agent that its conceptualisation of the domain is deficient (just as (5) may do).

3.2.2 The agent’s dialogue strategy

The agent aims to minimise the expert’s effort during the learning process, and so asks a question only when dn update fails to discriminate among a large number of dns. In this paper, we identify three such contexts (as shown in Figure 2): (i) Misunderstandings; (ii) Unforeseen Rewards; and (iii) Unknown Effects. We now describe each of these in turn.

[Figure 2 flowchart: dialogue evidence and domain evidence feed into dn update; the test “misunderstanding?” leads to “ask (16)”; the process “estimate , , , ” is followed by the test “unforeseen rewards?”, which leads to “ask (17)”; the test “unknown effects?” leads to “ask (18)”.]

Figure 2: Information flow when updating a dn with the latest evidence. Diamonds are tests and rectangles are processes.

Misunderstandings

The agent checks whether the (default) interpretation of the current signal is consistent with those of prior signals; when they are inconsistent, the agent knows there has been a misunderstanding (because of Cooperativity). For instance, suppose —so the agent assumes for all —and the expert advised and . The agent’s default interpretations of these signals fail the consistency test. Thus the agent infers that it is unaware of a variable, but does not know its name. If the agent were to guess what variable to add, then learning would need to support retracting it on the basis of subsequent evidence, or reasoning about whether it is identical to some factor the agent subsequently becomes aware of. This is a major potential complexity in reasoning and learning. We avoid it by defining a dialogue strategy that ensures the agent’s vocabulary is always a subset of the true vocabulary. Here, that means the agent asks the expert for the name of a variable. In words, question (16) expresses: what is different about the state before I acted just now and the state before I acted time steps ago? (Footnote 4: strictly speaking, answers to this question are partial descriptions of the domain trials as well as . The formal details are straightforward but tedious, so we gloss over them here.)
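The consistency test can be sketched as a satisfiability check over the agent’s current vocabulary. This is a toy version under strong simplifications: all names are hypothetical, and each signal’s default interpretation is modelled as a bare partial assignment of boolean variables, whereas the paper’s actual interpretations are probabilistic statements about expected rewards. A misunderstanding is flagged when no state satisfies all interpretations at once.

```python
from itertools import product

def consistent(constraints, variables):
    """Return True if some assignment of the variables satisfies every
    constraint.  Each constraint is a dict mapping a variable name to the
    boolean value that the (default) interpretation requires."""
    for values in product([False, True], repeat=len(variables)):
        state = dict(zip(variables, values))
        if all(all(state[v] == val for v, val in c.items()) for c in constraints):
            return True
    return False

# Two pieces of advice whose default interpretations clash: one requires
# X=True in the relevant states, the other requires X=False.
signals = [{"X": True}, {"X": False}]
print(consistent(signals, ["X", "Y"]))  # False -> misunderstanding detected
```

When the check fails, the agent knows (monotonically, via Cooperativity) that some hidden variable exists, which is exactly the trigger for question (16).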


This signal uses a standard representation for questions (Groenendijk and Stokhof, 1982): is an operator that maps a function to a set of propositions (the true answers), and the function’s arguments correspond to the wh-elements of the question (which, what etc). Its semantics, given in the Appendix, follows (Asher and Lascarides, 2003): a proposition is a true answer if and only if it substitutes specific individual(s) for the -term(s) to create a true proposition—so “something” is not an answer but “nothing” is an answer, corresponding to the empty set. An answer can thus be non-exhaustive—it needn’t include all the referents that satisfy the body of the -expression.

The agent’s policy for when to ask (16) guarantees that it has a true positive answer. By the 1N rule, the expert’s answer includes one variable, chosen at random if there is a choice. E.g., she answers with the signal ; so is added to . In our barley example, the answer might be, for instance, “the temperature was high”, where the concept “temperature” is a neologism to the learning agent.

The agent now knows that its default interpretations of prior signals that feature () are incorrect—on learning , it now knows that these terms do not denote . But the agent cannot observe the correct interpretation. Testing whether subsequent messages are consistent with prior messages would thus involve reasoning about past hidden values of , which would be complex and fallible. For the sake of simplicity, we avoid it: if the agent discovers a new -variable at time , then she ceases to test whether subsequent messages are consistent with past ones (). In effect, the agent “forgets” past signals and their (default) interpretations, but does not forget their monotonic consequences, which are retained in .

Unforeseen rewards

The current reward function’s domain, , may be too small to be valid (see Sections 2 and 3.1). If the agent were to guess what variable to add to , then learning would need to support retracting it on the basis of subsequent evidence—this is again a major potential complexity in reasoning and learning. Instead, when the agent infers is too small (we define how the agent infers this in Section 4), the agent seeks monotonic evidence for fixing it by asking the expert (17):


For instance, in the barley domain, this question could express something like “Other than having a high yield and high protein crops, what else do I care about?” to which the answer might be “You care about bad publicity”. This is a true (non-exhaustive) answer, even though the agent (also) cares about avoiding outbreaks of fungus.

Non-exhaustive answers enable the expert to answer the query while abiding by Cooperativity and the 1N rule. But they also generate a choice on which answer to give. The expert’s choice is driven by a desire to be informative: she includes in her answer all variables in that she knows the agent is aware of (i.e., the variables ) and one potential neologism (i.e., ) if it exists, with priority given to a variable whose value is different in the latest trial from a prior trial that had the same values as on all the variables mentioned in the query (17) but a different reward. If her answer includes a potential neologism, then by 1N she declares its type.

For example, let and . Suppose that but . Now suppose that the latest trial entails and reward of 1, but a prior trial entails and a reward of 0.5. Then even if has different values in these trials, the agent does not infer (defeasibly) that . Rather, she asks (17): i.e., . The expert’s answer includes because . In addition, she can mention either or , but not both (because of 1N). Suppose that entails and a prior trial with a different reward entails . However, lacks this property. Then her answer is: .

The ambiguous term is not a part of the expert’s answer, and so the message is observable—indeed, it is the same as the signal. This becomes a conjunct in , and so in our example, for to be valid it must satisfy and (and so and if truly was a neologism to the agent then as well). Section 4 defines how to estimate a valid from this evidence.

Unknown Effects

There are contexts where the search space of possible causal structures remains very large in spite of the defeasible principles for restricting it, in which case the agent asks a question whose answer will help to restrict the search space:


In words, what does affect? For instance, in the barley domain, it might express: What does the temperature affect? (to which an answer might be the risk of weeds). In Section 4 we will define precisely the contexts in which the agent asks this question, including which variable it asks about. The expert’s answer gives priority to a variable that she believes the agent is unaware of. This increases the chances that the agent will learn potentially valuable information about the hypothesis space.

The expert’s answer has an observable interpretation; is added to and since , . Thus some dependencies in are inferred monotonically via observable expert messages. Others are inferred defeasibly via statistical pattern recognition in the domain trials (see Section 4.2).


4 The Model for Learning

We now define dn update in a way that meets the criteria from Section 2. We must estimate all components of the dn from the latest evidence , in a way that satisfies the partial description that has accumulated so far (this is required to make the dn valid). That is, one must estimate the set of random variables and , the domain of the reward function as well as the function itself, the dependencies among , and the conditional probability tables (cpts) , given . These components are estimated from the prior dn , the latest evidence and the partial description in the following order, as shown in Figure 2:

    1. Estimate , , and ;

    2. Estimate , given :

        a. Estimate , given and ;

        b. Estimate , given .

Step 1 proceeds via constraint solving (see Section 4.1); step 2 via a combination of symbolic and statistical inference (see Section 4.2). Update proceeds in this order because the set of dependencies one must deliberate over depends on the set of random variables it is defined over, and the global constraint on dependencies—namely, that the utility node is a descendant of all nodes in the dn—must be defined in terms of the domain of the reward function. Likewise, the set of cpts that one must estimate is defined by the dependency structure . The search for a valid dn can prompt backtracking, however: e.g., failure to derive a valid dependency structure may ultimately lead to a re-estimate of the set of random variables . We now proceed to describe in detail each of these components of dn update.
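The ordering and backtracking just described can be sketched as a simple control loop. This is a schematic under strong simplifying assumptions: `estimate_vocab` and `estimate_structure` stand in for the constraint-solving and statistical steps of Sections 4.1 and 4.2, `ask_expert` stands in for the dialogue strategy, and all names are hypothetical.

```python
def update_dn(evidence, ask_expert, estimate_vocab, estimate_structure):
    """Estimate vocabulary/reward first, then structure; when either step
    fails, ask the expert and backtrack to step 1 with the richer evidence."""
    while True:
        vocab = estimate_vocab(evidence)
        if vocab is None:                       # unforeseen rewards: ask (17)
            evidence = evidence + [ask_expert("reward")]
            continue
        structure = estimate_structure(vocab, evidence)
        if structure is None:                   # unknown effects: ask (18)
            evidence = evidence + [ask_expert("effects")]
            continue
        return vocab, structure

# Toy instantiation: the vocabulary estimate fails until the expert names
# the missing variable "Temperature".
def est_vocab(ev):
    return {"X", "Temperature"} if "Temperature" in ev else None

def est_struct(vocab, ev):
    return [("X", "U")]                         # trivial structure

vocab, structure = update_dn(["X"], lambda kind: "Temperature",
                             est_vocab, est_struct)
print(sorted(vocab))  # ['Temperature', 'X']
```

The loop makes explicit why estimation proceeds in this order: a failure at a later step can only be repaired by revising an earlier one.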

4.1 Random Variables and Reward Function

The first step in dn update is to identify ’s random variables—that is, the sets , and —and the reward function . This is achieved via constraint solving, with the constraints provided by .

The number of valid vocabularies is always unbounded, because any superset of a valid vocabulary is valid. As motivated in Section 2, we make search tractable via greedy search for a minimal valid vocabulary: the agent (defeasibly) infers the vocabulary in (19a–c) (this covers all the variables the agent is aware of thanks to the dialogue strategy from Section 3.2), and also defeasibly infers the minimal domain for , defined in (19d).


The evidence described in Section 3 yields three kinds of formulae in that (partially) describe : (5), (14), and . The agent uses constraint solving to find, or fail to find, a reward function that satisfies all conjuncts in of the form (5) and (14), plus the constraint (20), which will ensure is well-defined with respect to its (estimated) domain (19d).


Constraints of the form (5) and (14) are skolemized and fed into an off-the-shelf constraint solver (with replaced with ). The possible denotations of these skolem constants are the atomic states defined by the vocabulary (19a–c). If there is a solution, the constraint solver returns for each skolem constant a specific denotation .

Substituting the skolem terms with their denotations projected onto yields equalities (from (5)) and inequalities (from (14)), where . A complete function is constructed from this partial function by defaulting to indifference (recall Minimality): for any where there is an inequality but no equality , we set for some constant (in our experiments, ); for any for which there are no equalities or inequalities, we set .
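The completion-by-indifference step can be sketched as follows. All names here are hypothetical, and `delta` plays the role of the constant mentioned above (its actual value in the experiments is not fixed here): states with an observed equality keep that reward; a state known only to be better than another gets that state’s reward plus `delta`; everything else defaults to 0.

```python
def complete_reward(equalities, inequalities, states, delta=1.0):
    """Build a total reward function from partial evidence (Minimality)."""
    R = dict(equalities)                       # observed R(s) = r
    for better, worse in inequalities:         # observed R(better) > R(worse)
        if better not in R and worse in R:
            R[better] = R[worse] + delta
    for s in states:                           # default to indifference
        R.setdefault(s, 0.0)
    return R

eq = {("s1",): 1.0}
ineq = [(("s2",), ("s1",))]                    # s2 strictly better than s1
R = complete_reward(eq, ineq, [("s1",), ("s2",), ("s3",)])
print(R)  # {('s1',): 1.0, ('s2',): 2.0, ('s3',): 0.0}
```

Any function built this way satisfies the equalities and the listed inequalities, which is all that validity requires of it.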

The constraint solver may yield no solution: i.e., there is no function with the currently estimated domain (19d) satisfying all observed evidence about states and their rewards. This is the context “unforeseen rewards” described in Section 3.2 (see also Figure 2): the agent asks (17), deferring dn update until it receives the expert’s answer. The expert’s answer is guaranteed to provide a new variable to add to . It may also be a neologism—a variable the agent was unaware of—and so after updating with the expert’s answer, the agent backtracks to re-compute (19a–d).

4.2 Estimating

Current approaches to incrementally learning in a graphical model of belief exploit local inference to make the task tractable. There are essentially two forms of local inference.

The first is a greedy local search over a full structure: remove an edge from the current dag or add an edge that does not create a cycle, then test whether the result has a higher likelihood given the evidence (e.g., Bramley et al. (2015); Friedman and Goldszmidt (1997)). This is Conservative: changes only when evidence justifies it. However, adapting it to our task is problematic. Firstly, such techniques rely heavily on a decent initial dag to avoid getting stuck in a local maximum; but in our task the agent’s initial unawareness of the possibilities makes a decent initial dag highly unlikely. Secondly, removing an edge can break the global constraint on dns that all nodes connect to the utility node. We would need to add a third option when doing local search given : replace one edge in with another edge somewhere else. This additional option expands the search space considerably.

We therefore adopt the alternative form of local inference: assume conditional independence among parent sets. Buntine (1991) assumes that ’s parent set is conditionally independent of ’s given evidence and the total temporal order over the random variables— means that may be a parent to but not vice versa. This independence assumption on its own is not sufficient for making reasoning tractable, however. If a Bayes Net has 21 variables (as the ones we experiment with in Section 5 do), then a variable may have possible parent sets—a search space that is too large to be manageable. So Buntine (1991) prunes ’s possible parent sets to those that evidence so far makes reasonably likely (we will define “reasonably likely” in a precise way shortly). There are then two alternative ways of updating:

  • Parameter Update: Estimate the posterior probability of a parent set from its prior probability and the latest evidence, under the assumption that the set of reasonable parent sets does not change.

  • Structural Update: Review and potentially revise which parent sets are reasonable, given a batch of evidence. (Thus, Structural Update changes the set of possible structures which are considered.)

We adapt Buntine’s model to our task in two ways. Firstly, Buntine’s model assumes that the total ordering on variables is known. In our task the total order is hidden, and marginalising over all possible temporal orders is not tractable—it is exponential in the size of the vocabulary. We therefore make an even stronger initial independence assumption than Buntine when deciding which parent sets are reasonable: specifically, that ’s parent set is conditionally independent of ’s given evidence alone. Unfortunately, this allows combinations of parent sets with non-zero probabilities to be cyclic. We therefore add a step that greedily searches over the space of total orderings (similar to (Friedman and Koller, 2003)), using Integer Linear Programming (Vanderbei, 2015) at each step to find a “most likely” causal structure which both obeys the currently proposed ordering and is a valid dn—in particular, a dag in which the utility node is a descendant of all other nodes.

Secondly, as the agent’s vocabulary of random variables expands, we need to provide new probability distributions over the larger set of possible parent sets, which in turn will get updated by subsequent evidence. We now describe each of these components of the model in turn.

4.2.1 Parameter Update

Each variable is associated with a set of reasonable parent sets. Each parent set is some combination of variables from . Parameter Update determines the posterior distribution over given its prior distribution and the latest piece of evidence under the assumption that the possible values of do not change.

Updates from Domain Trials

Suppose that the latest evidence is a domain trial (i.e., ). Parameter Update uses Dirichlet distributions to support incremental learning: if and , then we can calculate the posterior probability of in a single step using (21):


Here, is the number of trials in where and , and is a “pseudo-count” which represents the Dirichlet parameter for and . The sum of all trials where is given by . Formula (21) follows from the recursive structure of the function in the Dirichlet distribution (the Appendix provides a derivation, which corrects an error in (Buntine, 1991)).

Estimating , given , likewise exploits the Dirichlet distribution (Buntine, 1991, p56):


In words, (22) computes the conditional probability tables (cpts) directly from the counts in the trials and the appropriate Dirichlet parameters, which in turn quantify the extent to which one should trust the counts in the domain trials for estimating likelihoods—the higher the value of the s relative to the s, the less the counts influence the probabilities. Note that the -parameters vary across the values of the variables and their (potential) parent sets. We motivate this shortly, when we describe how to perform dn update when a new random variable needs to be added to it. At the start of the learning process, for all , and .
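The posterior-mean estimate in (22) can be illustrated with a small sketch (the representation is hypothetical: counts and Dirichlet parameters as dictionaries keyed by variable value, for one fixed parent configuration).

```python
def cpt_estimate(counts, alphas):
    """Posterior mean of one cpt column under a Dirichlet prior:
    (count + alpha) / (total count + total alpha).  The larger the
    alphas relative to the counts, the less the counts move the
    estimate away from the prior."""
    total_n = sum(counts.values())
    total_a = sum(alphas.values())
    return {v: (counts[v] + alphas[v]) / (total_n + total_a) for v in counts}

# 8 of 10 trials had X=1 in this parent configuration; a uniform
# Dirichlet(1, 1) prior smooths the raw frequency 0.8 towards 0.5.
print(cpt_estimate({0: 2, 1: 8}, {0: 1.0, 1: 1.0}))  # {0: 0.25, 1: 0.75}
```

Because the Dirichlet is conjugate to the multinomial, the same counts also support the single-step incremental update of (21): adding one trial just increments one count.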

Updates from Expert Evidence

Now suppose is an expert utterance , but not one that introduces a neologism (we discuss dn update over an expanded vocabulary at the end of this section). Then Parameter Update starts by using -elimination on (i.e., the partial description of that entails) to infer a conjunction of all conjuncts in of the form . Note that does not contain conjuncts of the form (see Section 3.2), although it may entail such formulae (e.g., from conjuncts declaring a variable’s type). Formula (23) then computes a posterior distribution over , where is a normalising factor:


Equation (23) is Conservative—it preserves the relative likelihood of parent sets that are consistent with . It offers a way to rapidly and incrementally update the probability distribution over with at least some of the information that is revealed by the latest expert evidence.
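The effect of (23), zeroing out parent sets that are inconsistent with an expert declaration and renormalising the survivors, can be sketched as follows (parent sets are represented as frozensets; the names are hypothetical).

```python
def condition_on_expert(dist, required_parent):
    """Zero out parent sets inconsistent with an expert declaration that
    ``required_parent`` is a parent, then renormalise.  This is
    Conservative: surviving parent sets keep their relative likelihoods."""
    post = {ps: p for ps, p in dist.items() if required_parent in ps}
    z = sum(post.values())
    return {ps: p / z for ps, p in post.items()}

# Prior over reasonable parent sets for one variable; the expert then
# declares that "A" is a parent.
dist = {frozenset(): 0.2, frozenset({"A"}): 0.3, frozenset({"A", "B"}): 0.5}
print(condition_on_expert(dist, "A"))
```

Note that the surviving sets {A} and {A, B} keep their 3:5 ratio (0.375 vs 0.625), which is exactly the Conservative property.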

Enforcing global constraints on

Equations (21) and (23) on their own comply with neither Satisfaction nor even Consistency. If we naively construct the “most likely” global structure by simply picking the highest-probability parent set for each variable independently, their combination might be cyclic, or might violate information about parenthood entailed by .

To find a global structure which is valid as well as likely, we combine ILP techniques with a greedy local search over the space of total orderings.

We first compute from , where is a total temporal order over ’s random variables that satisfies the partial order entailed by . Next, we use ILP techniques to determine the most likely structure which obeys the current ordering , and is a dag with all nodes connected to the utility node. We stop searching when all valid total orders formed via a local change to yield a less likely structure (i.e., ), and return as . We now describe these steps in detail.
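A simplified version of the order-constrained structure selection can be sketched without the ILP step: for a fixed total order, each variable independently takes its most probable order-consistent parent set. This omits the global constraint that every node be connected to the utility node (which is what the ILP step enforces), and all names are hypothetical.

```python
def best_structure(order, parent_dists):
    """For each variable, pick its most probable parent set among those
    consistent with the total order (all parents precede the child).
    Order-consistency guarantees the result is acyclic."""
    pos = {v: i for i, v in enumerate(order)}
    structure = {}
    for v, dist in parent_dists.items():
        legal = {ps: p for ps, p in dist.items()
                 if all(pos[u] < pos[v] for u in ps)}
        structure[v] = max(legal, key=legal.get)
    return structure

dists = {
    "B": {frozenset(): 0.4, frozenset({"A"}): 0.6},
    "A": {frozenset(): 0.7, frozenset({"B"}): 0.3},
}
print(best_structure(["A", "B"], dists))
# {'B': frozenset({'A'}), 'A': frozenset()}
```

Under the order A < B, the set {B} is illegal for A, so A keeps its empty parent set even though B happily takes A as a parent; cycles are ruled out by construction.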

The partial description imposes a partial order on the random variables, thanks to the expert’s declarations of the form (this corresponds to condition (i) in Fact 2), its entailments about each variable’s type (i.e., , or ), which are in effect constraints on the relative position of variables within the dag (conditions (ii)–(iv) in Fact 2); and information about whether a variable is an immediate parent to the utility node (condition (v)):

Fact 2

Total orders that satisfy
A total order of random variables satisfies the partial order entailed by iff it satisfies the following five conditions:

  1. If , then ;

  2. If then for any , and for any where ;

  3. If then there is a variable where and ;

  4. If then ; and

  5. for any , if and , then .

Note that thanks to and the dialogue strategies, where is , , , or , one can test whether via -elimination.
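Condition (1) of Fact 2, that every declared parent precedes its child, can be checked directly against a candidate total order. A minimal sketch with hypothetical names (the other four conditions, which concern variable types and the utility node, are omitted):

```python
def satisfies_precedence(order, parent_facts):
    """Check condition (1) of Fact 2: every declared parent must come
    strictly before its child in the proposed total order."""
    pos = {v: i for i, v in enumerate(order)}
    return all(pos[p] < pos[c] for p, c in parent_facts)

facts = [("Soil", "Yield"), ("Pesticide", "Yield")]
print(satisfies_precedence(["Soil", "Pesticide", "Yield"], facts))  # True
print(satisfies_precedence(["Yield", "Soil", "Pesticide"], facts))  # False
```

A total order failing this check can be discarded immediately, before any likelihood is computed for it.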

The agent starts by choosing (at random) a total order that satisfies Fact 2. The agent then uses (24) to estimate the probabilities over direct parenthood relations ( means is a parent to ) given the evidence and :


Thanks to Fact 2, this makes any combination of non-zero probability parenthood relations a dag with no -variable being a descendant of an variable, and with all variables having no parents. But it does not guarantee that all nodes are connected to the utility node. We use an ILP step to impose this global constraint. The result is a valid structure (i.e., it satisfies ) that evidence so far also deems to be likely (we give an outline proof of its validity on page 3).

The ILP Step is formally defined as follows:

Decision variables:

Where , is a Boolean variable with value 1 if is a parent of (i.e., ), and 0 otherwise.

Objective Function:

We want to find the most likely combination of valid parenthood relations: i.e., we want to solve (25):


C1 ensures three things: (i) variables cannot have a parent which is not present in at least one of the reasonable parent sets (see (24)); (ii) parent sets obey all expert declarations of the form (by equation (23)); and (iii) parents obey the currently proposed temporal ordering