Theory-based Causal Transfer: Integrating Instance-level Induction and Abstract-level Structure Learning

11/25/2019 ∙ by Mark Edmonds, et al.

Learning transferable knowledge across similar but different settings is a fundamental component of generalized intelligence. In this paper, we approach the transfer learning challenge from a causal theory perspective. Our agent is endowed with two basic yet general theories for transfer learning: (i) a task shares a common abstract structure that is invariant across domains, and (ii) the behavior of specific features of the environment remains constant across domains. We adopt a Bayesian perspective of causal theory induction and use these theories to transfer knowledge between environments. Given these general theories, the goal is to train an agent by interactively exploring the problem space to (i) discover, form, and transfer useful abstract and structural knowledge, and (ii) induce useful knowledge from the instance-level attributes observed in the environment. A hierarchy of Bayesian structures is used to model abstract-level structural causal knowledge, and an instance-level associative learning scheme learns which specific objects can be used to induce state changes through interaction. This model-learning scheme is then integrated with a model-based planner to achieve a task in the OpenLock environment, a virtual “escape room” with a complex hierarchy that requires agents to reason about an abstract, generalized causal structure. We compare performance against a set of predominant model-free reinforcement learning (RL) algorithms. The RL agents showed poor ability to transfer learned knowledge across trials, whereas the proposed model revealed performance trends similar to those of human learners and, more importantly, demonstrated transfer behavior across trials and learning situations.




1 Introduction

The ability of agents to learn and reuse knowledge is a fundamental characteristic of general intelligence and is essential for agents to succeed in novel circumstances [15]. Humans demonstrate a remarkable ability to transfer causal knowledge between environments governed by the same underlying mechanics, in spite of observational changes to the features of the environment [6]. Early psychological research framed causal understanding as learning stimulus-response relationships through observation in classical conditioning experimental paradigms [27, 23]. However, more recent studies show human understanding of causal mechanisms in the distal world is more complex than covariation between observed (perceptual) variables [13]; e.g., humans explore and experiment with dynamic physical scenarios to refine causal hypotheses [3, 29].

Figure 1: (a) Starting configuration of a 3-lever OpenLock room. The arm can interact with levers by either pushing outward or pulling inward, achieved by clicking either the outer or inner regions of the levers’ radial tracks, respectively. Light gray levers are always locked; however, this is unknown to agents. The door can be pushed only after being unlocked. The green button serves as the mechanism to push on the door. The black circle on the door indicates whether or not the door is unlocked; locked if present, unlocked if absent. (b) Pushing on a lever. (c) Opening the door.

Since the associative account, researchers have demonstrated that humans uncover causal relationships through the discovery of abstract causal structure [32] and causal strength [4]. Simultaneously, causal graphical models and Bayesian statistical inference have been developed to provide a general representational framework for how causal structure and strength are discovered [9, 10, 30, 2, 1, 13]. Under such a framework, causal connections encode a structural model of the world. States represent some status in the world, and connections between states imply the presence of a causal relationship. However, a critical component in causal learning is active interaction with the physical world, based on whether perceived information matches predictions from causal hypotheses. In this work, we combine causal learning (a form of model-building) with a model-based planner to effectively achieve tasks in environments where dynamics are unknown.

In contrast to this work beyond the associative account of causal understanding, recent success in the field of deep reinforcement learning (RL) has produced a wide body of research, showcasing agents learning how to play games [20, 28, 25, 26] and develop complex robotic motor skills [16, 17] using associative learning schemes. However, the majority of model-free RL methods still have great difficulty transferring learned policies to new environments with consistent underlying mechanics but some dissimilar surface features [35, 14]. This deficiency is due to the limited scope of the agent’s overall objective: learning which actions will likely lead to future rewards based on the current state of the environment. In traditional RL architectures, changes to the location and orientation of critical elements (instance-level) in the agent’s environment appear as entirely new states, even though their functionality often remains the same (in the abstract-level). Since model-free RL agents do not attempt to encode transferable rules governing their environment, new situations appear as entirely new worlds. Although an agent can devise expert-level strategies through experiences in an environment, once that environment is perturbed, the agent must repeat an extensive learning process to relearn an effective policy in the altered environment.

In this work, the transfer learning problem is viewed as a combination of instance-level associative learning and abstract-level causal learning. We propose: (i) a bottom-up associative learning scheme that determines which attributes are associated with changes in the environment, and (ii) a top-down causal structure learning scheme that infers which atomic causal structures are useful for a task. The outcomes of actions are used to update beliefs about the causal hypothesis space, and our agent learns a dynamics model capable of solving our task. Specifically, we utilize a virtual “escape room” where agents are trapped in an empty room with a locked door. There is a series of conspicuous levers placed around the room with which an agent may interact. Agents placed in such a room may randomly push or pull on the levers to revise their theory about the door’s locking mechanism based on observed changes in the environment’s state. Once an agent discovers a solution, the agent is placed back into the same room but tasked with finding the next (different) solution. The agent “escapes” from the room after finding all of the solutions that can be used to unlock the door.

After completing (escaping) a single room, the agent is placed into a similar room, but with newly positioned levers. Although the levers are in different positions, the rules governing this new room are the same as the last. Thus, the agent’s task is to identify the role of each lever, according to the previously learned rules. Because these rules are abstract descriptions of the latent state of the escape room, we refer to the underlying theory as a causal schema [11]; i.e., a conceptual organization of events identified as cause and effect. Once learned, an agent is able to transfer the learned schema despite different arrangements of levers in the room. Finally, we task agents with transferring knowledge with a different but similar causal schema. The new schema may add additional levers (nodes in a graphical model) or, in a more challenging way, rearrange the structure.

This paper integrates multiple modeling approaches to produce a highly capable agent that can learn causal schemas and transfer knowledge to new scenarios. The contribution of this paper is threefold:

  1. Learning a bottom-up associative theory that encodes which objects and actions contribute to causal relations;

  2. Learning which top-down atomic causal schemas are solutions, thereby learning generalized abstract task structure;

  3. Integrating the top-down and bottom-up learning scheme with a model-based planner to optimally select interventions from causal hypotheses.

The remainder of this paper is organized as follows: Section 2 describes the OpenLock task. We present the proposed method of causal theory induction and intervention selection in Section 3 and Section 4, respectively. Section 5 compares the performance of the proposed model against various RL algorithms. Section 6 concludes the paper with discussions.

2 OpenLock Task

The OpenLock task, originally presented in Edmonds et al. [6], requires agents to “escape” from a virtual room by unlocking and opening a door. The door is unlocked by manipulating the levers in a particular sequence (see Fig. 1a). The robotic arm can push or pull each lever. Only a subset of the levers, specifically grey levers, are involved in unlocking the door (i.e., active levers). White levers are never involved in unlocking the door (i.e., inactive levers); however, this information is not provided to agents. Thus, at the instance-level, agents are expected to learn that grey levers are always part of solutions and white levers are not. Agents are also tasked with finding all solutions in the room, instead of a single solution.

Figure 2: (a) Common Cause 3 (CC3) causal structure. (b) Common Effect 3 (CE3) causal structure. (c) Common Cause 4 (CC4) causal structure. (d) Common Effect 4 (CE4) causal structure. Nodes denote the different locks and the door.

Schemas: The door locking mechanism is governed by two causal schemas: Common Cause (CC) and Common Effect (CE). We use the terms Common Cause 3 (CC3) and Common Effect 3 (CE3) for schemas with three levers involved in solutions, and Common Cause 4 (CC4) and Common Effect 4 (CE4) with four levers; see Fig. 2. Three-lever trials have two solutions; four-lever trials have three solutions. Agents are required to find all solutions within a specific room to ensure that they form either CC or CE schema structure; a single solution corresponds to a causal chain.

Constraints: Agents also operate under an action-limit constraint, where only 3 actions (referred to as an attempt) can be used to (i) push or pull on (active or inactive) levers, or (ii) push open the door. This action-limit constraint bounds the search depth of interactions with the environment. After 3 actions, regardless of the outcome, the attempt terminates, and the environment resets. Agents are also constrained to a limited number of attempts in a particular room (referred to as a trial; i.e., a sequence of attempts in a room, ending when the agent finds all the solutions or runs out of attempts). An optimal agent will use at most $N + 1$ attempts to complete a trial, where $N$ is the number of solutions in the trial: one attempt to identify the role of every lever in the abstract schema, and $N$ attempts, one for each solution.

Training: Training sessions contain only 3-lever trials. After finishing a trial, the agent is placed in another trial (i.e., room) with the same underlying causal schema but with a different arrangement of levers. If agents are forming a useful abstraction of task structure, the knowledge they acquired in previous trials should accelerate their ability to find all solutions in the present and future trials.

Transfer: In the transfer phase, we examine agents’ ability to generalize the learned abstract causal schema to different but similar environments. We use four transfer conditions consisting of (i) congruent cases where the transfer schema adopts the same structure but with an additional lever (CE3-CE4 and CC3-CC4), and (ii) incongruent cases where the underlying schema is changed with an additional lever (CC3-CE4 and CE3-CC4). We compare these transfer results against two baseline conditions (CC4 and CE4), where the agent is trained in a sequence of 4-lever trials.

While seemingly simple, this task is unique and challenging for several reasons. First, requiring the agent to find all solutions rather than a single solution enforces the task as a CC or CE structure, instead of a single causal chain. Second, transferring the agent between trials with the same underlying causal schema but different lever positions encourages efficient agents to learn an abstract representation of the causal schema, rather than learning instance-level policies tailored to a specific trial. We would expect agents unable to form this abstraction to perform poorly in any transfer condition. Third, the congruent and incongruent transfer conditions test how well agents are able to adapt their learned knowledge to different but similar causal circumstances. These characteristics of the OpenLock task present challenges for current machine learning algorithms, especially model-free RL algorithms.

3 Causal Theory Induction

Causal theory induction provides a Bayesian account of how hierarchical causal theories can be induced from data [9, 10, 30]. The key insight is: hierarchy enables abstraction. At the highest level, a theory provides general background knowledge about a task or environment. Theories consist of principles, principles lead to structure, and structure leads to data. The hierarchy used here is shown in Fig. 3a. Our agent utilizes two theories to learn a model of the OpenLock environment: (i) an instance-level associative theory regarding which attributes and actions induce state changes in the environment, denoted as the bottom-up theory, and (ii) an abstract-level causal structure theory about which atomic causal structures are useful for the task, denoted as the top-down theory.

Figure 3: Illustration of top-down and bottom-up processes. (a) Abstract-level structure learning hierarchy. At the top, atomic schemas provide the agent with environment-invariant task structures. At the bottom, causal subchains represent a single time-step in the environment. The agent constructs the hierarchy and makes decisions at the causal subchain resolution. Atomic schemas provide the top-level structural knowledge. Abstract schemas are structures specific to a task, but not a particular environment. Instantiated schemas are structures specific to a task and a particular environment. Causal chains are structures representing a single attempt; an abstract, uninstantiated causal chain is also shown for notation. Each subchain is a structure corresponding to a single action. PL, PH, L, U denote fluents pulled, pushed, locked, and unlocked, respectively. (b) The subchain posterior computed using the abstract-level structure learning and instance-level inductive learning. (c) Instance-level inductive learning. Each likelihood term is learned from causal events. Likelihood terms are combined for actions, positions, and colors.

Notation, Definition, and Space: A hypothesis space, $\Omega_C$, is defined over possible causal chains, $c \in \Omega_C$. Each chain is defined as a tuple of subchains: $c = (c_1, \dots, c_k)$, where $k$ is the length of the chain, and each subchain is defined as a tuple $c_i = (a_i, s_i, cr^a_i, cr^s_i)$. Each $a_i$ is an action node that the agent can execute, $s_i$ is a state node, $cr^a_i$ is a causal relation that defines how a state $s_i$ transitions under an action $a_i$, and $cr^s_i$ is a causal relation that defines how state $s_i$ is affected by changes to the previous state, $s_{i-1}$. Each $s_i$ is defined by a set of time-invariant attributes, $\phi_i$, and time-varying fluents, $f_i$ [31, 18, 21]; i.e., $s_i = (\phi_i, f_i)$. Action nodes can be directly intervened on, but state nodes cannot. This means an agent can directly influence (i.e., execute) an action, but how the action affects the world must be actively learned. The structure of the general causal chain is shown in the uninstantiated causal chain in Fig. 3a. As an example using Fig. 1a and the first causal chain in the causal chain level of Fig. 3a, if the agent executes push on the upper lever, the lower lever may transition from pulled to pushed, and the left lever may transition from locked to unlocked.

The space of states is defined as $\Omega_S = \Omega_\phi \times \Omega_F$, where the space of attributes $\Omega_\phi$ consists of position and color, and the space of fluents $\Omega_F$ consists of binary values for lever status (pushed or pulled) and lever lock status (locked or unlocked). The space of causal relations is defined as $\Omega_{CR} = \Omega_F \times \Omega_F$, capturing the possible binary transitions between previous fluent values and the next fluent values.

State nodes encapsulate both the time-invariant (attributes) and time-varying (fluents) components of an object. Attributes are defined by low-level features (e.g., position, color, and orientation). These low-level attributes provide general background knowledge about how specific objects change under certain actions; e.g., which levers can be pushed/pulled.
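The notation above maps naturally onto a few record types. The following is a minimal sketch for orientation only; the class and field names (`State`, `Subchain`, `action_relation`, and so on) are assumptions introduced here, and each causal relation is simplified to a (before, after) pair for a single binary fluent.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class State:
    # Time-invariant attributes of the object.
    position: str
    color: str
    # Time-varying binary fluents.
    pushed: bool
    locked: bool


@dataclass(frozen=True)
class Subchain:
    action: str                          # action node; can be intervened on directly
    state: State                         # state node; cannot be set directly
    action_relation: Tuple[bool, bool]   # (fluent before, fluent after) under the action
    state_relation: Tuple[bool, bool]    # (fluent before, fluent after) due to previous state


# One chain corresponds to one attempt, i.e. a tuple of subchains.
Chain = Tuple[Subchain, ...]

upper = State(position="upper", color="grey", pushed=False, locked=False)
lower = State(position="lower", color="grey", pushed=False, locked=True)
door = State(position="door", color="none", pushed=False, locked=True)

example_chain: Chain = (
    Subchain("push", upper, (False, True), (False, False)),
    Subchain("push", lower, (False, True), (True, False)),
    Subchain("push", door, (False, True), (True, False)),
)
```

Freezing the dataclasses makes subchains hashable, which is convenient if chains are later stored in sets or used as dictionary keys while enumerating a hypothesis space.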

Method Overview: Our agent induces instance-level knowledge regarding which objects (i.e., instances) can produce causal state changes through interaction (see Section 3.1) and simultaneously learns an abstract structural understanding of the task (i.e., schemas; see Section 3.2). The two learning mechanisms are combined to form a causal theory of the environment, and the agent uses this theory to reason about the optimal action to select based on past experiences (i.e., interventions; see Section 4). After taking an action, the agent observes the effects and updates its model of both instance-level and abstract-level knowledge.

3.1 Instance-level Inductive Learning

The agent seeks to learn which instance-level components of the scene are associated with causal events; i.e., we wish to learn a likelihood term to encode the probability that a causal event will occur. We adhere to a basic yet general associative learning theory: causal relations induce state changes in the environment, and non-causal relations do not, referred to as the bottom-up theory. We learn two independent components: attributes and actions, and we assume they are independent to learn a general associative theory, rather than specific knowledge regarding an exact causal circumstance.

We define $\Omega_\phi$, the space of attributes, such as position and color, and learn which attributes are associated with levers that induce state changes in the environment. Specifically, an object is defined by its observable features; i.e., the attributes $\phi$. We also define $\Omega_A$, a set of actions, and learn a background likelihood over which actions are more likely to induce a state change. We assume attributes and actions are independent and learn each independently.

Our agent learns a likelihood term for each attribute and action using Dirichlet distributions because they serve as a conjugate prior to the multinomial distribution. First, a global Dirichlet parameterized by $\alpha^G$ is used across all trials to encode long-term beliefs about various environments. Upon entering a new trial, a local Dirichlet parameterized by $\alpha^L$ is initialized to $k\alpha^G$, where $k$ is a normalizing factor. Scaling the local distribution in this way is necessary to allow $\alpha^L$ to adapt faster than $\alpha^G$ within one trial; i.e., agents must adapt more rapidly to the current trial than across all trials. Thus, we maintain a set of Dirichlet distributions: one for each attribute (e.g., position and color), encoding which attribute values are associated with state changes, and one over actions, encoding which actions are more likely to cause a state change, independent of any particular circumstance.
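The effect of keeping a slow global belief and a faster, rescaled local belief can be seen in a few lines. This is a sketch under stated assumptions rather than the authors' code: the class name `DirichletBelief`, the starting counts, and the scale `0.1` are all hypothetical, and the update rule shown is the standard conjugate count update.

```python
class DirichletBelief:
    """Dirichlet belief over categories, stored as concentration parameters."""

    def __init__(self, alpha):
        self.alpha = dict(alpha)

    def mean(self):
        # Expected multinomial parameters under the Dirichlet.
        total = sum(self.alpha.values())
        return {k: v / total for k, v in self.alpha.items()}

    def observe(self, category, count=1.0):
        # Conjugate update: add the observed count to the matching parameter.
        self.alpha[category] += count

    def scaled(self, k):
        # New belief with the same mean but k times the concentration.
        return DirichletBelief({c: k * a for c, a in self.alpha.items()})


# Global belief over lever color, shared across trials (illustrative counts).
global_color = DirichletBelief({"grey": 10.0, "white": 10.0})
# Local belief for the current trial; 0.1 is an arbitrary illustrative scale.
local_color = global_color.scaled(0.1)

# A single causal event involving a grey lever updates both beliefs.
global_color.observe("grey")
local_color.observe("grey")

global_grey = global_color.mean()["grey"]   # 11 / 21
local_grey = local_color.mean()["grey"]     # 2 / 3
```

Because the local parameters start with less total mass, the same observation shifts the local mean much further than the global mean, which is the faster within-trial adaptation the text describes.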

We introduce $\rho$ to represent a causal event or observation occurring in the environment. Our agent wishes to assess the likelihood of a particular causal chain $c$ producing a causal event. The agent computes this likelihood by decomposing the chain into subchains $c_i$:

$$p(\rho \mid c) = \prod_{c_i \in c} p(\rho \mid c_i),$$

where $p(\rho \mid c_i)$ is formulated as

$$p(\rho \mid c_i) = p(\rho \mid a_i) \prod_{\phi_{ij} \in \phi_i} p(\rho \mid \phi_{ij}),$$

where $a_i$ is the action of subchain $c_i$, $\phi_{ij}$ ranges over the attributes $\phi_i$ of its state node, and $p(\rho \mid \phi_{ij})$ and $p(\rho \mid a_i)$ follow multinomial distributions parameterized by a sample from the attribute and action Dirichlet distribution, respectively (see supplementary materials for additional details). Intuitively, this bottom-up associative likelihood encodes a naive Bayesian prediction of how likely a particular subchain is to be involved with any causal event by considering how frequently the attributes and actions have been in causal events in the past, without regard for task structure. For example, we would expect an agent in OpenLock to learn that grey levers move under certain circumstances and white levers never move. This instance-level learning provides the agent with task-invariant, basic knowledge about which subchains are more likely to produce a causal effect.
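The naive Bayesian combination described above amounts to multiplying one term per action and per attribute, and then multiplying across the subchains of a chain. The sketch below illustrates that arithmetic only; the function names and probability tables are hypothetical, and fixed probabilities stand in for parameters that the text says are sampled from Dirichlet distributions.

```python
def subchain_likelihood(action, attributes, action_probs, attribute_probs):
    """Naive-Bayes style score: one action term times one term per attribute."""
    score = action_probs[action]
    for name, value in attributes.items():
        score *= attribute_probs[name][value]
    return score


def chain_likelihood(subchains, action_probs, attribute_probs):
    """Product of the subchain scores along the chain."""
    score = 1.0
    for action, attributes in subchains:
        score *= subchain_likelihood(action, attributes, action_probs, attribute_probs)
    return score


# Illustrative tables: grey levers have been seen in causal events far more often.
action_probs = {"push": 0.5, "pull": 0.5}
attribute_probs = {
    "color": {"grey": 0.9, "white": 0.1},
    "position": {"upper": 0.5, "lower": 0.5},
}

grey_push = subchain_likelihood(
    "push", {"color": "grey", "position": "upper"}, action_probs, attribute_probs
)
white_push = subchain_likelihood(
    "push", {"color": "white", "position": "upper"}, action_probs, attribute_probs
)
two_step = chain_likelihood(
    [
        ("push", {"color": "grey", "position": "upper"}),
        ("push", {"color": "grey", "position": "lower"}),
    ],
    action_probs,
    attribute_probs,
)
```

With these tables a grey lever scores nine times higher than a white lever in the same position, independent of any task structure.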

3.2 Abstract-level Structure Learning

In this section, we outline how the agent learns abstract schemas; these schemas are used to encode generalized knowledge about task structure that is invariant to a specific observational environment.

A space of atomic causal schemas, $\Omega_{g^M}$, consisting of the causal chain, CC, and CE, serves as the set of categories for the Bayesian prior. The belief in each atomic schema is modeled as a multinomial distribution, whose parameters are defined by a Dirichlet distribution. This root Dirichlet distribution’s parameters are updated after every trial according to the top-down causal theory, computed as the minimal graph edit distance between an atomic schema and the trial’s solution structure. This process yields a prior over atomic schemas, denoted as $p(g^M)$, and provides the prior for the top-down inference process. Such abstraction allows agents to transfer beliefs between the abstract notions of CC and CE without considering task-specific requirements; e.g., 3- or 4-lever configurations.

Next, we compute the belief in abstract instantiations of the atomic schemas. These abstract schemas share structural properties with atomic schemas but have a structure that matches the task definition. For instance, each schema must have three subchains to account for the 3-action limit imposed by the environment and should have $N$ trajectories, where $N$ is the number of solutions in the trial. Each abstract schema is denoted as $g^A$, and the space of abstract schemas, denoted $\Omega_{g^A}$, is enumerated. The belief in an abstract causal schema is computed by marginalizing over the atomic schemas $g^M \in \Omega_{g^M}$, with prior $p(g^M)$:

$$p(g^A) = \sum_{g^M \in \Omega_{g^M}} p(g^A \mid g^M)\, p(g^M).$$

The abstract structural space can be used to transfer beliefs between rooms; however, we need to perform inference over settings of positions and colors in this trial as the agent executes. Thus, the agent enumerates a space of instantiated schemas $\Omega_{g^I}$, where each $g^I \in \Omega_{g^I}$ is an instantiated schema. The agent then computes the belief in an instantiated schema as

$$p(g^I \mid do(q)) = \sum_{g^A \in \Omega_{g^A}} p(g^I \mid g^A, do(q))\, p(g^A),$$

where $do$ represents the do operator [22], and $q$ represents the solutions already executed. Conditioning on $do(q)$ constrains the space to have instantiated solutions that contain the solutions already discovered by the agent in this trial. Causal chains define the next lower level in the hierarchy, where each chain corresponds to a single attempt. The belief in a causal chain $c$ is computed as

$$p(c \mid do(q)) = \sum_{g^I \in \Omega_{g^I}} p(c \mid g^I)\, p(g^I \mid do(q)).$$

Finally, the agent computes the belief in each possible subchain $c_i$ as

$$p(c_i \mid do(\tau, q)) = \sum_{c} p(c_i \mid c, do(\tau))\, p(c \mid do(q)),$$

where $do(\tau, q)$ represents the intervention of performing the action sequence executed thus far in this attempt, $\tau$, and performing all solutions found thus far, $q$. This hierarchical process allows the agent to learn and reason about abstract task structure, taking into consideration the specific instantiation of the trial, as well as the agent’s history within this trial.
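Each level of the hierarchy passes belief down to the level below by summing over the parent level. A minimal sketch of one such step is given here, assuming beliefs are plain dictionaries; the function name `marginalize`, the schema labels, and all numbers are illustrative and not taken from the paper.

```python
def marginalize(parent_belief, conditional):
    """Child belief: sum over parents of p(child | parent) * p(parent)."""
    child_belief = {}
    for parent, p_parent in parent_belief.items():
        for child, p_child in conditional[parent].items():
            child_belief[child] = child_belief.get(child, 0.0) + p_child * p_parent
    return child_belief


# Illustrative prior over atomic schemas after some training trials.
atomic_prior = {"CC": 0.6, "CE": 0.3, "chain": 0.1}

# Illustrative p(abstract | atomic) for a three-lever task.
abstract_given_atomic = {
    "CC": {"CC3": 1.0},
    "CE": {"CE3": 1.0},
    "chain": {"CC3": 0.5, "CE3": 0.5},
}

abstract_belief = marginalize(atomic_prior, abstract_given_atomic)
```

The same helper could be applied again from abstract to instantiated schemas, and from instantiated schemas to chains, with the conditionals restricted to hypotheses consistent with the solutions already found.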

Additionally, if the agent encounters an action sequence that does not produce a causal event, the agent prunes all chains that contain the action sequence from the space of causal chains and prunes all instantiated schemas that contain the corresponding chains from the space of instantiated schemas. This pruning strategy means the agent assumes the environment is deterministic and updates its theory about which causal chains are causally plausible through interactions on-the-fly.
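The pruning step can be sketched as two filters. The representation below is an assumption made for illustration: chains are tuples of action labels, an instantiated schema is a tuple of chains, and "contains the action sequence" is interpreted as "begins with it"; the helper names are hypothetical.

```python
def starts_with(chain, failed_actions):
    """True if the chain begins with the failed action sequence."""
    return tuple(chain[: len(failed_actions)]) == tuple(failed_actions)


def prune(chains, schemas, failed_actions):
    """Drop chains that begin with the failed sequence, then schemas using them."""
    bad = {c for c in chains if starts_with(c, failed_actions)}
    kept_chains = [c for c in chains if c not in bad]
    kept_schemas = [s for s in schemas if not any(c in bad for c in s)]
    return kept_chains, kept_schemas


chains = [
    ("push_L0", "push_L1", "push_D"),
    ("push_L0", "push_L2", "push_D"),
    ("push_L1", "push_L0", "push_D"),
]
schemas = [(chains[0], chains[1]), (chains[2],)]

# Suppose pushing L0 first produced no state change in this room.
kept_chains, kept_schemas = prune(chains, schemas, ("push_L0",))
```

Hard pruning of this kind is only sound if outcomes are deterministic, as the text notes; in a noisy environment a soft down-weighting would be the safer choice.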

4 Intervention Selection

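Our agent’s goal is to pick the action it believes has the highest chance of (i) being causally plausible in the environment and (ii) being part of the solution to the task. We decompose each subchain into its respective parts, $c_i = (a_i, s_i, cr^a_i, cr^s_i)$. Writing $\rho$ for a causal event, $\tau$ for the action sequence executed so far in this attempt, and $q$ for the solutions found so far in this trial, the agent combines the top-down and bottom-up processes into a final subchain posterior:

$$p(c_i \mid \rho, do(\tau, q)) \propto p(\rho \mid c_i)\, p(c_i \mid do(\tau, q)).$$

Next, the agent marginalizes over causal relations and states to obtain a final, action-level term to select interventions:

$$p(a_i \mid \rho, do(\tau, q)) = \sum_{s_i} \sum_{cr^a_i} \sum_{cr^s_i} p(a_i, s_i, cr^a_i, cr^s_i \mid \rho, do(\tau, q)).$$

The agent uses a model-based planner to produce action sequences capable of opening the door (following human participant instructions in [6]). The goal is defined as reaching a particular state $s^*$, and the agent seeks to execute the action that maximizes the posterior subject to the constraint that the action appears in the set of chains that satisfy the goal, $C_G$. We define the set of actions that appear in chains satisfying the goal as $A_G$. The agent’s final planning goal is

$$a^* = \arg\max_{a_i \in A_G} p(a_i \mid \rho, do(\tau, q)).$$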
At each time-step, the agent selects the action that maximizes this planning objective and updates its beliefs about the world as described in Section 3.1 and Section 3.2. This iterative process consists of optimal decision-making based on the agent’s current understanding of the world, followed by updating the agent’s beliefs based on the observed outcome.
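The selection rule reduces to summing subchain scores per action and taking the best action among those that can still reach the goal. The following is a minimal sketch with a hypothetical dictionary representation and made-up scores; it is not the planner used in the paper.

```python
def select_action(subchain_posterior, goal_actions):
    """Marginalize subchain scores to actions, then pick the best goal-consistent one."""
    action_score = {}
    for (action, _state, _relation), score in subchain_posterior.items():
        # Sum out the state and causal-relation components of each subchain.
        action_score[action] = action_score.get(action, 0.0) + score
    candidates = {a: s for a, s in action_score.items() if a in goal_actions}
    # Raises ValueError if no goal-consistent action remains.
    return max(candidates, key=candidates.get)


# Illustrative (unnormalized) subchain posterior keyed by (action, state, relation).
subchain_posterior = {
    ("push_L0", "L1", "unlocks"): 0.3,
    ("push_L0", "L2", "unlocks"): 0.3,
    ("push_L1", "D", "unlocks"): 0.25,
    ("pull_L3", "L3", "no_change"): 0.7,
}
# Actions that appear in at least one chain reaching the goal state.
goal_actions = {"push_L0", "push_L1"}

best = select_action(subchain_posterior, goal_actions)
```

Note that `pull_L3` has the largest raw score but is excluded because it appears in no goal-satisfying chain, which mirrors the constraint in the planning objective.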

5 Experiments

We compare the proposed theory-based causal transfer model against predominant model-free RL algorithms. Specifically, we compare the proposed method against Deep Q-Network (DQN) [20], DQN with prioritized experience replay (DQN (PE)) [24], Advantage Actor-Critic (A2C) [19], Trust Region Policy Optimization (TRPO) [25], Proximal Policy Optimization (PPO) [26], and Model-Agnostic Meta-Learning (MAML) [7] agents. We use the terms positive transfer and negative transfer to indicate that agent performance benefits from or is hindered by the training phase, respectively.

5.1 Experimental Setup

The proposed model follows the same procedure as the one used for the human studies presented in Edmonds et al. [6]. Baseline (no transfer) agents are placed in 4-lever scenarios for all trials. Transfer agents are evaluated in two phases: training and transfer. For every training trial, the agent is placed into a 3-lever trial and allowed 30 attempts to find all solutions. In the transfer phase, the agent is tasked with a 4-lever trial. Critically, the agent only sees each trial (room) one time, so generalizations must be formed quickly to transfer between trials successfully. See Section 2 for more details.

When executing various model-free RL agents under this experimental setup, no meaningful learning takes place. Instead, we train RL agents by looping through all rooms repeatedly (thereby seeing each room multiple times). Agents are also allowed 700 attempts in each trial to find all solutions. During training, agents execute for 200 training iterations, where each iteration consists of looping through all six 3-lever trials. During transfer, agents execute for 200 transfer iterations, where each iteration consists of looping through all five 4-lever trials. Note that the setup for RL agents is advantageous; in comparison, both the proposed model and human subjects are only allowed 30 attempts (versus 700) during the training and 1 iteration (versus 200) for transfer.


RL agents operate directly on the state of the simulator encoded as a 16-dimensional binary vector: (i) the status of each of the 7 levers (pushed or pulled), (ii) the color of each of the 7 levers (grey or white), (iii) the status of the door (open or closed) and (iv) the status of the door lock indicator (locked or unlocked). The 7-dimensional encoding of the status and color of each lever encodes the position of each lever; e.g., the 0-th index corresponds to the upper-right position. Despite direct access to the simulator’s state, RL approaches were unable to form a transferable task abstraction.
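The 16 dimensions break down as 7 + 7 + 1 + 1. A small sketch of such an encoder follows; the function name, the field order, and the choice of which value maps to 1 are assumptions made here, since the text does not specify them.

```python
def encode_state(lever_pushed, lever_grey, door_open, door_locked):
    """Pack lever status, lever color, door status, and lock indicator into 16 bits."""
    if len(lever_pushed) != 7 or len(lever_grey) != 7:
        raise ValueError("expected exactly 7 lever slots")
    bits = [int(b) for b in lever_pushed]        # 7 bits: pushed (1) or pulled (0)
    bits += [int(b) for b in lever_grey]         # 7 bits: grey (1) or white (0)
    bits += [int(door_open), int(door_locked)]   # 2 bits: door and lock indicator
    return bits


# Index 0 is taken to be the upper-right slot, as in the text.
state = encode_state(
    lever_pushed=[False, True, False, False, False, False, False],
    lever_grey=[True, True, True, False, False, False, False],
    door_open=False,
    door_locked=True,
)
```

Because position is carried only by the index within each 7-slot block, moving a lever to a different slot produces a different vector even when its causal role is unchanged, which is the instance-level sensitivity discussed in the introduction.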

Additionally, we utilized a plethora of reward functions to explore under what circumstances these RL approaches may succeed. Our agents used sparse reward functions, shaped reward functions, and conditional reward functions that encourage agents to find unique solutions (see supplementary materials for the numerous architectures, parameters, and reward functions used). A reward function that only rewards unique solutions performed best, meaning agents were only rewarded the first time they found a particular solution. This is similar to the human experimental setup, under which participants were informed when they found a solution for the first time (thereby making progress towards the goal of finding all solutions) but were not informed when they executed the same solution multiple times (thereby not making progress towards the goal).
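A reward that pays out only for first-time solutions needs to remember what has already been found in the current trial. The sketch below shows one way to do that; the class name, the reward magnitude of 1.0, and the action labels are hypothetical and are not the reward functions from the supplementary materials.

```python
class UniqueSolutionReward:
    """Reward an attempt only the first time it matches a given solution."""

    def __init__(self, solutions):
        self.solutions = {tuple(s) for s in solutions}
        self.found = set()

    def __call__(self, attempt):
        attempt = tuple(attempt)
        if attempt in self.solutions and attempt not in self.found:
            self.found.add(attempt)
            return 1.0  # illustrative magnitude
        return 0.0

    def all_found(self):
        return self.found == self.solutions


reward = UniqueSolutionReward(
    [("push_L0", "push_L1", "push_D"), ("push_L0", "push_L2", "push_D")]
)
first = reward(("push_L0", "push_L1", "push_D"))    # new solution
repeat = reward(("push_L0", "push_L1", "push_D"))   # same solution again
miss = reward(("push_L1", "push_L1", "push_D"))     # not a solution
```

A fresh instance would be created at the start of each trial so that the set of found solutions resets with the room.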

Figure 4: RL results for baseline and transfer conditions. Baseline (no transfer) results show the best-performing algorithms (PPO, TRPO) achieving approximately 10 and 25 attempts by the end of the baseline training for CC4 and CE4, respectively. A2C is the only algorithm to show positive transfer; A2C performed better with training for the CC4 condition. The last 50 iterations are not shown due to the use of a smoothing function.

5.2 Reinforcement Learning Results

The model-free RL results, shown in Fig. 4, demonstrate that A2C, TRPO, and PPO are capable of learning how to solve the OpenLock task from scratch. However, A2C in the CC4 condition is the only agent showing positive transfer; every other agent in every condition shows negative transfer.

These results indicate that current model-free RL algorithms are capable of learning how to achieve this task; however, the capability to transfer the learned abstract knowledge is markedly different compared to human performance in Edmonds et al. [6]. Due to the overall negative transfer trends shown by nearly every RL agent, we conclude that these RL algorithms cannot capture the correct abstractions to transfer knowledge between the 3-lever training phase and the 4-lever transfer phase. Note that the RL algorithms found the CE4 condition more difficult than CC4, a result also shown in our proposed model results and human participants.

5.3 Theory-based Causal Transfer Results

Figure 5: Model performance vs. human performance. (a) Proposed model baseline results for CC4/CE4. We see an asymmetry between the difficulty of CC and CE. (b) Human baseline performance [6]. (c) Proposed model transfer results for training in CC3/CE3. The transfer results show that transferring to an incongruent CE4 condition (different structure, additional lever; CC3 to CE4) was more difficult than transferring to a congruent condition (same structure, additional lever; CE3 to CE4). However, the agent did not show a significant difference in difficulty between congruent and incongruent training for the CC4 transfer condition. (d) Human transfer performance [6].

The results using the proposed model are shown in Fig. 5. These results are qualitatively and quantitatively similar to the human participant results presented in Edmonds et al. [6], and starkly different from the RL results. We execute 40 agents in each condition, matching the number of human subjects described in Edmonds et al. [6].

Our agent does not require looping over trials multiple times; it is capable of learning and generalizing from seeing each trial only one time. In the baseline agents, the CE4 condition was more difficult than CC4; this trend was also observed in human participants. During transfer, performance is similar to the baseline results; however, for CE4 transfer, congruent cases (transferring to the same structure with an additional lever) were easier than incongruent cases (transferring to a different structure with an additional lever), and this difference was statistically significant. For CC4 transfer, no significant difference was observed, indicating that both CC3 and CE3 training obtained near-equal performance when transferred to CC4.

These learning results are significantly different from the RL results; the proposed causal theory-based model is capable of learning the correct abstraction using instance and structural learning schemes, showing similar trends as the human participants. It is worth noting that RL agents were trained under highly advantageous settings. RL agents: (i) were given more attempts per trial; and (ii) more importantly, were allowed to learn in the same trial multiple times. In contrast, the present model learns the proper mechanisms to: (i) transfer knowledge to structurally equivalent but observationally different scenarios (baseline experiments); (ii) transfer knowledge to cases with structural differences (transfer experiments); and (iii) do so using the same experimental setup as humans. The model achieves this by understanding which scene components are capable of inducing state changes in the environment while leveraging overall task structure. (For additional model results and ablations, see the supplementary material.)

6 Conclusion and Discussion

In this work, we show how theory-based causal transfer coupled with an associative learning scheme can be used to learn transferable structural knowledge under both observationally and structurally varying tasks. We executed a range of model-free RL algorithms, none of which learned a transferable representation of the OpenLock task, even under favorable baseline and transfer conditions. In contrast, the proposed model not only completes the task successfully, but also adheres closely to the human participant results in Edmonds et al. [6].

These results suggest that current model-free RL methods lack the necessary learning mechanisms to learn generalized representations in hierarchical, structured tasks. Our model results indicate human causal transfer follows similar abstractions as those presented in this work, namely learning abstract causal structures and learning instance-specific knowledge that connects this particular environment to abstract structures. The model presented here can be used in any reinforcement learning environment where: (i) the environment is governed by a causal structure, (ii) causal cues can be uncovered from interacting with objects with observable attributes, and (iii) different circumstances share some common causal properties (structure and/or attributes).

6.1 Discussion

Why is causal learning important for RL? We argue that causal knowledge provides a succinct, well-studied, and well-developed framework for representing cause and effect relationships. This knowledge is invariant to extrinsic rewards and can be used to accomplish many tasks. In this work, we show that leveraging abstract causal knowledge can be used to transfer knowledge across environments with similar structure but different observational properties.

How can RL benefit from structured causal knowledge? Model-free RL is adept at learning a representation to maximize a reward within simple, non-hierarchical environments using a greedy process. However, current approaches do not restrict or impose learning an abstract structural representation of the environment. RL algorithms should be augmented with mechanisms to learn explicit structural knowledge and jointly optimized to learn an abstract structural encoding of the task while maximizing rewards.

Why is CE more difficult than CC? Human participants, RL, and the proposed model all found CE more difficult than CC. A natural question is: why? We posit that the asymmetry can be explained from a decision-tree perspective. In the CC condition, if the agent makes a mistake on the first action, the environment will not change, and the rest of the attempt is bound to fail. However, should the agent choose the correct grey lever, the agent can choose either of the remaining grey levers, both of which will unlock the door. Conversely, in the CE condition, the agent has two grey levers to choose from in the first action; both will unlock the lever needed to unlock the door. However, the second action is more ambiguous: the agent could choose the correct lever, but it could also choose the other grey lever. Such complexity leads to more failure paths from a decision-tree planning perspective. In the CC condition, the agent receives immediate feedback on the first action as to whether or not the plan will fail; the CE condition, on the other hand, has more failure pathways. We plan to investigate this property further, as this asymmetry was unexpected and unexplored in the literature.
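The decision-tree argument can be made concrete with a small enumeration. The following is a minimal sketch under simplifying assumptions that are ours, not the paper's: the structures are reduced to three levers labeled A, B, and C, an attempt consists of two distinct lever actions, acting on a locked lever has no effect, and the final door-push action is omitted. The dictionary layout and the name `classify_paths` are hypothetical.

```python
from itertools import permutations

# Hypothetical encodings of the 3-lever structures (labels are illustrative).
# CC: one lever unlocks two others, and either of those unlocks the door.
CC3 = {
    "levers": ["A", "B", "C"],
    "initially_unlocked": {"A"},
    "unlocks": {"A": {"B", "C"}},
    "opens_door": {"B", "C"},
}
# CE: two levers each unlock a third, and the third unlocks the door.
CE3 = {
    "levers": ["A", "B", "C"],
    "initially_unlocked": {"A", "B"},
    "unlocks": {"A": {"C"}, "B": {"C"}},
    "opens_door": {"C"},
}


def classify_paths(structure, n_actions=2):
    """Count successes, early failures, and late failures.

    An early failure is a failed sequence whose first action changed nothing,
    so the agent gets immediate negative feedback. A late failure is a failed
    sequence whose first action did change the environment.
    """
    counts = {"success": 0, "early_failure": 0, "late_failure": 0}
    for seq in permutations(structure["levers"], n_actions):
        unlocked = set(structure["initially_unlocked"])
        door_unlocked = False
        first_action_changed_state = False
        for step, lever in enumerate(seq):
            changed = False
            if lever in unlocked:  # assumption: locked levers cannot be moved
                newly_unlocked = structure["unlocks"].get(lever, set()) - unlocked
                if newly_unlocked:
                    unlocked |= newly_unlocked
                    changed = True
                if lever in structure["opens_door"] and not door_unlocked:
                    door_unlocked = True
                    changed = True
            if step == 0:
                first_action_changed_state = changed
        if door_unlocked:
            counts["success"] += 1
        elif first_action_changed_state:
            counts["late_failure"] += 1
        else:
            counts["early_failure"] += 1
    return counts


cc_counts = classify_paths(CC3)
ce_counts = classify_paths(CE3)
print("CC3:", cc_counts)
print("CE3:", ce_counts)
```

Under these assumptions both structures admit the same number of successful sequences, but only the CE structure has late failures, i.e., sequences in which the first action changes the environment and the attempt still fails. This is one way to read the claim that CC gives immediate feedback while CE leaves the second action ambiguous.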

What other theories may be useful for learning causal relationships? In this work, we adhere to an associative learning theory. We adopt the theory that causal relationships induce state changes. However, other theories may also be appealing. For instance, the associative theory used does not directly account for long-term relationships (delayed effects). More complex theories could potentially account for delayed effects; e.g., when an agent could not find a causal attribute for a particular event, the agent could examine attributes jointly to best explain the causal effect observed. Prior work has examined structural analogies [12, 33, 34] and object mappings [8] to facilitate transfer; these may also be useful to acquire transferable causal knowledge.
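To illustrate the idea that causal attributes are those associated with state changes, the following is a minimal sketch of a count-based associative learner. It is not the instance-level scheme used in this work; the class name `AttributeAssociations`, the attribute labels, and the smoothing prior are all assumptions made for illustration.

```python
from collections import defaultdict


class AttributeAssociations:
    """Track how often interacting with an attribute induces a state change."""

    def __init__(self, prior_count=1.0):
        # Assumed symmetric pseudo-count (Laplace smoothing); hypothetical choice.
        self.prior = prior_count
        self.changes = defaultdict(float)
        self.trials = defaultdict(float)

    def observe(self, attribute, state_changed):
        """Record one interaction with an object carrying `attribute`."""
        self.trials[attribute] += 1
        if state_changed:
            self.changes[attribute] += 1

    def estimate(self, attribute):
        """Smoothed estimate that interacting with `attribute` changes state."""
        return (self.changes[attribute] + self.prior) / (
            self.trials[attribute] + 2 * self.prior
        )


learner = AttributeAssociations()
for changed in [True, True, False, True]:
    learner.observe("grey", changed)
for changed in [False, False, False]:
    learner.observe("white", changed)

grey_estimate = learner.estimate("grey")
white_estimate = learner.estimate("white")
unseen_estimate = learner.estimate("blue")
print(grey_estimate, white_estimate, unseen_estimate)
```

Because each interaction is credited only to the immediately observed outcome, a learner of this kind cannot attribute a delayed effect to the earlier action that caused it, which is the limitation noted above.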

How can hypothesis space enumeration be avoided? Hypothesis space enumeration can quickly become intractable as problems increase in size. While this work used a fixed, fully enumerated hypothesis space, future work will examine how sampling-based approaches can be used to iteratively generate causal hypotheses. Bramley et al. [1] showed a Gibbs-sampling based approach; however, this sampling should be guided with top-down reasoning that steers the causal learning process by combining already known causal knowledge with proposed hypotheses.
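To give a sense of how quickly an enumerated space of causal structures can grow, the sketch below counts labeled directed acyclic graphs on n nodes using Robinson's recurrence. This is an illustration only: the hypothesis space used in this work is not the space of all DAGs, and the function name `count_labeled_dags` is ours.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_labeled_dags(n):
    """Number of labeled DAGs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    # Inclusion-exclusion over the k nodes that have no incoming edges.
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_labeled_dags(n - k)
        for k in range(1, n + 1)
    )


dag_counts = [count_labeled_dags(n) for n in range(6)]
print(dag_counts)
```

The count grows from 25 structures on three nodes to 29281 on five, which suggests why full enumeration stops being practical even for modest problem sizes.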

How well would model-based RL perform in this task? Model-based RL may exhibit faster learning within a particular environment but still lacks mechanisms to form abstractions that enable human-like transfer. This is an open research question, and we plan on investigating how abstraction can be integrated with model-based RL methods.

How is this method different from hierarchical RL? Typically, hierarchical RL is defined on a hierarchy of goals, where subgoals represent options that can be executed by a high-level planner [5]. Each causally-plausible hypothesis can be seen as an option to execute. This work seeks to highlight the importance of leveraging causal knowledge to form a world-model and using said model to guide a reinforcement learner. In fact, our work can be recast as a form of hierarchical model-based RL.

Future work should primarily focus on how to integrate the proposed causal learning algorithm directly with reinforcement learning. An agent capable of integrating causal learning with reinforcement learning could generalize world dynamics (causal knowledge) and goals (rewards) to novel but similar environments. One challenge, unaddressed in this paper, is how to generalize rewards to varied environments. Traditional reinforcement learning methods, such as Q-learning, do not provide a mechanism to extrapolate internal values to similar but different states. In this work, we showed how extrapolating causal knowledge can aid in uncovering the causal relationships in similar environments. Adopting a similar scheme for some form of reinforcement learning would enable reinforcement learners to succeed in the OpenLock task without iterating over the trials multiple times, and could enable one-shot reinforcement learning. Future work will also examine how a learner can iteratively grow a causal hypothesis while incorporating a background theory of causal relationships.
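The point about value extrapolation can be seen in a minimal tabular Q-learning sketch. The state and action names below are hypothetical and do not correspond to the OpenLock state encoding; the learning rate and discount factor are arbitrary choices.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Apply one standard tabular Q-learning update in place."""
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])


actions = ["pull_lever", "push_door"]
q = defaultdict(float)  # every unseen (state, action) pair defaults to 0.0

# One rewarded experience in a hypothetical 3-lever state.
q_update(q, "3_levers_door_unlocked", "push_door", 1.0, "terminal", actions)

trained_value = q[("3_levers_door_unlocked", "push_door")]
# A similar but distinct 4-lever state shares nothing with the trained entry.
unseen_value = q[("4_levers_door_unlocked", "push_door")]
print(trained_value, unseen_value)
```

The table treats each state as an unrelated key, so the value learned in the 3-lever state does not carry over to the analogous 4-lever state. Function approximation can share values across states, but it does so through learned features rather than through explicit causal structure.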


The authors thank Chi Zhang at the UCLA Computer Science Department, and Feng Gao, Prof. Tao Gao, and Prof. Ying Nian Wu at the UCLA Statistics Department for helpful discussions. The work reported herein is supported by MURI ONR N00014-16-1-2007, DARPA XAI N66001-17-2-4029, ONR N00014-19-1-2153, and an NVIDIA GPU donation grant.


  • [1] N. R. Bramley, P. Dayan, T. L. Griffiths, and D. A. Lagnado (2017) Formalizing Neurath’s ship: approximate algorithms for online causal learning. Psychological Review 124 (3), pp. 301. Cited by: §1.
  • [2] N. R. Bramley, D. A. Lagnado, and M. Speekenbrink (2015) Conservative forgetful scholars: how people learn causal structure through sequences of interventions. Journal of Experimental Psychology: Learning, Memory, and Cognition 41 (3), pp. 708. Cited by: §1.
  • [3] N. R. Bramley, T. Gerstenberg, J. B. Tenenbaum, and T. M. Gureckis (2018) Intuitive experimentation in the physical world. Cognitive Psychology 105, pp. 9–38. Cited by: §1.
  • [4] P. W. Cheng (1997) From covariation to causation: a causal power theory. Psychological Review 104 (2), pp. 367. Cited by: §1.
  • [5] N. Chentanez, A. G. Barto, and S. P. Singh (2005) Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §6.1.
  • [6] M. Edmonds, J. Kubricht, C. Summers, Y. Zhu, B. Rothrock, S. Zhu, and H. Lu (2018) Human causal transfer: challenges for deep reinforcement learning. In Annual Meeting of the Cognitive Science Society (CogSci), Cited by: §1, §4, Figure 5.
  • [7] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. Cited by: §5.
  • [8] T. Fitzgerald, A. Goel, and A. Thomaz (2018) Human-guided object mapping for task transfer. ACM Transactions on Human-Robot Interaction (THRI) 7 (2), pp. 17. Cited by: §6.1.
  • [9] T. L. Griffiths and J. B. Tenenbaum (2005) Structure and strength in causal induction. Cognitive Psychology 51 (4), pp. 334–384. Cited by: §1, §3.
  • [10] T. L. Griffiths and J. B. Tenenbaum (2009) Theory-based causal induction. Psychological Review 116 (4), pp. 661–716. Cited by: §1, §3.
  • [11] F. Heider (1958) The psychology of interpersonal relations. Psychology Press. Cited by: §1.
  • [12] T. Hinrichs and K. D. Forbus (2011) Transfer learning through analogy in games. AI Magazine 32 (1), pp. 70–70. Cited by: §6.1.
  • [13] K. Holyoak and P. W. Cheng (2011) Causal learning and inference as a rational process: the new synthesis. Annual Review of Psychology 62, pp. 135–163. Cited by: §1, §1.
  • [14] K. Kansky, T. Silver, D. A. Mély, M. Eldawy, M. Lázaro-Gredilla, X. Lou, N. Dorfman, S. Sidor, S. Phoenix, and D. George (2017) Schema networks: zero-shot transfer with a generative causal model of intuitive physics. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1809–1818. Cited by: §1.
  • [15] S. Legg and M. Hutter (2007) Universal intelligence: a definition of machine intelligence. Minds and Machines 17 (4), pp. 391–444. Cited by: §1.
  • [16] S. Levine, C. Finn, T. Darrell, and P. Abbeel (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373. Cited by: §1.
  • [17] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §1.
  • [18] C. Maclaurin (1742) A treatise of fluxions: in two books. 1. Vol. 1, Ruddimans. Cited by: §3.
  • [19] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning (ICML), Cited by: §5.
  • [20] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §1, §5.
  • [21] I. Newton and J. Colson (1736) The method of fluxions and infinite series; with its application to the geometry of curve-lines. Henry Woodfall; and sold by John Nourse. Cited by: §3.
  • [22] J. Pearl (2009) Causality. Cambridge University Press. Cited by: §3.2.
  • [23] R. A. Rescorla and A. R. Wagner (1972) A theory of Pavlovian conditioning: variations in the effectiveness of reinforcement and nonreinforcement. Classical conditioning II: Current research and theory 2, pp. 64–99. Cited by: §1.
  • [24] T. Schaul, J. Quan, I. Antonoglou, and D. Silver (2016) Prioritized experience replay. In International Conference on Learning Representations (ICLR), Cited by: §5.
  • [25] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In International Conference on Machine Learning (ICML), Cited by: §1, §5.
  • [26] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §1, §5.
  • [27] D. R. Shanks and A. Dickinson (1988) Associative accounts of causality judgment. Psychology of learning and motivation 21, pp. 229–261. Cited by: §1.
  • [28] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489. Cited by: §1.
  • [29] A. E. Stahl and L. Feigenson (2015) Observing the unexpected enhances infants’ learning and exploration. Science 348 (6230), pp. 91–94. Cited by: §1.
  • [30] J. B. Tenenbaum, T. L. Griffiths, and C. Kemp (2006) Theory-based bayesian models of inductive learning and reasoning. Trends in Cognitive Sciences 10 (7), pp. 309–318. Cited by: §1, §3.
  • [31] M. Thielscher (1998) Introduction to the fluent calculus. Citeseer. Cited by: §3.
  • [32] M. R. Waldmann and K. J. Holyoak (1992) Predictive and diagnostic learning within causal models: asymmetries in cue competition. Journal of Experimental Psychology: General 121 (2), pp. 222–236. Cited by: §1.
  • [33] C. Zhang, F. Gao, B. Jia, Y. Zhu, and S. Zhu (2019) RAVEN: a dataset for relational and analogical visual reasoning. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §6.1.
  • [34] C. Zhang, B. Jia, F. Gao, Y. Zhu, H. Lu, and S. Zhu (2019) Learning perceptual inference by contrasting. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §6.1.
  • [35] C. Zhang, O. Vinyals, R. Munos, and S. Bengio (2018) A study on overfitting in deep reinforcement learning. arXiv preprint arXiv:1804.06893. Cited by: §1.