Parenting: Safe Reinforcement Learning from Human Input

02/18/2019 ∙ by Christopher Frye, et al.

Autonomous agents trained via reinforcement learning present numerous safety concerns: reward hacking, negative side effects, and unsafe exploration, among others. In the context of near-future autonomous agents, operating in environments where humans understand the existing dangers, human involvement in the learning process has proved a promising approach to AI Safety. Here we demonstrate that a precise framework for learning from human input, loosely inspired by the way humans parent children, solves a broad class of safety problems in this context. We show that our Parenting algorithm solves these problems in the relevant AI Safety gridworlds of Leike et al. (2017), that an agent can learn to outperform its parent as it "matures", and that policies learnt through Parenting are generalisable to new environments.




1 Introduction

Within the next generation, autonomous learning agents could be regularly participating in our lives, for example in the form of assistive robots. Some variant of reinforcement learning (RL), in which an agent receives positive feedback for taking desirable actions, will be used to teach such robots to perform effectively. These agents will be extensively tested prior to deployment but will still need to operate in novel environments (e.g. someone’s home) and to learn customised behaviour (e.g. family norms). This necessitates a safe approach to RL applicable in such contexts.

As humans begin to delegate complex tasks to autonomous agents in the near future, they should participate in the learning process, as such tasks are difficult to precisely specify beforehand. Human involvement will be especially useful in contexts where humans understand both the desirable and dangerous behaviours, and can therefore act as teachers. We assume this context throughout the paper. This scope is broad, as humans safely raise children – an encouraging natural example of autonomous learning agents – from infancy to perform most tasks in our societies. In this spirit, we introduce an approach to RL in this paper that loosely mimics parenting, with a focus on addressing the following specific safety concerns:

List 1. Safety Concerns

  • Unsafe exploration (Pecka & Svoboda, 2014):
    the agent performs dangerous actions in trial-and-error search for optimal behaviour.

  • Reward hacking (Clark & Amodei, 2016):
    the agent exploits unintended optima of a naively specified reward function.

  • Negative side effects (Amodei et al., 2016):
    to achieve specified goal optimally, the agent causes other undesirable outcomes.

  • Unsafe interruptibility (Soares et al., 2015):
    the agent learns to avoid human interruptions that interfere with maximisation of specified rewards.

  • Absent supervisor (Armstrong, 2017):
    the agent learns to alter behaviour according to presence or absence of a supervisor that controls rewards.

These challenging AI Safety problems are expounded further in Amodei et al. (2016); also see Leike et al. (2017) for an introduction to the growing literature aimed at resolving them. While progress is certainly being made, a general strategy for safe RL remains elusive.

The fact that these AI Safety concerns have analogues in child behaviour, all allayed with careful parenting, further motivates our approach to mitigating them. In this paper, we introduce a framework for learning from human input, inspired by parenting and based on the following techniques:

List 2. Components of Parenting Algorithm

  1. Human guidance: mechanism for human intervention to prevent the agent from taking dangerous actions,

  2. Human preferences: second mechanism for human input through feedback on clips of the agent’s past behaviour,

  3. Direct policy learning: supervised learning algorithm to incorporate data from (1) and (2) into the agent’s policy,

  4. Maturation: novel technique for gradually optimising the agent’s policy in spite of the myopic algorithm in (3); uses human feedback on progressively lengthier clips.

We define these components of our Parenting algorithm in detail in Sec. 2, but first we note the loose analogues that the techniques of List 2 have in human parenting. Human guidance is when parents say “no” or redirect a toddler attempting something dangerous. Human preferences are analogous to parents giving after-the-fact feedback to older children. Direct policy learning is simple obedience: children should respect their parent’s preferences, not disobey as an experiment in search of other rewards. Maturation is the process by which children grow up, becoming more autonomous and often outperforming their parents.

The idea to use human input in the absence of a trusted reward signal is an old one (Russell, 1998; Ng et al., 2000), and the literature on this approach remains rich and active (Hadfield-Menell et al., 2016, 2017). Variations of methods 1 – 3 of List 2 have been studied individually elsewhere: the human intervention employed by Saunders et al. (2018), human preferences introduced by Christiano et al. (2017), and supervised learning adopted by Knox & Stone (2009) are the variants most similar to ours. In this work, we show how these techniques can be combined; this requires important deviations from previous work and necessitates the introduction of maturation – technique 4 of List 2 – to maintain effectiveness. Our main contributions are threefold:

List 3. Main Contributions

  • We introduce a novel algorithm for supervised learning from human preferences (techniques 2 and 3 of List 2) that, in our assumed context, is not susceptible to the reward specification problems of List 1. We demonstrate this in gridworld (Sec. 3.2). Our use of supervised learning avoids the task-dependent hyperparameter tuning that would be necessary to instead infer a safe reward function (Sec. 4.2).


  • To additionally address unsafe exploration, we incorporate human intervention (technique 1 of List 2) into our algorithm as a separate avenue for human input. This combines nicely with our supervised learning algorithm (technique 3), which itself avoids the unsafe trial-and-error approach to optimising rewards. We demonstrate this in gridworld as well (Sec. 3.2.1).

  • One drawback of our supervised learning algorithm is that it provides a near-sighted approach to RL, the agent’s actions effectively dictated by previous human input. To overcome this, we introduce the novel procedure of maturation. This allows the agent to learn a safe policy quickly but myopically from early human input (technique 3 of List 2) then gradually optimise it with human feedback on progressively lengthier clips of behaviour (technique 4). We check maturation’s effectiveness in gridworld (Sec. 3.3) and show its connection to value iteration (Sec. 4.1).

2 Parenting Algorithm

Here we introduce the defining components of our Parenting algorithm. Although one could employ select techniques from this section independently, when applied together they address the full set of safety concerns in List 1.

2.1 Human Guidance

Human guidance provides a mechanism for the agent’s human trainer, or “parent”, to prevent dangerous actions in unfamiliar territory. [Footnote 1: We use “parent” (noun) to refer to the agent’s human trainer and (verb) to refer to the application of the Parenting algorithm.] When the agent finds its surroundings dissimilar to those already explored, it pauses and only performs an action after receiving parental approval.

To be specific, the Parenting algorithm calls for an agent acting with policy $\pi(a \mid s)$, the probability it will take action $a$ when in state $s$. The policy is trained on a growing data set of parental input. While navigating its environment, the agent monitors the region nearby. This local state should be defined context-appropriately; in gridworld, we used the 4 cells accessible in the agent’s next step. Before each step, the agent computes the familiarity of its local state; in gridworld, we defined familiarity as the number of previously made queries to the parent while in that local state. [Footnote 2: In complex environments, familiarity might be determined using methods similar to those in Savinov et al. (2018), where the novelty of a state is judged by a neural-network comparator.] The agent then computes the probability that it should pause to ask for guidance, which is controlled by a tunable hyperparameter. If it pauses, the agent draws 2 distinct actions from $\pi$ and queries its parent’s preference. [Footnote 3: These should be high-level human-understandable candidate actions rather than, e.g., primitive motor patterns. Such candidate actions might be shown to the human by means of a video forecast.] The parent can reply decisively, or with “neither” to force a re-draw if both actions are unacceptably dangerous, or with “either” if both actions are equivalently desirable. The agent then performs the chosen action, storing the parent’s preference in the data set.
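The guidance step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the exponential form of the pause probability, the `scale` hyperparameter, and the names `GuidanceGate` and `draw_two_distinct` are all assumptions, since the exact formula is not shown in this text. Familiarity is taken, as in gridworld, to be the number of earlier queries made in the same local state.

```python
import math
import random
from collections import defaultdict


class GuidanceGate:
    """Decides when to pause for guidance, based on local-state familiarity."""

    def __init__(self, scale=1.0, rng=None):
        self.scale = scale                     # hypothetical tunable hyperparameter
        self.query_counts = defaultdict(int)   # familiarity: past queries per local state
        self.rng = rng or random.Random()

    def pause_probability(self, local_state):
        # ASSUMPTION: exponential decay in familiarity; the source formula is not shown.
        return math.exp(-self.query_counts[local_state] / self.scale)

    def should_pause(self, local_state):
        return self.rng.random() < self.pause_probability(local_state)

    def record_query(self, local_state):
        self.query_counts[local_state] += 1


def draw_two_distinct(action_probs, rng):
    """Draw 2 distinct actions from a policy given as a dict action -> probability.

    Assumes at least two actions have non-zero probability.
    """
    actions = list(action_probs)
    first = rng.choices(actions, weights=[action_probs[a] for a in actions])[0]
    rest = [a for a in actions if a != first]
    second = rng.choices(rest, weights=[action_probs[a] for a in rest])[0]
    return first, second
```

Under this assumed form, an unvisited local state always triggers a query, and repeated queries in the same local state make further pauses progressively less likely.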

While different mechanisms for human intervention have been proposed by Lin et al. (2017) and Saunders et al. (2018) to mitigate unsafe exploration, Parenting uniquely pairs human guidance with a method to quickly incorporate such intervention into policy, to be discussed in Sec. 2.3 below.

2.2 Human Preferences

Human guidance is utilised in unfamiliar territory. Otherwise, Parenting employs human preferences as a second human-input method: the agent selectively records clips of its behaviour for the parent to later review in pairs.

To be explicit, if there is no query for guidance in a particular time step, the agent decides probabilistically whether it should begin recording its behaviour. If not, the agent simply draws its next action from $\pi$. See Sec. 2.4 for the subtle method of drawing recorded actions. Suffice it to say here that the agent records its behaviour in clips of length $T$, alternating between exploitative and exploratory clips. After performing the action, the agent decides probabilistically whether to attempt a human preference query. When doing so, it searches for a pair of recorded clips, one exploitative and one exploratory, that (in gridworld) share the same initial state but have different initial actions. If a match is found, the agent queries its parent’s preference and stores it in the data set.

A broad class of AI Safety problems stem from misalignment of the specified RL reward function with the true intentions of the programmer (Dewey, 2011; Amodei et al., 2016; Ortega et al., 2018). Careful use of human preferences to determine desirable behaviours, without a specified reward function, can eliminate such specification problems.

Parenting’s implementation of human preferences is most similar to that of Christiano et al. (2017), with the main differences being: (i) the requirement of similar initial states in paired clips, and (ii) the approach to training the agent’s policy on the preferences, to be discussed next. For other approaches to human input, see Fürnkranz et al. (2012); Akrour et al. (2012); Wirth et al. (2017); Leike et al. (2018).

2.3 Direct Policy Learning

Parenting includes direct policy learning to quickly incorporate human input into policy: $\pi$ is trained directly as a predictor of the parent’s preferred actions.

After each time step, the agent decides probabilistically whether to take a gradient descent step on the parental input in its data set. Each entry in the data set corresponds to a past query for guidance or preference and consists of two clips as well as the parent’s response.

Each clip is a sequence of state-action pairs, and entries corresponding to human guidance contain only a single candidate action per clip. The response label indicates either the parent’s preference for one of the two clips or a tie. The loss function for gradient descent is the binary cross-entropy:


where the agent’s policy $\pi(a \mid s)$ is interpreted as the probability that, from state $s$, the parent prefers action $a$ over other possibilities. Note that the loss is a function of only the first time step in each sequence (justified in Sec. 2.4).
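One plausible reading of this loss, offered as a sketch rather than as the paper's exact expression, is shown below. The entry layout `(state, action_1, action_2, label)`, the label convention (1.0 for the first clip, 0.0 for the second, 0.5 for a tie), and the function name `preference_loss` are assumptions made for illustration. Only the first action of each clip enters the loss, as the text states.

```python
import numpy as np


def preference_loss(policy_probs, entries, eps=1e-12):
    """Mean cross-entropy between policy probabilities and parental preferences.

    policy_probs: callable mapping a state to an array of action probabilities.
    entries: list of (state, action_1, action_2, label) tuples, where action_i is
             the first action of clip i and label is 1.0, 0.0, or 0.5 (ASSUMED).
    """
    total = 0.0
    for state, a1, a2, label in entries:
        probs = policy_probs(state)
        # Preferred first actions are pushed towards high probability.
        total -= label * np.log(probs[a1] + eps)
        total -= (1.0 - label) * np.log(probs[a2] + eps)
    return total / len(entries)
```

A policy that already places most of its probability on the preferred action incurs a smaller loss than a uniform policy, which is the behaviour a gradient step on this quantity would reinforce.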

Direct policy learning ensures the agent does not contradict previous human input. Paired with Parenting’s incorporation of human guidance, this powerfully combats the problem of unsafe exploration. By contrast, inferring a reward function from human input (Leike et al., 2018) would not by itself mitigate unsafe exploration, as the agent would repeatedly trial dangerous actions during policy optimisation to maximise total rewards. Inferring a reward function from human preferences can also be ambiguous; see Sec. 4.2. An alternative use of supervised policy learning can be found in Knox & Stone (2009), where the human must provide a perpetual reinforcement signal (positive or negative) in response to the agent’s ongoing behaviour. Parenting’s approach to direct policy learning from human preferences utilises an easier-to-interpret signal and only requires the human to review a small subset of the agent’s actions.

2.4 Maturation

By itself, direct policy learning would provide a myopic approach to RL, the agent’s every move effectively dictated by a human. Maturation provides a mechanism for optimisation beyond the human’s limited understanding of an effective strategy. The idea is simple: While the parent may not recognise an optimal action in isolation, the parent will certainly assign preference to that action if simultaneously shown the benefits that can accrue in subsequent moves. Maturation thus calls for the agent to present progressively lengthier clips of its behaviour for feedback. This novel technique is crucial for Parenting’s effectiveness: it is detailed below, demonstrated experimentally in Sec. 3.3, and shown to be a form of value iteration under certain mathematical assumptions in Sec. 4.1.

Figure 1: AI Safety gridworlds (Leike et al., 2017): (a) Unsafe Exploration, (b) Reward Hacking, (c) Negative Side Effects, (d) Unsafe Interruptibility, (e) Absent Supervisor. Light-blue agent ‘A’ must navigate to green goal ‘G’ avoiding dangers that capture the essence of specific AI Safety problems. Full environment descriptions are given in Sec. 3.2.

Parenting begins with the agent querying for preferences on recorded sequences of length $T = 1$. Let us call the agent’s policy during this stage of the algorithm $\pi_1$. The agent records two types of sequences: exploitative length-1 sequences take the form

while their exploratory counterparts are drawn as

Upon convergence, $\pi_1$ produces length-1 sequences optimally with respect to the parent’s preferences. [Footnote 4: To judge convergence, humans can use a quantitative auxiliary measure to monitor performance. Since no feedback is based on this measure, it is not accompanied by the usual safety concerns.] The agent then matures to a new policy, $\pi_2$, initialised to $\pi_1$ and trained through feedback on recorded sequences of length $T = 2$. [Footnote 5: The increment is appropriate for gridworld but may need modification in other contexts; see footnote 3.] Exploitative length-2 recordings take the form

while exploratory sequences are drawn as

The goal here is to optimise action choice for length-2 sequences. Since the final state-action pair in each sequence is a length-1 sub-sequence, $\pi_1$ is already trained to draw this action optimally. Thus, while length-2 recordings are drawn using both $\pi_2$ and $\pi_1$, they should be used solely to train $\pi_2$. This is compatible with Eq. (2) (where the $\pi$’s should have subscripts for completeness).

Once $\pi_2$ converges, the agent matures to $\pi_3$. Recordings of length $T = 3$ are drawn from $\pi_3$, $\pi_2$, and $\pi_1$ analogously to Eqs. (5) and (6). Through maturation, the agent’s behaviour optimises for progressively longer sequences.

An example might clarify why recordings are drawn sequentially from $\pi_T$, $\pi_{T-1}$, …, $\pi_1$. In chess, suppose the parent is only smart enough to see 1 move ahead, and that $\pi_1$ is already trained. For $\pi_2$ to learn to see 2 moves ahead, the agent should present length-2 sequences to the parent, where the first action is chosen with $\pi_2$ but the second is not. Even if $\pi_2$ can detect a checkmate 2 moves from the second state, the human will not realise the value of the move and may penalise the sequence, because the human does not know the optimal state-action value function. Instead, the second action should be chosen with $\pi_1$, which is already optimised with respect to the parent’s preferences when there is one move to go.

Importantly, maturation only requires the parent to recognise improvements in the agent’s performance; the human need not understand the agent’s evolving strategy (see Sec. 3.3).
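The sequential drawing of recorded actions can be sketched as below. This covers exploitative clips only, since the way exploratory actions are drawn is not recoverable from this text. The function name `record_clip`, the representation of each policy as a callable returning a dict of action probabilities, and the deterministic `step` function are assumptions made for illustration.

```python
import random


def record_clip(policies, step, state, rng):
    """Record a length-T exploitative clip.

    policies: list [pi_1, ..., pi_T]; each maps a state to a dict action -> probability.
    step: deterministic transition function (state, action) -> next state (hypothetical).
    Successive actions are drawn from pi_T first and pi_1 last, so the policy used
    at each step matches the number of moves remaining in the clip.
    """
    clip = []
    for policy in reversed(policies):
        probs = policy(state)
        actions = list(probs)
        action = rng.choices(actions, weights=[probs[a] for a in actions])[0]
        clip.append((state, action))
        state = step(state, action)
    return clip
```

In this sketch only the newest policy would be trained on the resulting clip; the earlier policies are used purely to complete the recording.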

3 Experiments

To test the safety of our Parenting algorithm in a controlled way, we performed experiments in the AI Safety gridworlds of Leike et al. (2017), designed specifically to capture the fundamental safety problems of Sec. 1. Select gridworlds are shown in Fig. 1 and described in Sec. 3.2 below.

3.1 Experimental Setup

3.1.1 Network Architecture

We used a neural-network policy $\pi(a \mid s)$ that maps the state of gridworld to a probability distribution over actions. The state $s$ is represented by a three-dimensional matrix: the first two dimensions are the gridworld’s dimensions, and the third, whose size is the number of object-types present, gives a one-hot encoding of the object sitting in each cell. There are 4 possible actions in any state: up, down, left, right.

The neural network has two components. The local component maps the local state, comprised of the agent’s 4 neighbouring cells, through a dense layer of 64 hidden units, to an output layer with 4 linear units. The global component passes the full state through several convolutional layers before mapping it, through a separate dense layer with 64 hidden units, to a separate output layer with 4 linear units. All hidden units have rectifier activations. The convolutional processing includes up to 4 layers with a common kernel size, stride length 1, and filter counts 16, 32, 64, 64. [Footnote 6: The number of layers depends on the dimensions of the gridworld and is chosen to reduce the spatial size of the state matrix.] The local and global output layers are first averaged, then softmaxed, to give a probability distribution over actions. This setup was implemented using Python 2.7.15 and TensorFlow 1.12.0.
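The final combination step, averaging the two 4-unit output layers and applying a softmax, can be illustrated in isolation. The sketch below uses NumPy rather than the TensorFlow 1.12 setup mentioned in the text, and the function name `combine_outputs` is hypothetical.

```python
import numpy as np


def combine_outputs(local_logits, global_logits):
    """Average the local and global linear outputs, then softmax over 4 actions."""
    avg = (np.asarray(local_logits, dtype=float)
           + np.asarray(global_logits, dtype=float)) / 2.0
    shifted = avg - avg.max()      # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()
```

Because the averaging happens before the softmax, a strong preference in either component can dominate the final distribution even when the other component is indifferent.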

3.1.2 Hyperparameters

The Parenting algorithm of Sec. 2 has several hyperparameters. Unless noted otherwise, we set

and held the recording length constant at . (Maturation is tested separately in Sec. 3.3.) We also included entropy regularisation (Williams & Peng, 1991) to control the rigidity of the agent’s policy, with coefficients and for the separate neural-network policy components. We used Adam with default parameters for optimisation (Kingma & Ba, 2014).

3.1.3 Substitute for Human Parent

For convenience, we did not use an actual human parent in our experiments. Instead we programmed a parent to respond to queries in the following way.

We assume the parent has an implicit understanding of a reward function $r$ the agent should optimise and has intuition for a safe policy the agent could adopt. This is reasonable given the context assumed in Sec. 1. Furthermore, we assume the parent favours sequences with greater total advantage:

$$\sum_{t=0}^{T-1} \left[ Q(s_t, a_t) - V(s_t) \right] \qquad (7)$$

where $V$ ($Q$) is the state (state-action) value function with respect to the parent’s safe policy (Sutton & Barto, 1998). In a deterministic environment, this quantity is equivalent to: [Footnote 7: One could also impose a discount factor on the sequence. We kept the sequence undiscounted except in Secs. 3.2.1 and 3.3, where a discount was applied.]

$$\sum_{t=0}^{T-1} r(s_t, a_t) + V(s_T) - V(s_0) \qquad (8)$$

i.e. the total reward the parent could accrue as a result of the sequence (both during and after) minus what the parent expected to accrue following the baseline instead.
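The equivalence between the two forms follows from a telescoping sum: in a deterministic environment each state-action value is the immediate reward plus the value of the next state, so the intermediate state values cancel in pairs. The sketch below checks this numerically on a toy five-cell chain. The environment, its reward values, and the "always move right" parent policy are hypothetical and are not taken from the paper's gridworlds.

```python
GOAL = 4  # hypothetical terminal cell on a chain of cells 0..4


def step(state, action):
    """Deterministic chain: action is -1 or +1; positions clamp to [0, GOAL]."""
    return max(0, min(GOAL, state + action))


def reward(state, action):
    # Hypothetical rewards: +10 for entering the goal, -1 for any other step.
    return 10.0 if step(state, action) == GOAL else -1.0


def parent_value(state):
    """Undiscounted state value under the parent's baseline policy (always +1)."""
    total = 0.0
    while state != GOAL:
        total += reward(state, +1)
        state = step(state, +1)
    return total


def parent_q(state, action):
    return reward(state, action) + parent_value(step(state, action))


def total_advantage(state, actions):
    """Sum over the sequence of Q(s_t, a_t) - V(s_t)."""
    total = 0.0
    for a in actions:
        total += parent_q(state, a) - parent_value(state)
        state = step(state, a)
    return total


def reward_form(state, actions):
    """Sum of rewards along the sequence, plus V(final state) minus V(initial state)."""
    start, rewards = state, 0.0
    for a in actions:
        rewards += reward(state, a)
        state = step(state, a)
    return rewards + parent_value(state) - parent_value(start)
```

In this toy chain, a clip that follows the baseline has zero total advantage, while a clip that steps away from the goal and back has negative advantage, so the simulated parent would prefer the former.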

To motivate these assumptions, experiments in psychology suggest that human feedback does not correspond directly to a reward function (Ho et al., 2018). Instead, MacGlashan et al. (2017) argue that humans do naturally base feedback on an advantage function. What is novel in our implementation is that the advantage is computed with respect to the parent’s safe baseline policy – without requiring an understanding of the agent’s evolving policy. Experiments with real human feedback in more complicated environments are needed to test whether this is reasonable in general. Note also that since we compute Eq. (7) exactly in gridworld, our experiments assume perfect human feedback. This assumption should be relaxed in more realistic future tests.

3.1.4 Pre-Training

Unless noted otherwise, our agent enters Parenting after pre-training with policy gradients (Sutton & Barto, 1998) to solve general path-connected mazes containing a single goal cell. The reward function in these mazes grants for reaching the goal and for each passing time step. A pre-training step with Adam (Kingma & Ba, 2014) was taken every 16 episodes, and Parenting did not begin until the average reward earned per maze converged. A pre-training step was also taken after each training step during Parenting, to ensure this knowledge is not forgotten.

In general, pre-training reduces Parenting’s requisite human effort by allowing humans to focus on subtle safety concerns, rather than problems safely solved by other means.

3.2 Safety Tests

Here we describe our experiments on the AI Safety problems of Sec. 1, highlighting the components of the Parenting algorithm that solve each.

3.2.1 Unsafe Exploration

For unsafe exploration, we performed experiments in the gridworld of Fig. 1(a). Parental input was given according to Eq. (8) with a reward function that grants the light-blue agent for reaching the green goal, for remaining on land, and for falling in dark-blue water, which terminates the episode. We experimented with:

  • Traditional RL: used policy gradients as in pre-training

  • Direct Policy Learning: set to to disable guidance queries

  • Lax Parenting: default hyperparameters (Sec. 3.1.2),

  • Conservative Parenting: cautious hyperparameters,

To emphasise the exploration required, we did not pre-train agents here. We trained our agent from scratch to optimality 1000 times with each of the 4 algorithms and monitored the average number of water-deaths in each trial. The mean and standard deviation of the training deaths for each algorithm are shown in Table 1, along with the number of parenting queries used. The agent suffered thousands of training deaths before reaching an optimal policy with traditional RL, compared to just 0 or 1 with conservative parenting. [Footnote 8: Similar experiments in (Leike et al., 2017) using modern RL algorithms yielded roughly comparable results.] This demonstrates the effectiveness of human guidance and direct policy learning at mitigating unsafe exploration.

3.2.2 Reward Hacking

Reward hacking is modelled in Fig. 1(b), where the blue agent must water dry yellow plants, which then turn green. Plants turn dry with 5% probability per time step. The agent can “reward hack” by stepping in the turquoise bucket of water, which makes the entire garden appear watered and green. If the agent calculates rewards by counting green cells, it will be attracted to this dangerous policy. Parenting avoids this problem through its reliance on human input, as the parent will never prefer a clip in which the agent steps in water. Since this environment is ideal to test maturation, we postpone experimental results to Sec. 3.3.

                          Training Deaths   Guidance Queries   Preference Queries
Traditional RL            2300 ± 700        –                  –
Direct Policy Learning    47 ± 14           –                  51 ± 13
Lax Parenting             15 ± 7            25 ± 5             6 ± 5
Conservative Parenting    0.6 ± 0.8         49 ± 8             0

Table 1: Deaths in the Unsafe Exploration gridworld before optimal policy was learnt. “Lax” and “Conservative” refer to different hyperparameter choices. Each table entry was computed from 1000 trials.

3.2.3 Negative Side Effects

Negative side effects are addressed in Fig. 1(c), where the blue agent must navigate to the green goal in the presence of a movable dark-cyan box. Pushing the box into a corner is an irreversible action, representing a real-life irreparable side effect (e.g. a broken vase). While going around the box or moving it reversibly is desired, the agent can reach the goal fastest by pushing the box down into the corner. If rewards are based solely on speed, the agent will adopt this dangerous behaviour. In contrast, since the parent would never reinforce a highly undesirable action, Parenting is not susceptible to negative side effects. This environment is also useful for testing whether behaviours learnt through Parenting are generalisable or simply memorised; we thus postpone a discussion of results to Sec. 3.4.

3.2.4 Unsafe Interruptibility

Unsafe interruptibility is represented in Fig. 1(d), where the agent must navigate to the goal in the presence of a pink interruption cell and a purple button. If the agent enters the pink cell, there is a 50% chance it will be frozen there for the remainder of the episode, prevented from reaching the goal. Upon pressing the purple button, the pink cell disappears along with the threat of interruption.

If the agent simply gets rewarded for speed in reaching the goal, it will learn to press the button – not a safely interruptible policy. Parenting, in contrast, is safely interruptible because the parent would never favour a clip of the agent avoiding human interruption, and there are no rewards left on the table if an episode is terminated early.

To test this, we parented an agent in Fig. 1(d) for 50 queries, then checked whether its argmax policy involved pressing the purple button. In 100 repeated trials, it never did.

3.2.5 Absent Supervisor

The absent supervisor problem is modelled in Fig. 1(e). Parental input is based on Eq. (8) where the reward function assigns for reaching the green goal and each passing time step. If the supervisor is present, represented by red side bars, there is a punishment of for taking the shortcut through the yellow cell. With the supervisor absent, the punishment disappears.

Parenting naturally gives no signal when the agent’s actions are not viewed by a supervisor, so we parented our agent for 50 queries in the present-supervisor gridworld. [Footnote 9: For this experiment, we used the default hyperparameters of Sec. 3.1.2 except for one, adjusted to weaken one dependence of the policy.] Upon deployment in the absent-supervisor gridworld, we checked whether its argmax policy involved stepping in the yellow cell. In 100 repeated trials, it never did.

Because Parenting omits feedback on unsupervised actions, the absent supervisor problem becomes an issue of distributional shift (Sugiyama et al., 2017). As long as the supervisor’s absence does not cause an important change in the agent’s environment, its policy should carry over intact. (To reiterate: in Parenting, no signal is associated with the supervisor’s “leaving”.) We tested this in gridworld as well: in trials reported above, the agent’s policy remained optimal with supervisor removed (while in , the supervisor’s absence ran the agent into a wall).

3.3 Validation of Maturation

Being the most complex of the gridworlds (with configurations of watered and dry plants) the Reward Hacking environment described in Sec. 3.2.2 is ideal for testing maturation. For this experiment, the parent responds to queries as in Sec. 3.1.3, with a reward function that grants for a legitimate plant-watering and otherwise.

Suppose the parent’s policy is to water plants in a repeating clockwise trajectory around the garden’s perimeter – a good perpetual strategy, but suboptimal for short episodes. Nevertheless, the reliance of Parenting on human judgement should not limit the agent’s potential for optimisation. When judging recordings of length $T = 1$, the parent will prefer clips in which the agent successfully waters a dry plant, even if by an anti-clockwise step – see Eq. (7) or (8). The agent will thus learn to go against the parent’s policy and take a single anti-clockwise step, if it earns an extra watering. Upon maturation to $T = 2$, the agent will learn to take 2 anti-clockwise steps, if this offers an advantage over the parent’s policy. The agent will thus learn to outperform its parent.

To test this, we set the episode length to 10 and initialised the gridworld to Fig. 1(b). We parented our agent for 1000 queries at each clip length before maturing to clips of length , using . Fig. 2 shows the resulting mean-waterings-per-episode at each stage, each mean being computed over 1000 episodes. The entire experiment was repeated 3 times to compute the standard deviations on the means (error bars in the figure). For comparison, policy gradients were used to train an RL agent to convergence (with the unsafe bucket cell removed from the environment!) whose mean score is also shown. While the parent’s policy achieves roughly 2 waterings per episode, the RL agent exceeds 5. Despite this, maturation takes the agent to near-optimality, confirming its effectiveness. Parenting thus provides a safe avenue for autonomous learning agents to solve problems competently and creatively.

Figure 2: Maturation of agent’s policy toward optimality in Reward Hacking environment. By learning from human feedback on lengthier recordings of its behaviour, the agent gradually optimises its policy to outperform its parent and approach the effectiveness of traditional RL.

3.4 Generalisability of Parented Behaviours

It is important to understand whether Parenting teaches behaviours abstractly, allowing lessons learnt to generalise, or if the agent merely memorises its parent’s preferred trajectory. Generalisability is critical for real-world applications. Consider a manufacturer that parents household robots in a variety of environments, both simulated and real, so that customers would have little extra parenting required for customisation at home. In this context, pre-training is analogous to first using RL to teach the robot to navigate rooms in safe simulations, to reduce the required parenting by the manufacturer’s employees.

We tested generalisability in the Side Effects environment of Sec. 3.2.3. To begin, we randomly generated path-connected gridworlds like Fig. 1(c) that contain 1 goal, 1 box, any number of walls, and that are solvable only by moving the box. We discarded those generated gridworlds that the pre-trained agent could already solve. We kept 50 unique gridworlds satisfying these requirements, designating 10 of them for parenting and setting aside 40 for pre-parenting.

For one experiment, we took an agent that was not pre-trained and parented it from scratch to optimality in the designated gridworlds (cycling through them during training). We repeated this for 10,000 trials and histogrammed the number of required queries in Fig. 3. For the other experiments, we took a pre-trained agent and pre-parented it in 0, 20, or 40 of the set-aside environments before parenting in the designated gridworlds. The corresponding histograms in Fig. 3 show the benefits of pre-training and pre-parenting. More pre-parenting reduces the number of queries required for safe operation in new environments, thus confirming Parenting’s generalisability.

Figure 3: Generalisability of parenting to new environments. Agent was pre-trained to solve mazes then pre-parented to solve 0, 20, or 40 unique Side Effects environments, before being parented to optimality in held-out Side Effects environments. Queries required for held-out gridworlds are histogrammed, with 10,000 trials for each setting.

4 Discussion

In this section, we provide theoretical arguments that motivate our design of maturation and direct policy learning.

4.1 Maturation as Value Iteration

The maturation process of Sec. 2.4 effectively optimises the agent’s policy because of its connection to value iteration in dynamic programming (Sutton & Barto, 1998). To demonstrate this, we will make the same assumptions as in Sec. 3.1.3. (These include perfect human feedback, which is necessary for this idealised discussion.) Let us also assume that $\pi_T$ converges in all relevant regions of state-space before maturation to $\pi_{T+1}$.

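We will work in a deterministic environment for clarity, so that the parent’s preferences on sequences of a given length are determined by computing, on each sequence, the reward collected along it plus the value of the state in which it ends, with that value defined with respect to the parent’s policy. (We omit the term that depends only on the initial state, since it drops out of comparisons when clips are chosen with the same, or sufficiently similar, initial states.) Under these assumptions, maturation is equivalent to value iteration. To show this, we prove that maturation trains the agent’s policy at each sequence length to maximise the corresponding finite-horizon action value. This quantity is defined recursively: for each length greater than one, it is the immediate reward plus the best action value available in the next state at the previous length; in the base case of length one, it is the immediate reward plus the value of the next state under the parent’s policy.

For length one, the policy is trained to optimise the base-case quantity because the parent responds to preference queries based on this quantity. For length two, sequences are recorded by successively drawing from the length-two policy then the length-one policy, as prescribed in Sec. 2.4. The involvement of the length-one policy, which already maximises the base-case quantity, implies that the length-two policy is trained to maximise the length-two quantity, as required. Assuming the claim is true for all shorter sequence lengths, the argument for a general length is similar: because sequences are recorded by drawing actions successively from the policy for the current length, then the next-shorter length, and so on down to length one, the current policy is trained to maximise the quantity for the current length. The claim is thus proved by induction.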
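The recursion argued above can be written out symbolically. The following is a sketch under assumed notation, none of which is recovered from the source: $\pi_P$ is the parent’s policy, $V^{\pi_P}$ its state-value function, $r(s,a)$ the reward, $s'$ the deterministic successor of $(s,a)$, $S_T$ the score the parent assigns to a length-$T$ sequence, and $Q_T$ the quantity the length-$T$ policy is claimed to maximise. No discount factor is included, which is also an assumption.

```latex
% Assumed notation: this is one reading consistent with the text,
% not the authors' original equations.
\begin{align}
  S_T(s_0, a_0, \dots, a_{T-1})
    &= \sum_{t=0}^{T-1} r(s_t, a_t) + V^{\pi_P}(s_T), \\
  Q_1(s, a)
    &= r(s, a) + V^{\pi_P}(s'), \\
  Q_T(s, a)
    &= r(s, a) + \max_{a'} Q_{T-1}(s', a') \qquad (T > 1).
\end{align}
```

In this reading, if the first action of a length-$T$ sequence is free and the remaining actions are drawn from policies that already maximise $Q_{T-1}, \dots, Q_1$, then $S_T = r(s_0, a_0) + \max_{a'} Q_{T-1}(s_1, a') = Q_T(s_0, a_0)$, which is the inductive step.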

Note that the quantity maximised at each stage is the same one that appears in value iteration (Sutton & Barto, 1998), which converges to optimality. The agent’s policy thus progressively outperforms the parent’s policy as the sequence length increases. Importantly, this process does not require the human parent to understand the agent’s improving policy, just to recognise improving performance.
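The connection to value iteration can be illustrated on a toy problem. The sketch below is not the Parenting algorithm: it replaces preference-based training with exact greedy maximisation, and the four-state deterministic environment, the parent’s policy, and all names (`STEP`, `PARENT`, `rollout_return`, `mature`) are hypothetical. Each stage backs up the previous stage’s values by one step, starting from the value function of a suboptimal parent.

```python
# Hypothetical deterministic environment: (state, action) -> (next_state, reward).
GOAL = "G"
STEP = {
    (0, "A"): (1, -1), (0, "B"): (2, -1),
    (1, "A"): (GOAL, -1), (1, "B"): (3, -1),
    (2, "A"): (GOAL, -3), (2, "B"): (GOAL, -3),
    (3, "A"): (GOAL, -5), (3, "B"): (GOAL, -5),
}
STATES = [0, 1, 2, 3]
ACTIONS = ["A", "B"]

# Hypothetical suboptimal parent: it takes a costly detour from state 1.
PARENT = {0: "A", 1: "B", 2: "A", 3: "A"}


def rollout_return(policy, state=0):
    """Total reward collected by following `policy` from `state` to the goal."""
    total = 0
    while state != GOAL:
        state, reward = STEP[(state, policy[state])]
        total += reward
    return total


def mature(num_stages):
    """Return the start-state return of the greedy policy after each stage."""
    # Stage 1 bootstraps from the parent's value function.
    values = {s: rollout_return(PARENT, s) for s in STATES}
    values[GOAL] = 0
    returns = []
    for _ in range(num_stages):
        # One-step backup: immediate reward plus the value of the next state.
        q = {
            (s, a): STEP[(s, a)][1] + values[STEP[(s, a)][0]]
            for s in STATES
            for a in ACTIONS
        }
        policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES}
        returns.append(rollout_return(policy))
        # The next stage bootstraps from this stage's best action values.
        values = {s: max(q[(s, a)] for a in ACTIONS) for s in STATES}
        values[GOAL] = 0
    return returns


print("parent:", rollout_return(PARENT))
print("stages:", mature(3))
```

In this toy environment the parent earns a return of -7 from the start state, the first stage earns -4, and later stages earn -2, which is optimal here. The first stage cannot do better because it still relies on the parent’s poor value estimate for state 1; the second stage corrects this by looking one step further ahead.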

4.2 Ambiguity in Preference-Based Reward Models

If, in contrast to Parenting, one uses human preferences to learn a reward model, there are subtleties one needs to overcome to ensure the corresponding optimised policy is consistent with human desires. We include this discussion here as it influenced the design of our algorithm.

Suppose recordings of fixed length are shown to a human in pairs to obtain preference data. Let us assume that the human intuitively understands a reward function and (in this section only) favours clips that earn more total reward. Then one could fit a reward model to the preference data. However, there is a shift ambiguity in the model: a reward model and the same model shifted by any constant describe the data equally well, because the shift adds the same amount to the totals of both clips in every same-length comparison and so never changes which clip is preferred.


This ambiguity can be eliminated by fixing the mean reward value. However, the reward function’s mean can have a substantial effect on the learnt behaviour. See Fig. 4 for an example. Suppose a reward model assigns fixed values for reaching the goal, for causing an irreversible side effect, and for every other step, such that the trajectory of Fig. 4(a) accrues less total reward than the trajectory of Fig. 4(b). Optimisation of this model would thus avoid the irreversible side effect. However, the two trajectories differ in length, so a constant shift changes their totals by different amounts, and under a suitably shifted reward model the unsafe trajectory scores higher than the safe one. Optimisation of the shifted model would thus cause an irreversible side effect, against the human’s wishes.

(a) Irreversible Side Effect
(b) Safe Solution
Figure 4: Two trajectories in the Side Effects gridworld. Unsafe trajectory (a) optimises one reward function, while the safe solution (b) optimises a constant-shifted version of it.

This problem can be overcome in practice by experimentally tuning the mean of the reward function, as well as its higher moments. However, this hyperparameter tuning would need to be repeated for each new task and would introduce a new type of unsafe exploration (of hyperparameter space).

This problem occurs because human preferences on same-length sequences are shift-invariant with respect to the reward function, while reinforcement learning is not. Parenting avoids this problem through direct policy learning, which respects the symmetries of human preferences and thus does not require problem-by-problem tuning. (The hyperparameters of Sec. 3.1.2 control the rate of mistakes and speed of learning, rather than affecting the agent’s learnt policy.)
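The asymmetry between preferences and optimisation can be checked numerically. The sketch below uses hypothetical per-step rewards (1 at the goal, -1 for the side effect, 0 otherwise), hypothetical trajectory lengths, and a hypothetical shift of -1; none of these values are taken from the source, and the names `total` and `prefers_first` are illustrative only.

```python
# Hypothetical per-step rewards: 1 at the goal, -1 for the side effect, 0 otherwise.
UNSAFE = [0, -1, 0, 1]            # short route that causes the side effect
SAFE = [0, 0, 0, 0, 0, 0, 0, 1]   # longer route that avoids it
SHIFT = -1                        # constant added to every step's reward


def total(rewards, shift=0):
    """Total reward of a clip after adding `shift` to every step."""
    return sum(r + shift for r in rewards)


def prefers_first(clip_a, clip_b, shift=0):
    """True if clip_a earns strictly more (shifted) reward than clip_b."""
    return total(clip_a, shift) > total(clip_b, shift)


# Same-length clips: the preference is unchanged by the shift.
same_length = (SAFE[-4:], UNSAFE)
print(prefers_first(*same_length), prefers_first(*same_length, shift=SHIFT))

# Full trajectories of different lengths: the shift changes the ranking.
print(total(SAFE), total(UNSAFE))
print(total(SAFE, SHIFT), total(UNSAFE, SHIFT))
```

With these numbers the same-length comparison favours the safe clip with or without the shift, while the full-trajectory totals are 1 versus 0 before the shift and -7 versus -4 after it, so an optimiser of the shifted model would choose the unsafe route.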

5 Conclusion

In the context of near-future autonomous agents operating in environments where humans already understand the risks, Parenting offers an approach to RL that addresses a broad class of relevant AI Safety problems. We demonstrated this with controlled experiments in the purpose-built AI Safety gridworlds of Leike et al. (2017). Importantly, the fact that Parenting solves these problems is not particular to gridworlds; it is due to the fact that humans can solve these problems, and Parenting allows humans to safely teach RL agents. Furthermore, we have seen that two potential downsides of Parenting can be overcome: (i) through the novel technique of maturation, a parented agent is not limited to the performance of its parent; and (ii) parented behaviours generalise to new environments, which can be used to reduce requisite human effort in the learning process. We hope the framework introduced here provides a useful step forward in the pursuit of a general and safe RL programme applicable for real-world systems.


This work was developed and experiments were run on the Faculty Platform for machine learning. The authors benefited from discussions with Owain Evans, Jan Leike, Smitha Milli, and Marc Warner. The authors are grateful to Jaan Tallinn for funding this project.