Optimizing Agent Behavior over Long Time Scales by Transporting Value

by Chia-Chun Hung, et al.

Humans spend a remarkable fraction of waking life engaged in acts of "mental time travel". We dwell on our actions in the past and experience satisfaction or regret. More than merely autobiographical storytelling, we use these event recollections to change how we will act in similar scenarios in the future. This process endows us with a computationally important ability to link actions and consequences across long spans of time, which figures prominently in addressing the problem of long-term temporal credit assignment; in artificial intelligence (AI) this is the question of how to evaluate the utility of the actions within a long-duration behavioral sequence leading to success or failure in a task. Existing approaches to shorter-term credit assignment in AI cannot solve tasks with long delays between actions and consequences. Here, we introduce a new paradigm for reinforcement learning where agents use recall of specific memories to credit actions from the past, allowing them to solve problems that are intractable for existing algorithms. This paradigm broadens the scope of problems that can be investigated in AI and offers a mechanistic account of behaviors that may inspire computational models in neuroscience, psychology, and behavioral economics.


The Reconstructive Memory Agent

We solve this task with a vision and memory-based agent, which we name the Reconstructive Memory Agent (RMA) (Figure 1c), which is based on a previously published agent model[20] but simplified for the present study. Critically, this agent model combines a reconstruction process to compress useful sensory information with memory storage that can be queried by content-based addressing[22, 23, 24] to inform the agent’s decisions. The RMA itself does not have specialized functionality to subserve long-term temporal credit assignment but provides a basis for the operation of the Temporal Value Transport algorithm, which does.

In this model, an image frame $I_t$, the previous reward $r_{t-1}$, and the previous action $a_{t-1}$ constitute the observation $o_t$ at time step $t$. These inputs are processed by encoder networks and merged into an embedding vector $e_t$, which is to be combined with the output of a recurrent neural network (RNN) based on the Differentiable Neural Computer[24]. This RNN consists of a recurrent LSTM “controller” network and a memory matrix $M$ of dimension $N \times W$. The output of this RNN and memory system from the previous time step consists of the LSTM output $h_{t-1}$ and $k$ ($=3$ here) vectors of length $W$ read from memory, $m_{t-1} = (m_{t-1}^{(1)}, \ldots, m_{t-1}^{(k)})$, which we refer to as memory read vectors. Together, these outputs are combined with the embedding vector by a feedforward network into a “state representation” $z_t = f(e_t, h_{t-1}, m_{t-1})$. Importantly, the state representation has the same dimension $W$ as a memory read vector. Indeed, once produced it will be inserted into the memory at the next time step into the $t$-th row: $M[t, \cdot] \leftarrow z_t$.

Before this occurs, however, the RNN carries out one cycle of reading from memory and computation. The state representation $z_t$ is provided as input to the RNN, alongside the previous time step’s memory read vectors $m_{t-1}$, to produce the next $h_t$. Then reading memory to produce the current time step’s memory read vectors occurs: $k$ read keys $k_t^{(1)}, \ldots, k_t^{(k)}$ of dimension $W$ are produced as a function of $h_t$, and each key is matched against every row $n$ using a similarity measure $S(k_t^{(i)}, M[n, \cdot])$. The similarities are scaled by a positive read strength parameter $\beta_t^{(i)}$ (also computed as a function of $h_t$), to which a softmax over the weighted similarities is applied. This creates an attentional read weight vector $w_t^{(i)}$ with dimension $N$, which is used to construct the $i$-th memory read vector $m_t^{(i)} = \sum_{n=1}^{N} w_t^{(i)}[n] \, M[n, \cdot]$.
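The read cycle described above can be illustrated with a minimal pure-Python sketch of one read head. It is not the authors' implementation: the function names, the toy memory, and the choice of cosine similarity for the similarity measure are assumptions made for illustration (the text only specifies "a similarity measure", a positive read strength, and a softmax).

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity (an assumed choice); 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def content_read(memory, key, read_strength):
    """One read head: return (read_weights, read_vector).

    memory        -- list of N rows, each a list of W floats
    key           -- list of W floats
    read_strength -- positive scalar that sharpens the attention
    """
    scores = [read_strength * cosine_similarity(key, row) for row in memory]
    weights = softmax(scores)
    width = len(memory[0])
    # The read vector is the attention-weighted sum of memory rows.
    read_vector = [sum(w * row[j] for w, row in zip(weights, memory))
                   for j in range(width)]
    return weights, read_vector

# Toy memory with N = 3 rows of width W = 2 (hypothetical values).
memory = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
weights, read_vector = content_read(memory, key=[1.0, 0.0], read_strength=5.0)
```

A larger read strength concentrates the weights on the best-matching row, while a read strength of zero yields uniform attention over all rows.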

The state representation $z_t$ is also sent to decoder networks whose objective functions require them to produce reconstructions $\hat{I}_t, \hat{r}_{t-1}, \hat{a}_{t-1}$ of the observations (the carets denote approximate quantities produced by networks) while also predicting the value function $\hat{V}(z_t)$. This process ensures that $z_t$ contains useful sensory information in a compressed format. Finally, the state representation $z_t$ and RNN outputs $(h_t, m_t)$ are provided as input to the policy network to construct the policy distribution $\pi(a_t \mid z_t, h_t, m_t)$, which is a multinomial distribution over the discrete action space here. At each time step, an action $a_t$ is sampled and applied to the environment.

When trained on Passive Visual Match, all the agents we tested succeeded at the apple collection distractor task (Supplementary Figure 1), although only the RMA learned to solve the distal reward task by appropriately selecting the same colored square in P3 as was seen in P1 (Figure 1d). A comparison agent without an external memory (the LSTM agent) was able to achieve only slightly better than chance performance in P3, and a comparison agent with an external memory but no reconstruction objective decoding observation data from $z_t$ (the LSTM+Mem agent) also performed worse than the RMA. The reconstruction process in the RMA helps to build and stabilize perceptual features in $z_t$ that can later be found by memory retrieval[20]. The solution of the RMA was robust. In Supplementary Figure 2, we demonstrate equivalent results for 0, 15, 30, 45, and 60 second distractor intervals: the number of episodes required to learn remained roughly independent of the delay (Supplementary Figure 3). Additionally, for more complicated visual stimuli consisting of CIFAR images[25], the RMA was also able to make correct matching choices (Supplementary Figure 4).

Despite the delay between P1 and P3, Passive Visual Match does not require long-term temporal credit assignment. The cue in P1 is automatically observed; an agent only needs to encode and retrieve a memory to move to the correct pad in P3 – a process that is relatively brief. Consequently, an agent with a small discount factor $\gamma = 0.96$ ($\tau = 1/(1-\gamma) = 25$ steps at 15 frames per second, giving a 1.67 second half-life) was able to solve the task. However, the ability to encode and attend to specific past events was critical to the RMA’s success. In Figure 1e, we see the attentional weighting vector $w_t$ produced by one of the RMA read keys in an episode at time step 526, which corresponds to the beginning of P3. The weighting was sparsely focused on memories written in the first few episode time steps, during the instants when the agent was encoding the colored square. The learned memory retrieval identified relevant historical time points and bridged the 30 second distractor interval. Recall of memories in the RMA is driven by the demand of predicting the value function $\hat{V}$ and producing the policy distribution $\pi$. As we have seen, these objectives allowed the agent to automatically detect past time points that were relevant to its current decision.
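The time-scale arithmetic behind the quoted 1.67 second figure can be checked with a short sketch. The discount factor of 0.96 and the rate of 15 frames per second are assumed here because together they reproduce that figure; the helper name is hypothetical.

```python
def discount_time_constant(gamma):
    """Effective horizon, in steps, of a discount factor gamma < 1."""
    return 1.0 / (1.0 - gamma)

gamma = 0.96            # assumed discount factor
frames_per_second = 15  # assumed environment step rate

tau_steps = discount_time_constant(gamma)     # about 25 steps
tau_seconds = tau_steps / frames_per_second   # about 1.67 seconds

# Weight that a reward arriving after a 30 second delay receives
# from the point of view of the current step.
delay_steps = 30 * frames_per_second
weight_after_delay = gamma ** delay_steps
```

Under these assumptions a reward 30 seconds in the future is weighted by roughly 1e-8, which is why such an agent can only succeed when the decisive behavior and its reward are close together in time, as in Passive Visual Match.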

Figure 2: Temporal Value Transport and Type 1 Information Acquisition Tasks. a. First-person (upper row) and top-down view (lower row) in Active Visual Match task while the agent is engaged in the task. In contrast to Passive Visual Match, the agent must explore to find the colored square, randomly located in a two-room environment. The agent and colored square are indicated by the yellow and red arrow, respectively. b. Without rewards in P2, RMA models with large discount factors (near 1) were able to solve the task; the RMA with exhibited retarded but definite learning with modest P2 reward (1 point per apple). c. Cartoon of the Temporal Value Transport mechanism: the distractor interval is spliced out, and the value prediction from a time point in P3 is directly added to the reward at time in P1. d. The TVT agent alone was able to solve Active Visual Match with large rewards during the P2 distractor, and faster than agents exposed to no distractor reward. The RMA with discount factor was able to solve a greater than chance fraction because it could randomly encounter the colored square in P1 and retrieve its memory in P3.

We now turn to a type 1 information acquisition task, Active Visual Match, that truly demands long-term temporal credit assignment. Here, in P1 the agent must actively seek out a colored square, randomly located in a two-room maze, so that it can accurately decide on the match in P3 (Figure 2a). If an agent finds the visual cue by chance in P1, then it can use this information in P3, but this will only be successful at random. As in Passive Visual Match, the agent engages in a 30 second distractor task of apple collection during P2. When the rewards of P2 apples were set to 0, RMAs with discount factors sufficiently close to 1 were able to solve the task (Figure 2b, dashed lines). With a randomized number of apples worth one point each, the RMAs with ultimately began to learn the task (Figure 2b, solid line, medium blue) but were slower in comparison to the no P2 reward case. For a fixed mean reward per episode in P2 but increasing variance, RMA agent performance degraded entirely (Supplementary Figure 5). Finally, for the principal setting of the level, where each P2 apple is worth five points, and the P2 reward variance is , all comparison models (the LSTM agent, LSTM+Mem agent, and RMA) failed to learn P1 behavior optimized for P3 (Figure 2d). For , RMAs reached a score of about 4.5, which implies slightly better than random performance in P3: this was because RMAs solved the task in cases where they accidentally sighted the cue in P1.

Temporal Value Transport

Temporal Value Transport (TVT) is a learning algorithm that augments the capabilities of memory-based agents to solve long-term temporal credit assignment problems. The insight behind TVT is that we can combine attentional memory access with reinforcement learning to fight variance by automatically discovering how to ignore it, effectively transforming a problem into one with no delay at all. A standard technique in RL is to estimate the return for the policy gradient calculation by bootstrapping[7]: using the learned value function, which is deterministic and hence low variance but biased, to reduce the variance in the return calculation. We denote this bootstrapped return as $\tilde{R}_t := r_t + \gamma \hat{V}_{t+1}$. The agent with TVT (and the other agent models considered here) likewise bootstraps from the next time step and uses a discount factor to reduce variance further. However, it additionally bootstraps from the distant future.
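The variance argument can be made concrete with a small simulation. This is only an illustrative sketch: the episode layout (75 quiet steps followed by 450 distractor steps), the apple probability, the apple reward, and the discount factor are hypothetical values, not the paper's task settings.

```python
import random
import statistics

def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], accumulated backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

def sample_episode(rng, quiet_steps=75, distractor_steps=450,
                   apple_prob=0.05, apple_reward=5.0):
    """Rewards for a quiet first phase followed by a noisy distractor phase."""
    rewards = [0.0] * quiet_steps
    rewards += [apple_reward if rng.random() < apple_prob else 0.0
                for _ in range(distractor_steps)]
    return rewards

rng = random.Random(0)  # fixed seed so the sketch is reproducible
episodes = [sample_episode(rng) for _ in range(200)]

# Return as seen from the first time step, without and with discounting.
undiscounted = [discounted_return(ep, 1.0) for ep in episodes]
discounted = [discounted_return(ep, 0.96) for ep in episodes]

var_undiscounted = statistics.variance(undiscounted)
var_discounted = statistics.variance(discounted)
```

In this toy setting the discounted return seen from the first step has far lower variance than the undiscounted one, because the noisy distractor rewards are heavily down-weighted. The price is that any genuinely relevant distal reward is down-weighted as well, which is the gap that bootstrapping from the distant future is meant to close.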

Figure 3: Analysis of Agent in Active Visual Match. a. In P1, TVT trained on Active Visual Match actively sought out and oriented to the colored square. RMA meandered randomly. b. Its attentional read weights focused maximally on the memories from time points when it was facing the colored square. c. With statistics gathered over 20 episodes, TVT’s average value function prediction in P1 (blue) was larger than the actual discounted reward trace (green) – due to the transported reward. Difference shown in gray. The RMA value function in contrast matched the discounted return very closely. d. The P3 rewards for TVT rose during learning (upper panel) after the maximum read strength per episode first crossed threshold on average (lower panel, red line).

In Figure 2c, we highlight the basic principle behind TVT. We previously saw in the Passive Visual Match task that the RMA reading mechanism learned to retrieve a memory from P1 in order to produce the value function prediction and policy in P3. This was a purely automatic process determined by the needs of the agent in P3. We exploit this phenomenon to form a link between the time point $t'$ (occurring, for example, in P3) at which the retrieval occurs and the time $t$ at which the retrieved memory was encoded. This initiates a splice event in which the bootstrapped return calculation for $t$ is revaluated to $\tilde{R}_t := r_t + \gamma \hat{V}_{t+1} + \alpha \hat{V}_{t'+1}$, where $\alpha$ is a form of discount factor that diminishes the impact of future value over multiple stages of TVT. From the perspective of learning at time $t$, the credit assignment is conventional: the agent tries to estimate the value function prediction based on this revaluated bootstrapped return, and it calculates the policy gradient based on it as well. The bootstrapped return can trivially be regrouped as $\tilde{R}_t := (r_t + \alpha \hat{V}_{t'+1}) + \gamma \hat{V}_{t+1}$, which facilitates the interpretation of the transported value as a fictitious reward introduced to time step $t$.

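  input: $\{r_t\}_{t \in [1,T]}$, $\{\hat{V}_t\}_{t \in [1,T]}$, read strengths $\{\beta_t\}_{t \in [1,T]}$, read weights $\{w_t\}_{t \in [1,T]}$
  splices := []
  for each crossing of read strength $\beta_t$ above $\beta_{\text{threshold}}$ do
     Let $t_{\max}$ be the arg maximum over $t$ of $\beta_t$ in crossing window
     Append $t_{\max}$ to splices
  end for
  for $t$ in 1 to $T$ do
     for $t'$ in splices do
        if $t < t' - 1/(1-\gamma)$ then
           $r_t := r_t + \alpha \, w_{t'}[t] \, \hat{V}_{t'+1}$ {The read based on $z_{t'}$ influences value prediction at next step, hence $\hat{V}_{t'+1}$.}
        end if
     end for
  end for
Algorithm 1 Temporal Value Transport for One Read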

This characterization is broadly how TVT works. However, in detail there are multiple practical issues to understand further. First, the TVT mechanism only triggers a splice event when a memory retrieval is sufficiently strong: in particular, this occurs whenever a read strength $\beta_t^{(i)}$ is above a threshold value, $\beta_{\text{threshold}}$. Second, each of the $k$ memory reading processes operates in parallel, and each can independently trigger a splice event. Third, instead of linking to a single past event, the value at the time of reading $t'$ is transported back to all time points $t$ with a strength proportional to the attentional weighting $w_{t'}[t]$. Fourth, value is not transported to events that occurred very recently, where recently is any time within one half-life $\tau = 1/(1-\gamma)$ of the reading time $t'$. Pseudocode for the TVT algorithm is shown in Algorithm 1, and further implementation details are discussed in Supplement Section 5.
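The four practical points above can be put together in a short, self-contained Python sketch of value transport for a single read head. It is an illustrative reading of Algorithm 1, not the authors' code: the function names, the 0-based indexing, the handling of a crossing window that is still open at the end of the episode, and all numeric values in the example are assumptions.

```python
def find_splices(read_strengths, threshold):
    """Return the time of peak read strength in each above-threshold window."""
    splices = []
    window = []  # time indices of the current crossing window
    for t, beta in enumerate(read_strengths):
        if beta > threshold:
            window.append(t)
        elif window:
            splices.append(max(window, key=lambda i: read_strengths[i]))
            window = []
    if window:  # assumption: close a window left open at the episode end
        splices.append(max(window, key=lambda i: read_strengths[i]))
    return splices

def temporal_value_transport(rewards, values, read_strengths, read_weights,
                             gamma, alpha, threshold):
    """Return a copy of rewards with transported value added.

    values[t] is the value prediction at step t, and read_weights[t][s] is the
    attention placed at step t on the memory written at step s.
    """
    horizon = 1.0 / (1.0 - gamma)  # recent steps inside this span get nothing
    new_rewards = list(rewards)
    last = len(rewards) - 1
    for t_read in find_splices(read_strengths, threshold):
        if t_read + 1 > last:
            continue  # no next-step value prediction available to transport
        for t in range(len(rewards)):
            if t < t_read - horizon:
                new_rewards[t] += alpha * read_weights[t_read][t] * values[t_read + 1]
    return new_rewards

# Hypothetical episode: a strong read at step 50 attends mostly to step 5.
T = 60
rewards = [0.0] * T
values = [0.0] * T
values[51] = 10.0
read_strengths = [1.0] * T
read_strengths[49:52] = [2.5, 3.0, 2.2]
read_weights = [[0.0] * T for _ in range(T)]
read_weights[50][5] = 0.8
read_weights[50][45] = 0.2
new_rewards = temporal_value_transport(rewards, values, read_strengths,
                                       read_weights, gamma=0.9, alpha=0.9,
                                       threshold=2.0)
```

In this example the step-5 reward receives a fictitious reward of about 7.2 (0.9 × 0.8 × 10), whereas step 45 receives nothing because it lies within one time constant of the read. With several read heads, the same procedure would be run once per head and the contributions summed.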

When applied to the Active Visual Match task with large distractor reward, an RMA model equipped with TVT (henceforth just TVT) learned the behavior in P1 that produced distal reward in P3; it also learned the task faster than did any RMA with no distractor reward (Figure 2b&d). The difference in learned behavior was dramatic: TVT reliably sought out and oriented toward the colored square in P1, while the standard RMA behaved randomly (Figure 3a). Figure 3b overlays on the agent’s trajectory (arrowheads) a coloring based on the read weight produced at the time of a TVT splice event in P3: TVT learned to read effectively from memories in P1 associated with the time points for which it was viewing the colored square. During the learning process, we see that the maximum read strength recorded per episode (Figure 3d, lower panel) began to reach threshold (lower panel, red line) early and prior to producing P3 reward reliably (Figure 3d, upper panel), which then instigated the learned behavior in P1. After training, TVT’s value function prediction directly reflected the fictitious rewards. Averaged over 20 trials, the value function in P1 (Figure 3c, left panel, blue curve) was higher than the actual discounted return, $\sum_{t' \geq t} \gamma^{t'-t} r_{t'}$ (Figure 3c, left panel, green curve). The RMA agent with discounting did not show a similar difference between the discounted return and the value function (Figure 3c, right panel). In both Figure 3c panels, we see bumps in P3 in the return traces due to the distal reward: TVT achieved higher reward in general, with the RMA return reflecting only chance performance. Further, we examined whether TVT could solve problems with even longer distractor intervals, in this case with a P2 interval of 60 seconds. TVT also learned here (Supplementary Figure 6).

Figure 4: Type 2 Causation Tasks. a. First person (upper row) and top-down view (lower row) in Key-to-Door task. The agent (indicated by yellow arrow) must pick up a key in P1 (black arrow), collect apples in P2, and, if it possesses the key, it can open the door (green arrow) in P3 to acquire the distal reward (blue arrow). b. Learning curves for P3 reward (TVT in black). Although this task requires no memory for the policy in P3, computing the value prediction still triggers TVT splice events, which promote key retrieval in P1. c. Increasing the standard deviation of reward available in P2 disrupted the performance of LSTM agents at acquiring the distal reward. d. On 20 trials produced by a TVT agent, we compared the variance of the TVT bootstrapped return against the undiscounted return. The TVT return’s variance was orders of magnitude lower. Vertical lines mark phase boundaries. e. Saliency analysis of the pixels in the input image in P1 that the value function gradient is sensitive to. The key pops out in P1.

Temporal Value Transport can also solve type 2 causation tasks, where the agent does not need to acquire information in P1 for P3 but instead must cause an event that will affect the state of the environment in P3. Here, we study the Key-to-Door (KtD) task in which an agent must learn to pick up a key in P1 so that it can unlock a door in P3 to obtain reward (Figure 4a). Although no information from P1 must be recalled in P3 to inform the policy’s actions (the optimal decision is to move toward the door in P3 regardless of the events in P1), TVT still learned to acquire the key in P1 because it read from memory to predict the value function when positioned in front of the door in P3 (Figure 4b, black), while all other agents failed to pick up the key reliably in P1 (Figure 4b blue, pink, green). In this case, the P2 reward variance was comparatively low – with the only variance due to a randomized number of apples but with each apple consistently giving . In higher SNR conditions (low P2 reward variance), even LSTM agents with were able to solve the task, indicating that a large memory itself is not the primary factor in task success (Figure 4c). TVT specifically assisted in credit assignment. However, the LSTM agents could learn only for small values of P2 reward variance, and performance degraded predictably as a function of increasing reward variance in P2 (Figure 4c, dark to light green curves). For the same setting as Figure 4b, we calculated the variance of either the TVT bootstrapped return for each time point, over 20 episodes, and compared on the same episodes to the variance of the undiscounted return, (Figure 4d). Because it exploits discounting, the variance of the bootstrapped return of TVT was nearly two orders of magnitude smaller in P1. We next asked if the agent attributed the fictitious reward transported to P1 in an intelligent way to the key pickup. 
In P1, using a saliency analysis similar to[26], we calculated the gradient of the value function prediction with respect to the input image and shaded the original input image in proportion to the magnitude of this quantity (Supplement Section 8.2). In Figure 4e, we see that this produced a segmentation of the key, indicating that the P1 value prediction was most sensitive to the observation of the key. As a control experiment, in Supplementary Figure 7, we tested if there needed to be any surface-level similarity between visual features in P3 and the encoded memory in P1 for memory retrieval to function. With a blue instead of a black key, TVT also solved the task as easily, indicating that the memory searches could flexibly find information with a somewhat arbitrary relationship to current context.

One can understand how TVT learned to solve this task as a progression. Initially, on a small fraction of the episodes, the agent picked up the key at random. From this point, the agent learned, on encountering the door, to retrieve memories from P1 that identified if the agent picked up the key in order to predict the return in P3 accurately (this is what RMA did as well). Whenever the memories from P1 were retrieved, splice events were triggered that transported value back to the behavioral sequences in P1 that led to key pickup.

Figure 5: Transport across Multiple Phases. a. Key-to-Door-to-Match (KtDtM) task. The agent (yellow arrow) must pick up a key (black arrow) in P1, to open a door (green arrow) and encode a colored square (red arrow) in P3, to select the matching colored square in P5. P2 and P4 are distractor apple collecting tasks. b. TVT (black) solved this task, whereas RMA (blue) solved the P5 component of the task when it by chance retrieved the P1 key and opened the door in P3. c. The value function prediction (blue) in TVT developed two humps where it was above the discounted return trace (green), one in P1, one in P3, encoding the value of achieving the “sub-goals” in P1 and P3.

The introduction of transported value can come at a cost. When a task has no actual need for long-term temporal credit assignment, spurious triggering of splice events can send value back to earlier time points and bias the agent’s activity. To study this issue, we examined performance of TVT on a set of independently developed RL tasks that were designed in a context where standard discounted RL was expected to perform well. We compared the performance of the LSTM agent, the LSTM+Mem agent, RMA, and TVT. TVT generally performed on par with RMA on many tasks but slightly worse on one (Supplementary Figures 8-9) and outperformed all of the other agent models, including LSTM+Mem. We also considered whether TVT would function when P3 reward was strictly negative, but a behavior in P1 could be developed to avert a larger disaster. In the Two Negative Keys task, the agent is presented with a blue key and red key in a room in P1. If the agent picks up the red key, it will be able to retrieve a distal reward behind a door in P3 worth ; if it picks up the blue key, it will be able to retrieve a distal reward worth , and if it does not pick up a key at all, it is penalized in P3. TVT was also able to solve this task (Supplementary Figure 10).

Having established that TVT was able to solve relatively simple problems, we now demonstrate TVT’s capability in two more complex scenarios. The first of these is an amalgam of the KtD and the Active Visual Match task, which demonstrates temporal value transport across multiple phases – the Key-to-Door-to-Match task (KtDtM); here, an agent must exhibit two non-contiguous behaviors to acquire the distal reward.

In this task, instead of a three-phase structure, we have five phases: P1-P5 (Figure 5a). P2 and P4 are both long distractor phases involving apple collection distractor rewards. In P1 and P3, there are no rewards. In P1, the agent must fetch a key, which it will use in P3 to open a door to see a colored square. In P5, the agent must choose the groundpad in front of the colored square matching the one that was behind the door in P3. If the agent does not pick up the key in P1, it is locked out of the room in P3 and cannot make the correct choice in P5. TVT solved this task reliably (Figure 5b), whereas all other agents solved this problem only at chance level in P5, and did not pursue the key in P1. As might be expected, the TVT value function prediction rose in P1, P3, and P5 (Figure 5c), with two humps where the P1 and P3 value functions were above the discounted return traces. Because the discount factor for TVT transport was relatively large (0.9), the two humps in the value prediction were of comparable magnitude.

Figure 6: More Complex Information Acquisition. a. In Latent Information Acquisition, the agent (yellow arrow) must touch three procedurally generated objects to identify from a subsequent color flash if each is either green or red. In P3, green objects yield positive reward and red objects negative. b. TVT performed well on this task (black curve). c. In 20 trials, we plot the positional coverage in P1 of a TVT agent compared to RMA. TVT developed exploratory behavior in P1: it navigated among the six possible locations where the P1 objects could be placed, whereas the RMA typically moved into the corner. d. A quantification over 20 trials of the exploratory behavior in P1: TVT usually touched all three of the objects in P1, whereas RMA touched about one.

Finally, we look at a richer information acquisition task, Latent Information Acquisition (Figure 6a). In P1, the agent begins in a room surrounded by three objects with random textures and colors drawn from a set. During P1, each object has no reward associated with it. When an object is touched by the agent, it disappears and a color swatch (green or red) appears on the screen. Green swatches indicate that the object is good, and red swatches indicate it is bad. The number of green- and red-associated objects was balanced on average. In P2, the agent again collects apples for 30 seconds. In P3, the agent must collect only the objects that were associated with a green swatch.

The TVT agent alone was able to solve the task (Figure 6b, black curve), usually touching all three objects in P1 (Figure 6d), whereas the RMA touched only one object on average; TVT outperformed the non-TVT agents by a wide margin (Figure 6b, other colors). The non-TVT agents all exhibited pathological behavior in P1. In P1, the objects were situated on a grid of six possible locations (with no relationship to P3 location). TVT learned an exploratory sweeping behavior whereby it efficiently covered the locations where the objects were present (Figure 6c), whereas RMA reliably moved into the same corner, thus touching by accident only one object.


The mechanism of TVT should be compared to other recent proposals to address the problem of long-term temporal credit assignment. The Sparse Attentive Backtracking algorithm[27] in a supervised learning context uses attentional mechanisms over the states of an RNN to propagate backpropagation gradients effectively. The idea of using attention to the past is shared with our work; however, there are substantial differences. Instead of propagating gradients to shape network representations, in the RMA we have used temporally local reconstruction objectives to ensure relevant information is encoded and stored in the memory. Further, backpropagating gradients to RNN states would not actually train a policy’s action distribution, which is the crux of reinforcement learning. Our approach instead modifies the rewards from which the full policy gradient is derived. Like TVT, the RUDDER algorithm[28] has recently been proposed in the RL context to address the problem of learning from delayed rewards. RUDDER uses an LSTM to make predictions about future returns and sensitivity analysis to decompose those returns into reward packets distributed throughout the episode. TVT is explicitly designed to use a reconstructive memory system to compress high-dimensional observations in partially-observed environments and retrieve them with content-based attention. At present, we know of no other algorithm that can solve type 1 information acquisition tasks.

Temporal Value Transport is a heuristic algorithm but one that expresses coherent principles we believe will endure: past events are encoded, stored, retrieved, and revaluated. TVT fundamentally intertwines memory systems and reinforcement learning: the attention weights on memories specifically modulate the reward credited to past events. While not intended as a neurobiological model, the notion that neural memory systems and reward systems are highly co-dependent is supported by much evidence, including the existence of direct dopaminergic projections to hippocampal CA1 and the contribution of D1/D5 dopamine receptors in acquiring task performance in awake-behaving animals[29, 30].

Throughout this work, we have seen that standard reinforcement learning algorithms are compromised when solving even simple tasks requiring long-term behavior. We view discounted utility theory, upon which almost all reinforcement learning is predicated, as the ultimate source of the problem, and our work provides evidence that other paradigms are not only possible but can work better. In economics, paradoxical violation of discounted utility theory has occasioned bountiful scholarship and diverse, incompatible, and incomplete theories[14]. We hope that a cognitive mechanisms approach to understanding “inter-temporal choice” – in which preferences and long-term economic behavior are decoupled from a rigid discounting model – will inspire new ways forward. The principle of splicing together remote events based on episodic memory access may offer a promising vantage from which to begin future study of these issues.

The complete explanation of the remarkable ability of human beings to problem solve and express coherent behaviors over long spans of time remains a profound mystery about which our work only provides a smattering of insight. TVT learns slowly, whereas humans are at times able to discover causal connections over long intervals quickly (albeit sometimes inaccurately). Human cognitive abilities are often conjectured to be fundamentally more model-based than the mechanisms in most current reinforcement learning agents (TVT included)[31] and can provide consciously available causal explanations[32] for events. When the book is finally written on the subject, it will likely be understood that long-term temporal credit assignment recruits nearly the entirety of the human cognitive apparatus, including systems designed for prospective planning, abstract, symbolic, and logical reasoning, commitment to goals over indefinite intervals, and language. Some of this human ability may well require explanation on a different level of inquiry altogether: among different societies, attitudes and norms regarding savings rates and investment vary enormously[33]. There is in truth no upper limit to the time horizons we can conceptualize and plan for.


Correspondence should be addressed to Greg Wayne, Chia-Chun Hung, or Timothy Lillicrap (email: gregwayne, aldenhung, countzero@google.com).



1 Agent Model

At a high level, the Reconstructive Memory Agent (RMA) consists of four modules: (1) an encoder for processing observations at each time step; (2) a memory-augmented recurrent neural network, which contains a deep LSTM “controller” network and an external memory that stores a history of the past, and whose output combines with the encoded observation to produce a state variable representing information about the environment (state variables also constitute the information stored in memory); (3) a policy that takes the state variable and the memory’s recurrent states as input to generate an action distribution; and (4) a decoder, which takes in the state variable and predicts the value function as well as all current observations.

We now describe the model in detail by defining its parts and the loss functions used to optimise it. Parameters given per task are defined in Table 


1.1 Encoder

The encoder is composed of three sub-networks: the image encoder, the action encoder, and the reward encoder. These act independently on the different elements contained within the input set $o_t = (I_t, a_{t-1}, r_{t-1})$, where $I_t$ is the current observed image, and $a_{t-1}$ and $r_{t-1}$ are the action and reward of the previous time step. The outputs from these sub-networks are concatenated into a flat vector $e_t$.

1.1.1 Image Encoder

The image encoder takes in image tensors of size (3 channel RGB). We then apply 6 ResNet [34] blocks with rectified linear activation functions. All blocks have 64 output channels and bottleneck channel sizes of 32. The strides for the 6 blocks are , resulting in 8-fold spatial down-sampling of the original image. Therefore, the ResNet module outputs tensors of size . We do not use batch-norm [35], a pre-activation function on inputs, or a final activation function on the outputs. Finally, the output of the ResNet is flattened (into a element vector) and then propagated through one final linear layer that reduces the size to 500 dimensions, whereupon a nonlinearity is applied.

1.1.2 Action Encoder

In all environments, the action from the previous time step is a one-hot binary vector (6-dimensional here). We use an identity encoder that leaves the action one-hot unchanged.

1.1.3 Reward Encoder

The reward from the previous time step is represented as a scalar and is not processed further.

1.2 Decoder

The decoder is composed of four sub-networks. Three of these sub-networks are matched to the encoder sub-networks of image, previous action, and previous reward. The additional sub-network decodes the value function.

1.2.1 Image Decoder

The image decoder has the same architecture as the encoder except the operations are reversed. In particular, all 2D convolutional layers are replaced with transposed convolutions [36]. Additionally, the last layer produces a number of output channels that is formatted to the likelihood function used for the image reconstruction loss, described in more detail in Eq. 8.

1.2.2 Action and Reward Decoders

The reward and action decoders are both linear layers from the state variable to, respectively, a scalar dimension and the action cardinality.

1.2.3 Value Function Predictor

The value function predictor is a multi-layer perceptron (MLP) that takes in the concatenation of the state variable with the action distribution’s logits. To ensure that learning the value function predictor does not modify the policy, we block the gradient (stop gradient) back through to the policy logits. The MLP has a single hidden layer with an activation function, which then projects via another linear layer to a 1-dimensional output. This function is a state-value function.

1.3 Memory-Augmented RNN

The RNN is primarily based on a simplification of the Differentiable Neural Computer (DNC) [24]. It is composed of a deep LSTM and a slot-based external memory. The LSTM has a recurrent state consisting of its output state and its cells. The memory itself is a two-dimensional matrix in which each row is the same size as a state variable. At the beginning of each episode the memory is initialised blank, i.e., with every row equal to the zero vector. We also carry the memory readouts, a list of vectors read from the memory, as recurrent state.

At each time step, the following steps are taken sequentially:

  1. Generate the state variable from the encoded observation and the recurrent state.

  2. Update the deep LSTM state.

  3. Construct the read key and read from the external memory.

  4. Write the state variable to a new slot in the external memory.

1.4 State Variable Generation

The first step is to generate a state variable that combines the new observation with the recurrent information. We pass the encoded current observation, concatenated with the recurrent information (the previous LSTM output and the previous memory readouts), through an MLP with a single hidden layer.

1.5 Deep LSTMs

We use a deep LSTM [37] of two hidden layers. Although the deep LSTM model has been described before, we summarise it here for completeness. Within each layer there is a recurrent state and a “cell” state, both of which are updated at every time step from the network input by the standard LSTM recursion [37].

To produce a complete output, we concatenate the output vectors from each layer. These are passed out for downstream processing.

1.6 LSTM Update

At each time step, the deep LSTM receives its input, which is concatenated with the memory readouts from the previous time step to form the full input to the LSTM. The deep LSTM equations are applied, and the output is produced.

1.7 External Memory Reading

A linear layer is applied to the LSTM’s output to construct a memory interface vector. The interface vector is then segmented into read keys, each the same length as a memory row, and one scalar per read key. These scalars are passed through a function to create the read strengths.

Memory reading is executed before memory writing. Reading is content-based: it proceeds by computing the cosine similarity between each read key and each memory row. For each key, we then find the indices of the memory rows with the largest similarity values. Note that since unwritten rows of the memory are equal to the zero vector, some of the chosen indices may correspond to rows that are equal to the zero vector.

A weighting vector with one entry per memory row is then computed for each key from the similarities at the selected indices.

For each key, the readout from memory is the memory rows combined according to this weighting. The full memory readout is the concatenation of the readouts across all read heads.
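The read procedure above can be sketched for a single read head as follows. This is a minimal illustration, not the authors' implementation: the function names, the `top_k` argument, and in particular the weighting rule (a softmax over the selected rows, sharpened by the read strength, in the style of DNC content addressing) are assumptions made here, since the weighting formula itself is not reproduced in the text.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity; all-zero (unwritten) rows get similarity 0."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + eps)


def read_memory(memory, key, strength, top_k):
    """Return (weights, readout) for one read key.

    memory:   list of rows, each a list of floats
    key:      read key with the same length as a memory row
    strength: non-negative read strength for this key
    top_k:    number of rows to attend to (hypothetical parameter name)
    """
    sims = [cosine_similarity(key, row) for row in memory]
    # Indices of the top_k most similar rows.
    top = sorted(range(len(memory)), key=lambda j: sims[j], reverse=True)[:top_k]
    # ASSUMPTION: softmax over the selected rows, scaled by the read strength.
    peak = max(strength * sims[j] for j in top)
    exps = {j: math.exp(strength * sims[j] - peak) for j in top}
    total = sum(exps.values())
    weights = [exps[j] / total if j in exps else 0.0 for j in range(len(memory))]
    # Readout: weighted sum of memory rows.
    width = len(memory[0])
    readout = [sum(weights[j] * memory[j][d] for j in range(len(memory)))
               for d in range(width)]
    return weights, readout


memory = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # last row is still unwritten
weights, readout = read_memory(memory, key=[1.0, 0.0], strength=10.0, top_k=2)
```

With these toy values almost all of the weight falls on the first row, and the unselected row receives exactly zero weight, which is what makes the read sparse.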

1.8 External Memory Writing

Writing to memory occurs after reading and is also defined using weighting vectors. The write weighting $v^{wr}_t$ has one entry per memory row and always appends information to the $t$-th row of the memory matrix at time $t$, i.e., $v^{wr}_t[i] = \delta_{it}$ (using the Kronecker delta). The information we write to the memory is the state variable $z_t$. Thus, the memory update can be written as

$$M_t = M_{t-1} + v^{wr}_t z_t^\top .$$
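A minimal sketch of this append-only write, assuming zero-based time indices and plain Python lists for the memory (both are illustrative choices, and `blank_memory` and `write_memory` are hypothetical helper names):

```python
def blank_memory(num_rows, width):
    """Memory at the start of an episode: every row is the zero vector."""
    return [[0.0] * width for _ in range(num_rows)]


def write_memory(memory, t, state_variable):
    """Write state_variable into row t using a one-hot write weighting."""
    # Kronecker delta: weight 1 on row t, 0 everywhere else.
    write_weighting = [1.0 if i == t else 0.0 for i in range(len(memory))]
    # New memory = old memory + outer(write_weighting, state_variable).
    return [[m + w * z for m, z in zip(row, state_variable)]
            for row, w in zip(memory, write_weighting)]


memory = blank_memory(num_rows=3, width=2)
memory = write_memory(memory, t=0, state_variable=[0.5, -1.0])
```

Because each time step writes to its own row, nothing is ever overwritten within an episode, and a read weight on row `t` identifies the time step that is being recalled.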


1.9 Policy

The policy module receives the state variable and the recurrent states of the memory-augmented RNN (the LSTM output and the memory readouts) as inputs. The inputs are passed through a single hidden-layer MLP with 200 units. This then projects to the logits of a multinomial softmax with the dimensionality of the action space. The action is sampled and executed in the environment.

2 Loss Functions

We combine a policy gradient loss with reconstruction objectives for decoding observations. We also have a specific loss that regularizes the use of memory for TVT.

2.1 Reconstruction Loss

The reconstruction loss is the negative conditional log-likelihood of the observations and return, conditioned on the state variable. It is factorised into independent loss terms, one for each decoder sub-network. We use a multinomial softmax cross-entropy loss for the action, mean-squared error (Gaussian with fixed variance of 1) losses for the reward and the value function, and a Bernoulli cross-entropy loss for each pixel channel of the image. The negative conditional log-likelihood loss contribution at each time step is therefore composed of these four terms.



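On all but the standard RL control experiment tasks, we constructed the target return value as the discounted sum of the rewards remaining in the episode. For the standard RL control experiment tasks, we use “truncation windows” [38] in which the time axis is subdivided into fixed-length segments. The full gradient can be viewed as a truncated gradient whose window spans the whole episode. If the window around a time index ends before the episode does, the return within the window is computed from the rewards up to the end of the window, with the value function prediction at the following step standing in for the remaining rewards.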


As a measure to balance the magnitude of the gradients from different reconstruction losses, the image reconstruction loss is divided by the number of pixel-channels .
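The per-time-step reconstruction loss described in this section can be sketched as below. This is an illustration under stated assumptions: the four terms are summed with equal weight, the Gaussian terms drop their constant, and the image is given as a flat list of pixel-channel values in [0, 1] with one decoder logit each. The function names are hypothetical.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Negative log-probability of target_index under softmax(logits)."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_z - logits[target_index]


def gaussian_nll(prediction, target):
    """Unit-variance Gaussian negative log-likelihood, constant dropped."""
    return 0.5 * (prediction - target) ** 2


def bernoulli_cross_entropy(logit, target):
    """Numerically stable Bernoulli cross-entropy for one pixel channel."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def reconstruction_loss(image_logits, image, action_logits, action,
                        reward_pred, reward, value_pred, return_target):
    # Image term is divided by the number of pixel-channels to balance gradients.
    image_loss = sum(bernoulli_cross_entropy(l, x)
                     for l, x in zip(image_logits, image)) / len(image)
    # ASSUMPTION: equal-weight sum of the four decoder terms.
    return (image_loss
            + softmax_cross_entropy(action_logits, action)
            + gaussian_nll(reward_pred, reward)
            + gaussian_nll(value_pred, return_target))


loss = reconstruction_loss(
    image_logits=[0.0, 0.0], image=[1.0, 0.0],
    action_logits=[0.0, 0.0], action=0,
    reward_pred=1.0, reward=1.0,
    value_pred=2.0, return_target=3.0,
)
```

Dividing the image term by the number of pixel-channels keeps it on the same scale as the scalar reward and value terms regardless of image resolution.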

2.2 Policy Gradient

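We use discount and bootstrapping parameters $\gamma$ and $\lambda$, respectively, as part of the policy advantage calculation given by the Generalised Advantage Estimation (GAE) algorithm [39]. Defining the temporal-difference error $\delta_t = r_t + \gamma V_{t+1} - V_t$, where $V_t$ is the value prediction at time $t$, Generalised Advantage Estimation makes an update to the policy parameters $\theta$ of the form:

$$\Delta\theta \propto \sum_{t} \sum_{t' \ge t} (\gamma\lambda)^{t'-t}\, \delta_{t'}\, \nabla_\theta \log \pi_\theta(a_t)$$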


There is an additional loss term that increases the entropy of the policy’s action distribution. This and pseudocode for all of RMA’s updates are provided in Algorithm 2.
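The advantage calculation used in this section follows the standard GAE recursion, which can be sketched as below. The function name and argument layout are illustrative; `bootstrap_value` is the value prediction after the last step of the window, taken as zero when the episode has terminated.

```python
def gae_advantages(rewards, values, bootstrap_value, discount, lam):
    """Generalised Advantage Estimation over one window.

    rewards[t], values[t]: reward and value prediction at step t
    bootstrap_value:       value prediction after the final step (0 if terminal)
    discount, lam:         discount factor and bootstrapping parameter
    """
    advantages = [0.0] * len(rewards)
    next_value = bootstrap_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # Temporal-difference error at step t.
        delta = rewards[t] + discount * next_value - values[t]
        # Exponentially weighted sum of current and future TD errors.
        running = delta + discount * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


adv = gae_advantages(rewards=[0.0, 0.0, 1.0], values=[0.0, 0.0, 0.0],
                     bootstrap_value=0.0, discount=0.5, lam=1.0)
```

Each advantage multiplies the gradient of the log-probability of the action taken at that step. Setting `lam` to 0 reduces the advantage to the one-step TD error, while `lam` equal to 1 uses the full discounted return minus the value baseline.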

2.3 Temporal Value Transport Specific Loss

We include an additional regularization term described in Section 5.3.

3 Comparison Models

We introduce two comparison models: the LSTM+Mem Agent and the LSTM Agent.

3.1 LSTM+Mem Agent

The LSTM+Mem Agent is similar to the RMA. The key difference is that it has no reconstruction decoders and losses. The value function is produced by a one hidden-layer MLP with 200 hidden units.

3.2 LSTM Agent

The LSTM Agent additionally has no external memory system and is essentially the same design as the A3C agent [38]. We have retrofitted the model to share the same encoder networks as the RMA, acting on input observations to produce the same flat encoding vector. This is then passed as input to a deep 2-layer LSTM that is the same as the one in RMA. The LSTM has two output “heads”, which are both one hidden-layer MLPs with 200 hidden units: one for the policy distribution and one for the value function prediction. As for our other agents, the policy head is trained using Eq. 10.

4 Implementation and Optimisation

For optimisation, we used truncated backpropagation through time [40]. We ran 384 parallel worker threads that each ran an episode on an environment and calculated gradients for learning. Each gradient was calculated after one truncation window. For all main paper experiments other than the standard RL control experiments, the truncation window was the length of the episode.

The gradient computed by each worker was sent to a “parameter server” that asynchronously ran an optimisation step with each incoming gradient. We optimise the model using ADAM optimisers [41].

The pseudocode for each RMA worker is presented in Algorithm 2.

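  // Assume global shared model parameter vectors and a global step counter
  // Assume thread-specific parameter vectors
  // Assume discount factor and bootstrapping parameter
  Initialize thread step counter
  repeat
     Synchronize thread-specific parameters with the global parameters
     Zero model’s memory & recurrent state if new episode begins
     repeat
        Encode observation, generate state variable, update LSTM, read memory // (Memory-augmented RNN)
        Update memory
        Compute policy distribution and sample action
        Apply action to environment and receive reward and observation
     until environment termination or end of truncation window
     If not terminated, run additional step to compute the value prediction
     and set the bootstrap return to it // (but don’t increment counters)
     (Optional) Apply Temporal Value Transport (Alg. 3)
     Reset performance accumulators
     for each time step from the last collected step down to the first do
        Accumulate policy gradient term with entropy bonus // (Entropy loss)
        Accumulate reconstruction loss term (Eq. 8)
     end for
     Asynchronously update global parameters via gradient ascent using the accumulated gradients
  until training ends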
Algorithm 2 RMA Worker Pseudocode

For all experiments, we used the open source package Sonnet – available at https://github.com/deepmind/sonnet – and applied its defaults to initialise network parameters.

5 Temporal Value Transport

Temporal Value Transport works in two stages. First, we identify significant memory read events, which become splice events. Second, we transport the value predictions made at those read events back to the time points being read from, where they modify the rewards and therefore the RL updates.

5.1 Splice Events

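At each time step, the read strengths are calculated as described in 1.7. To avoid sending value back to events in the near past, for time points where the read targets a memory written only a short time earlier, we reset the read strength to zero for the remainder of the computation. We then identify splice events by first finding all time windows in which the read strength stays above a threshold throughout the window but is below it immediately before and immediately after.

Within each such window, we then take the time at which the read strength is largest as the time of the splice event.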

5.2 Reward Modification

For each splice event time identified above, we modify the reward of all time points that occurred more than a minimum number of steps beforehand by adding to it a value function prediction sent back from the splice event.

We send back the value function prediction from the time step after the splice event because that is the first value function prediction that incorporates information from the read at the splice event. Additionally, for multiple read processes, the process is the same, with independent, additive changes to the reward at any time step. Pseudocode for Temporal Value Transport with multiple read processes is provided in Algorithm 3.

5.3 Reading Regularization

To prevent the TVT mechanism from being triggered extraneously, we impose a small regularization cost whenever a read strength is above threshold.


This is added to the other loss terms.

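  input: rewards, value predictions, read strengths, read weights
  for each read process do
     for each time step do
        if the read at this time step targets a memory from the near past then
           Set the read strength at this time step to zero
        end if
     end for
     splices = empty list
     for each crossing of read strength above threshold do
        Let the splice time be the time of the maximum read strength in the crossing window
        Append the splice time to splices
     end for
     for each time step in 1 to T do
        for each splice time in splices do
           if the time step is sufficiently far before the splice time then
              Add the transported value, weighted by the read weight onto this time step, to its reward
           end if
        end for
     end for
  end for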
Algorithm 3 Temporal Value Transport for Multiple Reads
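The two stages of Temporal Value Transport can be sketched for a single read process as follows. This is a hedged illustration rather than the authors' code: `threshold`, `min_gap`, and `alpha` are hypothetical parameter names, the scaling by `alpha` and by the read weight is an assumption based on the inputs listed for Algorithm 3, and the memory slot targeted by a read is taken to be the one with the largest read weight.

```python
def temporal_value_transport(rewards, values, read_strengths, read_weights,
                             threshold, min_gap, alpha):
    """Return a copy of rewards with transported value added.

    read_weights[t][s] is the weight that the read at time t places on the
    memory written at time s. All three hyperparameters are assumptions.
    """
    num_steps = len(rewards)
    strengths = list(read_strengths)

    # Suppress reads that target the recent past.
    for t in range(num_steps):
        target = max(range(num_steps), key=lambda s: read_weights[t][s])
        if t - target < min_gap:
            strengths[t] = 0.0

    # Splice events: strongest read inside each above-threshold window.
    splices = []
    t = 0
    while t < num_steps:
        if strengths[t] > threshold:
            end = t
            while end + 1 < num_steps and strengths[end + 1] > threshold:
                end += 1
            splices.append(max(range(t, end + 1), key=lambda u: strengths[u]))
            t = end + 1
        else:
            t += 1

    # Transport the value predicted just after each splice back in time.
    new_rewards = list(rewards)
    for t_splice in splices:
        if t_splice + 1 >= num_steps:
            continue  # no later value prediction exists to send back
        for t in range(num_steps):
            if t < t_splice - min_gap:
                new_rewards[t] += (alpha * read_weights[t_splice][t]
                                   * values[t_splice + 1])
    return new_rewards


T = 6
weights = [[0.0] * T for _ in range(T)]
weights[4][0] = 1.0  # at time 4 the agent recalls the memory from time 0
out = temporal_value_transport(
    rewards=[0.0] * T, values=[0.0, 0.0, 0.0, 0.0, 0.0, 2.0],
    read_strengths=[0.0, 0.0, 0.0, 0.0, 3.0, 0.0], read_weights=weights,
    threshold=2.0, min_gap=2, alpha=0.9,
)
```

In this toy episode the strong read at time 4 of the memory from time 0 causes the value predicted at time 5 to be credited to time 0, so the ordinary RL update at time 0 sees a reward it would otherwise receive only after a long delay.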

6 Signal-to-Noise Ratio Analysis

6.1 Undiscounted Case

As in the article text, we refer to phases 1-3 as P1-P3. We define the signal as the squared norm of the expected policy change in P1 induced by the policy gradient. Let . Further, in the following assume that the returns are baseline-subtracted . Then we define the signal as


We define the noise as the variance of the policy gradient


P1 and P2 are approximately independent as P2 is a distractor phase whose initial state is unmodified by activity in P1. The only dependence is given by the agent’s internal state and parameters, but we assume for these problems it is a weak dependence, which we ignore for present calculations. In this case,


So we have


For Noise, we have


where is the variance in the policy gradient due to P1 and P3 without a P2 distractor phase. We make the assumption that performance in P2 is independent of activity in P1, which is approximately the case in the distractor task we present in the main text. With this assumption, the first term above becomes

Term 1

Thus, the SNR (Signal / Noise) is approximately

In the limit of large P2 reward variance, we have

The reward variance in P2 reduces the policy gradient SNR, and low SNR is known to impact the convergence of stochastic gradient optimization negatively [19]. Of course, averaging over independent episodes increases the SNR correspondingly, but the approach of averaging over an increasing number of samples is not universally possible and only defers the difficulty: there is always a level of reward variance in the distractor phase that matches or overwhelms the variance reduction given by averaging.
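The qualitative conclusion of this section can be checked on a toy problem that is small enough to enumerate exactly. The model below is an assumption made purely for illustration and is not taken from the analysis above: P1 is a single Bernoulli action with a one-parameter policy, P3 pays a reward of 1 when the good action was taken, and P2 adds a zero-mean reward of fixed variance that is independent of the action.

```python
from itertools import product
from math import sqrt


def toy_policy_gradient_snr(p2_reward_var, p_good=0.25, n_episodes=1):
    """Exact SNR (squared mean over variance) of a toy policy gradient.

    The score function of the Bernoulli policy w.r.t. its logit is
    (action - p_good); the return is baseline-subtracted.
    """
    sd = sqrt(p2_reward_var)
    baseline = p_good  # expected return, since the P2 reward is zero-mean
    mean = 0.0
    second_moment = 0.0
    # Enumerate both actions and both signs of the P2 reward.
    for action, sign in product((0, 1), (-1.0, 1.0)):
        prob = (p_good if action == 1 else 1.0 - p_good) * 0.5
        score = action - p_good
        ret = action + sign * sd - baseline
        grad = score * ret
        mean += prob * grad
        second_moment += prob * grad * grad
    signal = mean ** 2
    noise = (second_moment - signal) / n_episodes  # variance of the average
    return signal / noise


for variance in (0.0, 25.0, 100.0, 361.0):
    print(variance, toy_policy_gradient_snr(variance))
```

In this toy model the signal does not depend on the P2 reward variance while the noise grows linearly with it, so the SNR falls off as the inverse of the variance. Averaging `n_episodes` independent episodes multiplies the SNR by `n_episodes`, which can always be cancelled by a proportionally larger distractor variance.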

7 Tasks

All tasks were implemented in DeepMind Lab (DM Lab) [42]. DM Lab has a standardized environment map unit length: all sizes given below are in these units.

7.1 Observation and Action Repeats

For all DM Lab experiments, agents processed 15 frames per second. The environment itself produced 60 frames per second, but we propagated only the first observation of each packet of four to the agents. Rewards accumulated over each packet were summed together and associated to the first, undropped frame. Similarly, the agents chose one action at the beginning of this packet of four frames: this action was applied four times in a row. We define the number of “Agent Steps” as the number of independent actions sampled by the agent: that means one agent step per packet of four frames.
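The frame and action packing described above can be sketched as two small helpers. The function names are hypothetical, and frames are represented by arbitrary Python objects.

```python
def packetize(frames, rewards, repeat=4):
    """Group per-frame data into agent steps.

    Keeps the first frame of each packet and sums the rewards in the packet.
    """
    steps = []
    for start in range(0, len(frames), repeat):
        first_frame = frames[start]
        packet_reward = sum(rewards[start:start + repeat])
        steps.append((first_frame, packet_reward))
    return steps


def expand_actions(agent_actions, repeat=4):
    """Repeat each agent action once per environment frame in its packet."""
    return [action for action in agent_actions for _ in range(repeat)]


# Eight environment frames (60 fps) become two agent steps (15 fps).
agent_steps = packetize(frames=list(range(8)),
                        rewards=[1, 0, 2, 0, 0, 0, 0, 5])
env_actions = expand_actions(["forward", "left"])
```

Here `agent_steps` holds frame 0 with reward 3 and frame 4 with reward 5, and `env_actions` applies each chosen action four times in a row.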

7.2 Action Sets

We used a consistent action set for all experiments except for the Arbitrary Visuomotor Mapping task. For all other tasks, we used a set of six actions: move forward, move backward, rotate left with rotation rate of 30 (mapping to an angular acceleration parameter in DM Lab), rotate right with rotation rate of 30, move forward and turn left, move forward and turn right. For the Arbitrary Visuomotor Mapping, we did not need to move relative to the screen, but we instead needed to move the viewing angle of the agent. We thus used four actions: look up, look down, look left, look right (with rotation rate parameter of 10).

7.3 Themes

DM Lab maps use texture sets to determine the floor and wall textures. We use a combination of four different texture sets in our tasks: Pacman, Tetris, Tron and Minesweeper. DM Lab texture sets can take on various colours but we use the default colours for each set, which are: Pacman: blue floors and red walls. Tetris: blue floor and yellow walls. Tron: yellow floor and green walls. Minesweeper: blue floor and green walls. Examples of how these sets appear can be seen in various figures in the main text.

7.4 Task Phases

Episodes for the tasks with delay intervals are broken up into multiple phases. Phases do not repeat within an episode. Generally, the tasks contain three phases (P1-P3), with the middle phase (P2) serving as a distractor phase.

We used a standardized P2 distractor phase task: the map is an open square (Figure 1b second column). The agent spawns (appears) adjacent to the middle of one side of the square, facing the middle. An apple is randomly spawned independently at each unit of the map with probability 0.3, except for the square in which the agent spawns. Each apple gives a reward of 5 when collected and disappears after collection. The agent remains in this phase for 30 seconds. (This length was varied in some experiments.) The map uses the Tetris texture set unless mentioned otherwise.

7.5 Cue Images

In several tasks, we use cue images to provide visual feedback to the agent, e.g., indicating that an object has been picked up. These cue images are colored rectangles that overlay the input image, covering the majority of the top half of the image. An example of a red cue image is shown in Supplementary Figure 10a, third panel. These cues are shown for 1 second once activated, regardless of a transition to a new phase that may occur during display.

7.6 Primary Tasks

7.6.1 Passive Visual Match

In each episode of Passive Visual Match, four distinct colors are randomly chosen from a fixed set of 16 colors. One of these is selected as the target color and the remaining three are distractor colors. Four squares are generated with these colors, each the size of one wall unit. The three phases in each episode are:

  1. The map is a corridor with a target color square covering the wall unit at one end. The agent spawns facing the square from the other end of the corridor (Figure 1b first column). There are no rewards in this phase. The agent remains in this phase for 5 seconds. The map uses the Pacman texture set.

  2. The standard distractor phase described above.

  3. The map is a rectangle with the four color squares (the target color and three distractor colors) on one of the longer sides, with a unit gap between each square. The ordering of the four colors is randomly chosen. There is an additional single unit square placed in the middle of the opposite side, in which the agent spawns, facing the color squares. In front of each color square is a groundpad (Figure 1b last two columns). When the agent touches one of these pads, a reward of 10 points is given if it is the pad in front of the target painting and a reward of 1 is given for any other pad. The episode then ends. If the agent does not touch a pad within 5 seconds then no reward is given for this phase and the episode ends. The map uses the Tron texture set.

7.6.2 Active Visual Match

Active Visual Match is the same as Passive Visual Match, except that the map in phase 1 is now larger and the position of the target image in phase 1 is randomized. The phase 1 map consists of two open squares connected by a corridor that joins each square in the middle of one side (Figure 2a first two columns). The agent spawns in the center of one of the two squares, facing the middle of one of the walls adjacent to the wall with the opening to the corridor. The target color square is placed randomly over any one of the wall units on the map.

7.6.3 Key-to-Door

The three phases of Key-to-Door are:

  1. The map is identical to the map in phase 1 of Active Visual Match. The agent spawns in the corner of one of the squares that is furthest from the opening to the corridor, facing into the square but not towards the opening. A key is placed randomly within the map (not at the spawn point) and if the agent touches the key it disappears and a black cue image is shown (Figure 4a first two columns). As in the Visual Match tasks, there are no rewards in this phase, and the phase lasts for 5 seconds. The map uses the Pacman texture set.

  2. The standard distractor phase.

  3. The map is a corridor with a locked door in the middle of the corridor. The agent spawns at one end of the corridor, facing the door. At the end of the corridor on the other side of the door is a goal object (Figure 4a fourth column). If the agent touched the key in phase one, the door can be opened by walking into it, and then if the agent walks into the goal object a reward of 10 points is given. Otherwise, no reward is given. The map uses the Tron texture set.

7.6.4 Key-to-Door-to-Match

This task combines elements of Key-to-Door with Active Visual Match. One target color and three distractor colors are chosen in the same way as for the Visual Match tasks. In contrast to the standard task setup, there are five phases per episode:

  1. This phase is the same as phase 1 of Key-to-Door but with a different map. The map is an open rectangle with an additional square attached at one corner, with the opening on the longer of the two walls. The agent spawns in the additional square, facing into the rectangle (Figure 5a first column). The map uses the Minesweeper texture set.

  2. The standard distractor phase except that the phase lasts for only 15 seconds instead of 30 seconds.

  3. The map is the same as in phase 3 of Key-to-Door. Instead of a goal object behind the locked door, the target color square covers the wall at the far end of the corridor (Figure 5a third column). There is no reward in this phase, and it lasts for 5 seconds. The map uses the Pacman texture set.

  4. The standard distractor phase except that the phase lasts for only 15 seconds instead of 30 seconds.

  5. The final phase is the same as phase 3 in the Visual Match tasks.

7.6.5 Two Negative Keys

The three phases of Two Negative Keys are:

  1. The map is an open rectangle. The agent spawns in the middle of one of the shorter walls, facing into the rectangle. One red key is placed in a corner opposite the agent, and one blue key is placed in the other corner opposite the agent. Which corner has the red key and which the blue key is randomized per episode. If the agent touches either of the keys, a red or blue cue image is shown according to which key the agent touched (Supplementary Figure 10 first three columns). After one key is touched, it disappears, and nothing happens if the agent goes on to touch the remaining key (i.e., no cue is displayed and the key remains in the map). The phase lasts for 5 seconds, and there are no rewards; if the agent does not touch any key during this period, at the end of the phase a black cue image is shown. The map uses the Tron texture set.

  2. The standard distractor phase except with the Tetris texture set.

  3. The layout is the same as in phase 3 of the Key-to-Door task. If the agent has picked up either of the keys then the door will open when touched, and the agent can collect the goal object, at which point it will spawn back into the map from phase 2 but with all remaining apples removed. This phase lasts for only 2 seconds in total; when it ends, a reward of -20 is given if the agent did not collect the goal object; a reward of -10 is given if the agent collected the goal object after touching the blue key; and a reward of -1 is given if the agent collected the goal object after touching the red key. The map uses the Tron texture set.
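The phase 3 reward rule of Two Negative Keys can be written out as a small function. The name and the string encoding of the keys are illustrative choices; the reward values are those stated in the task description.

```python
def two_negative_keys_reward(key_touched, goal_collected):
    """Reward given when phase 3 ends.

    key_touched:    "red", "blue", or None (the first key touched in phase 1)
    goal_collected: whether the agent reached the goal object in phase 3
    """
    if not goal_collected:
        return -20
    if key_touched == "blue":
        return -10
    if key_touched == "red":
        return -1
    # The door only opens for an agent that picked up a key.
    raise ValueError("goal cannot be collected without a key")


best = max(
    two_negative_keys_reward("red", True),
    two_negative_keys_reward("blue", True),
    two_negative_keys_reward(None, False),
)
```

Every outcome is negative, so the task tests whether an agent can learn to prefer the red key in phase 1 purely to reduce a loss that arrives after the distractor phase.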

7.6.6 Latent Information Acquisition

In each episode, three objects are randomly generated using the DM Lab object generation utilities. Color and type of object are randomized. Each object is independently randomly assigned to be a good or a bad object.

  1. The map is a rectangle. The agent spawns in one corner facing outwards along one of the shorter walls. The three objects are positioned randomly among five points as displayed in Figure 6c in the main text (Figure 6a first four columns). If an agent touches one of the good objects, it disappears, and a green cue image is shown. If an agent touches one of the bad objects, it disappears, and a red cue image is shown. This phase lasts for 5 seconds, and there are no rewards. The map uses the Tron texture set. The image cues shown in this phase are only shown for 0.25 seconds so that the cues do not interfere with continuation of the P1 activity (in all other tasks they are shown for 1 second).

  2. The standard distractor phase except with the Tetris texture set.

  3. The map, spawn point, and possible object locations are the same as in phase 1. The objects are the same, but their positions are randomly chosen again. If the agent touches a good object it disappears, and a reward of 20 is given. If the agent touches a bad object it disappears and a reward of -10 is given. This phase lasts for 5 seconds. The map uses the Tron texture set.

7.7 Distractor Phase Modifications

In order to analyze the effect of increasing variance of distractor reward on agent learning, we created variants of the distractor phase where this reward variance could be easily controlled. Since the distractor phase is standardized, any of these modifications can be used in any of those tasks.

7.8 Zero Apple Reward

The reward given for apples in the distractor phase is zero. Even though the apples give zero reward, they still disappear when touched by the agent.

7.9 Fixed Number of Apples

The reward given for apples remains at 5. Instead of the 120 free squares of the map independently spawning an apple with probability 0.3, we fix the number of apples to be 36 (120 × 0.3) and distribute them randomly among the 120 available map units. Under an optimal policy where all apples are collected, this has the same mean reward as the standard distractor phase but with no variance.

7.10 Variable Apple Reward

The reward given for apples in the distractor phase can be modified to a positive integer value $r$, but with probability $1 - 1/r$ each apple independently gives zero reward instead of $r$. Any apple touched by the agent still disappears.

This implies that the optimal policy and the expected return under the optimal policy are constant, but the variance of the returns increases with $r$. Since there are 120 possible positions for apples in the distractor phase, and apples independently appear in each of these positions with probability 0.3, the variance of undiscounted returns in P2, assuming all apples are collected, is

$$120 \left(0.3\, r - 0.3^2\right) = 36\,(r - 0.3).$$
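As a worked check of this variance calculation, the sketch below computes the P2 return variance for several apple rewards. It assumes that a spawned apple pays the reward `apple_reward` with probability `1 / apple_reward` and zero otherwise, which is the choice that keeps the expected return constant; the function and argument names are hypothetical.

```python
def p2_return_variance(apple_reward, num_positions=120, spawn_prob=0.3):
    """Variance of the undiscounted P2 return when all apples are collected.

    ASSUMPTION: each spawned apple pays apple_reward with probability
    1 / apple_reward and zero otherwise, independently of the others.
    """
    pay_prob = spawn_prob / apple_reward        # P(position yields reward)
    mean = pay_prob * apple_reward              # expected reward per position
    second_moment = pay_prob * apple_reward ** 2
    # Positions are independent, so their variances add.
    return num_positions * (second_moment - mean ** 2)


for r in (1, 3, 6, 10):
    print(r, round(p2_return_variance(r), 1))
```

Under this assumption the variances for rewards of 1, 3, 6, and 10 come out as 25.2, 97.2, 205.2, and 349.2. The corresponding standard deviations round to 5, 10, 14, and 19, whose squares are the values 25, 100, 196, and 361 quoted in Section 8.3.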


7.11 Control Tasks

Control tasks are taken from the DM Lab 30 task set [42]. The tasks we include had a memory access component to performance. We provide only brief descriptions here since these tasks are part of the open source release of DM Lab available at https://github.com/deepmind/lab/tree/master/game_scripts/levels/contributed/dmlab30.

7.11.1 Explore Goal Locations Small

This task requires agents to find the goal object as fast as possible. Within episodes, when the goal object is found the agent respawns and the goal appears again in the same location. The goal location, level layout, and theme are randomized per episode. The agent spawn location is randomized per respawn.

7.11.2 Natlab Varying Map Randomized

The agent must collect mushrooms within a naturalistic terrain environment to maximise score. The mushrooms do not regrow. The map is randomly generated and of intermediate size. The topographical variation, and number, position, orientation and sizes of shrubs, cacti and rocks are all randomized. Locations of mushrooms are randomized. The time of day is randomized (day, dawn, night). The spawn location is randomized for each episode.

7.11.3 Psychlab Arbitrary Visuomotor Mapping

This is a task in the Psychlab framework [43] where the agent is shown images from a visual memory capacity experiment dataset [44] but in an experimental protocol known as arbitrary visuomotor mapping. The agent is shown consecutive images that are associated with particular cardinal directions. The agent is rewarded if it can remember the direction to move its fixation cross for each image. The images are drawn from a set of roughly 2,500 possible images, and the specific associations are randomly generated per episode.

7.12 Task Specific Parameters

Across models the same parameters were used for the TVT, RMA, LSTM+Mem, and LSTM agents except for the discount factor, which was held fixed for the TVT model and was varied as expressed in the figure legends for the other models. Learning rate was varied only for the learning rate analysis in Section 8.5.

Across tasks, we used the parameters shown in Table 1 with a few exceptions:

  • For all the control tasks, we used instead of 20.

  • For all the control tasks, we used instead of using the full episode.

  • For the Two Negative Keys task, we used instead of .

Parameter Value
Number of steps in episode
Number of steps in episode
Supplementary Table 1: Parameters used across tasks (not all parameters apply to all models).

8 Task Analyses

8.1 Variance Analysis

For Active Visual Match and Key-to-Door tasks, we performed analysis of the effect of distractor phase reward variance on the performance of the agents. To do this we used the same tasks but with modified distractor phases as described in Section 7.7.

8.2 Active Visual Match

Supplementary Figure 13 shows learning curves for zero apple reward (see Section 7.8) and for variable apple reward (see Section 7.10). When the variable apple reward is set to 1, all apples give reward. Learning for the RMA was already significantly disrupted in the variable reward setting shown, so for Active Visual Match we do not report higher variance examples.

8.3 Key-to-Door

Figure 4c shows learning curves with apple reward set to 1, 3, 6, and 10, which gives variances of total P2 reward of 25, 100, 196, and 361, respectively (see Section 7.10). Note that episode scores for these tasks show that all apples are usually collected in P2 at policy convergence.

Note that the mean distractor phase return in the previous analysis is much less than the mean return in the standard distractor phase. Another way of looking at the effect of variance in the distractor phase whilst including the full mean return is shown in Supplementary Figure 11, which has three curves: one for zero apple reward (see 7.8), one for a fixed number of apples (see 7.9), and one for the full level (which has a variable number of apples per episode but the same expected return as the fixed number of apples case). From the figure, it can be seen that introducing large rewards slows learning in phase 1 due to the variance whilst the agent has to learn the policy to collect all the apples, but that the disruption to learning is much more significant when the number of apples continues to be variable even after the agent has learnt the apple collection policy.

8.4 Return Prediction Saliency

To generate Figure 4e in the main text, a sequence of actions and observations for a single episode of Key-to-Door was recorded from a TVT agent trained on that level. We show two time steps where the key was visible. We calculated gradients of the agent’s value predictions with respect to the input image at each time step. We then computed the sensitivity of the value function prediction to each pixel:

We smoothed these sensitivity estimates using a 2 pixel-wide Gaussian filter:

We then normalized this quantity based on its statistics across time and pixels by computing the 97th percentile:

Input images were then layered over a black image with an alpha channel that increased to 1 based on the sensitivity calculation. Specifically, we used an alpha channel value of:


8.5 Learning Rate Analysis for High Discount Factor

To check that the learning rates used for the high discount RMA and LSTM models were reasonable, we ran the largest variance tasks from Section 8.2 (for RMA with ) and 8.3 (for LSTM with ) for learning rates , , ,