Generalization is a key property of reinforcement learning algorithms with function approximation. It is important for an agent to generalize from previously encountered samples to a larger set of samples that have not been seen. Generalization has been extensively studied in supervised learning, where we normally assume that we can sample iid inputs from a fixed input distribution and the targets are sampled from a fixed conditional distribution.
The assumption of iid inputs, however, does not hold in general. When learning on a correlated stream of data, as in RL, the learner might fit the learned function to recent data and potentially overwrite or forget previously learned information. This issue is called catastrophic interference. Interference occurs even in the iid prediction setting: an update on some set of states is said to interfere with predictions in another state when that update decreases accuracy for that state. This interference is catastrophic if it causes significant forgetting, which is typically only observed with temporally correlated data, such as in RL (bengio2020interference; goodrich2015neuron; liu2019utility) or in the sequential multi-task learning setting (kirkpatrick2017overcoming; riemer2018learning).
The conventional wisdom is that catastrophic interference is particularly problematic in the control setting in RL, even single-task RL, because (a) when an agent explores, it receives a sequence of observations, which are likely to be temporally correlated; (b) the agent is changing its policy while learning, making the sequence of observations non-stationary; and (c) the agent uses its own estimates as targets (as in temporal difference learning), which makes the target outputs non-stationary.
It is as yet difficult to verify this conventional wisdom, as we do not have effective means to measure interference. It is commonly held that replay, target networks and the choice of representation (liu2019utility) all mitigate interference, and so improve performance. But, without a clear definition and way to measure interference in RL, it is hard to test these hypotheses. There has been work quantifying interference for supervised learning (chaudhry2018riemannian; fort2019stiffness; kemker2018measuring; riemer2018learning), with some empirical work even correlating catastrophic forgetting and properties of task sequences in supervised learning (nguyen2019toward). In prediction, however, the definition of interference is relatively straightforward: interference corresponds to decreases in prediction accuracy, which can be measured using a stored test set. This definition, unfortunately, does not extend to the control setting: if we use value function accuracy, then we have a changing performance measure as the policy changes. Several papers have investigated generalization and transfer in RL (cobbe2018quantifying; farebrother2018generalization; packer2018assessing; rajeswaran2017towards), demonstrating that learning on new environments results in drops in performance on previously learned environments (cobbe2018quantifying), or re-initialization can help a plateaued agent make further progress (fedus2020catastrophic). These works, however, do not directly measure levels of interference, and instead focus on test performance on new environments or new segments of environments.
In this paper, we propose a definition of interference for control in RL using an existing performance measure, called the Optimality Residual (OR). The interference is defined as the change in OR, with two statistics to reflect the presence of catastrophic interference. We evaluate our interference measures by computing the correlation with several performance metrics, including sample efficiency and stability. We also use these measures to investigate the role of common deep RL techniques, including target networks, experience replay buffer size, mini-batch size, network size, and interference in different layers. It is difficult, or in some cases impossible, to estimate this exact interference measure. We provide an approximation, by deriving an upper bound on the OR, and demonstrate empirically that the approximation is strongly correlated with the exact interference.
In reinforcement learning (RL), an agent interacts with its environment, receiving observations and selecting actions to maximize a reward signal. We assume the environment can be formalized as a Markov decision process (MDP). An MDP is a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $P(s' \mid s, a)$ is the transition probability, $R(s, a, s')$ is the reward function, and $\gamma \in [0, 1)$ a discount factor. The goal of the agent is to find a policy $\pi$ to maximize the expected discounted sum of rewards.
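Given a fixed policy $\pi$, the action-value function $Q^\pi$ is defined as $Q^\pi(s, a) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s, A_t = a\right]$, where $R_{t+1}$ denotes the reward at time $t$, i.e. $R_{t+1} = R(S_t, A_t, S_{t+1})$, $S_{t+1} \sim P(\cdot \mid S_t, A_t)$, and actions are taken according to policy $\pi$: $A_t \sim \pi(\cdot \mid S_t)$. The optimal value function is defined as $Q^*(s, a) = \max_\pi Q^\pi(s, a)$, with $\pi^*$ the policy that is greedy w.r.t. $Q^*$. The optimal value function can be obtained using the Bellman optimality operator for action values $\mathcal{T}$:
$$(\mathcal{T}Q)(s, a) = \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma \max_{a'} Q(s', a')\right].$$
$Q^*$ is the unique solution of the Bellman equation $\mathcal{T}Q = Q$. Q-learning is built on this operator, iteratively updating $Q$ to find the fixed point of the Bellman optimality operator.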
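We can use neural networks to learn an approximation to the optimal action-value. For $Q_\theta$ the approximation, with parameters $\theta$, the online update for non-linear semi-gradient Q-learning is
$$\theta_{t+1} = \theta_t + \alpha \delta_t \nabla_\theta Q_{\theta_t}(S_t, A_t), \qquad \delta_t = R_{t+1} + \gamma \max_{a'} Q_{\theta_t}(S_{t+1}, a') - Q_{\theta_t}(S_t, A_t),$$
where $\alpha$ is the step-size.
This update with NNs typically leads to unstable performance, so it is often augmented with experience replay (lin1993reinforcement) and target networks, introduced in DQN (mnih2015human). Replay consists of storing transitions in a buffer $D$, and performing mini-batch updates sampled from this buffer, per step. Target networks use an older set of parameters $\theta^-$ for the bootstrap target $\max_{a'} Q_{\theta^-}(S_{t+1}, a')$, to make the update target more stationary.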
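The update described above can be sketched in a few lines. This is a minimal illustration, not the agents used in the experiments: it assumes a linear action-value function (so the gradient is just the feature vector) in place of a neural network, and the names `q_values`, `sample_batch` and `q_learning_update` are hypothetical.

```python
import numpy as np

def q_values(theta, phi):
    # Linear action values (an assumption): theta has shape (n_actions, n_features).
    return theta @ phi

def sample_batch(buffer, batch_size, rng):
    # Experience replay: uniformly sample stored transitions without replacement.
    idx = rng.choice(len(buffer), size=min(batch_size, len(buffer)), replace=False)
    return [buffer[i] for i in idx]

def q_learning_update(theta, theta_target, batch, alpha=0.1, gamma=0.9):
    # One mini-batch semi-gradient Q-learning step. The bootstrap target uses the
    # older parameters theta_target and is treated as a constant.
    step = np.zeros_like(theta)
    for phi, a, r, phi_next, done in batch:
        bootstrap = 0.0 if done else gamma * np.max(q_values(theta_target, phi_next))
        delta = r + bootstrap - q_values(theta, phi)[a]  # TD error
        step[a] += delta * phi  # for linear Q, the gradient w.r.t. theta[a] is phi
    return theta + alpha * step / len(batch)

# Hand-checkable example with a single stored transition.
theta = np.zeros((2, 2))
theta_target = np.array([[1.0, 0.0], [0.0, 2.0]])
buffer = [(np.array([1.0, 0.0]), 0, 1.0, np.array([0.0, 1.0]), False)]
batch = sample_batch(buffer, batch_size=1, rng=np.random.default_rng(0))
theta_new = q_learning_update(theta, theta_target, batch)
print(theta_new)
```

In this example the target is $1 + 0.9 \cdot 2 = 2.8$, so only the weight for action 0 on the active feature moves, by $0.1 \cdot 2.8$. Setting `theta_target = theta` before each update would recover the update without a target network.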
3 A Simple Example Relating Interference and Control Performance
Before discussing our definition and measure of interference, it is useful to use a controlled setting to illustrate how algorithmic choices impact interference and performance. For example, we expect agents with poor representations to suffer from more interference. If we have a very good hand-designed, sparse representation—such as tile-coding—we expect much less interference than a neural network representation that generalizes aggressively. We use three such agents for demonstration: Q-learning with tile-coding, DQN with the Adam optimizer and DQN with the RMSprop optimizer.
The controlled environment, called Two-Rooms, consists of two open rooms with different start and goal states. The trick is that in the first room the agent should navigate up and to the right, and in the second room down and to the left. The input state contains the xy position of the agent, and which room the agent is in. The tile coding agent represents each room independently, whereas DQN is free to generalize across rooms. The agent begins life in one room and trains just long enough (10k steps) to learn a near optimal policy. Then the agent is placed in the second room and trained much longer (70k steps) than required—to the point that over specialization is possible. Finally, the agent is placed back to learn in the first room, to evaluate the impact of extended training in the second room.
In Figure 1, we show online learning curves and the corresponding interference (defined in Section 4.3) in each room separately. Generally, we can see that when the agent is learning, there is interference; the key issue is whether learning in one room interferes with the other. The tile-coding representation, with no features shared between rooms, has no interference in one room while training in the other. The performance of the DQN agents drops when transferring from room 2 back to room 1. The interference is catastrophic: the agent using RMSProp does not recover the optimal policy, and the agent using Adam learns more slowly than starting from scratch.
Figure 1: The Two-Rooms example. We plot the learning curve of Q-learning with different architecture choices. The three stages are indicated by the two vertical lines. ETI is a measure of interference, which is defined in Section 4.3. The curves are averaged over 10 runs with one standard error.
4 Measuring Interference in RL
In this section, we define interference for control in RL. We start by discussing the definition of interference in RL for the prediction setting, where we learn $Q^\pi$; we do this for clarity and to provide a contrast to the control setting. We highlight that to define whether an update causes interference requires an answer to the question: interference according to what objective? We propose a natural choice for control: the distance to the optimal action-value function. We discuss two ways to summarize interference over time, to gauge whether an agent has high or low interference.
4.1 Interference in Prediction
In the prediction setting, we estimate $Q^\pi$ for a fixed $\pi$. A typical measure of prediction error is the mean-squared value error (MSVE), with state-action weighting $d$:
$$\text{MSVE}(\theta) = \sum_{s,a} d(s, a)\left(Q^\pi(s, a) - Q_\theta(s, a)\right)^2.$$
To quantify expected interference, we can look at the difference in MSVE before and after an update: $\text{MSVE}(\theta_{t+1}) - \text{MSVE}(\theta_t)$. If this value is positive, the update generally degraded performance and there was more interference on average than positive generalization. If this value is negative, the update generally improved performance and there was more positive generalization than interference.
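There are existing interference measures based on gradient similarity that could be used for the prediction setting. To see why, assume we can directly minimize the MSVE and so have loss $L_t(\theta) = \left(Q^\pi(S_t, A_t) - Q_\theta(S_t, A_t)\right)^2$. If we perform an update using
$$\theta_{t+1} = \theta_t - \alpha \nabla_\theta L_t(\theta_t),$$
then the interference of that update to one point $(s, a)$ is $L_{s,a}(\theta_{t+1}) - L_{s,a}(\theta_t)$, where $L_{s,a}(\theta) = \left(Q^\pi(s, a) - Q_\theta(s, a)\right)^2$. Using a Taylor series expansion, we get the following approximation assuming we have a small step-size $\alpha$:
$$L_{s,a}(\theta_{t+1}) - L_{s,a}(\theta_t) \approx \nabla_\theta L_{s,a}(\theta_t)^\top (\theta_{t+1} - \theta_t) = -\alpha \nabla_\theta L_{s,a}(\theta_t)^\top \nabla_\theta L_t(\theta_t).$$
This approximation corresponds to gradient alignment, which has been used to learn neural networks that are more robust to interference (lopez2017gradient; riemer2018learning). They measure if $\nabla_\theta L_{s,a}(\theta_t)^\top \nabla_\theta L_t(\theta_t) > 0$, to determine if there is positive generalization between two samples; they generally encourage these dot-products to be positive. Other work used gradient cosine similarity, to measure the level of transferability between tasks (du2018adapting), and to measure the level of interference between objectives (schaul2019ray). A somewhat similar measure was used to measure generalization in reinforcement learning (achiam2019towards; bengio2020interference), using the dot product of the gradients of Q functions $\nabla_\theta Q_\theta(s, a)^\top \nabla_\theta Q_\theta(S_t, A_t)$. This is related in the sense that, for the MSVE with $\delta_{s,a} = Q^\pi(s, a) - Q_\theta(s, a)$ and $\delta_t = Q^\pi(S_t, A_t) - Q_\theta(S_t, A_t)$, $\nabla_\theta L_{s,a}(\theta)^\top \nabla_\theta L_t(\theta) = 4 \delta_{s,a} \delta_t \nabla_\theta Q_\theta(s, a)^\top \nabla_\theta Q_\theta(S_t, A_t)$. This measure neglects the direction of the gradients, and so measures both positive generalization as well as interference.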
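The relationship between the exact change in loss at a probed point and the gradient dot product can be checked numerically. The sketch below assumes a linear value function and a squared-error loss per sample; the feature vectors, targets and step-size are hypothetical values chosen so the arithmetic is easy to follow.

```python
import numpy as np

def loss(theta, phi, target):
    # Squared error for one sample under a linear value estimate theta . phi.
    return (target - theta @ phi) ** 2

def loss_grad(theta, phi, target):
    # Gradient of the squared error with respect to theta.
    return -2.0 * (target - theta @ phi) * phi

theta = np.array([0.5, -0.2])
alpha = 1e-3

# Sample used in the update, and a different sample probed for interference.
phi_t, target_t = np.array([1.0, 0.0]), 1.0
phi_sa, target_sa = np.array([0.8, 0.6]), 0.0

theta_next = theta - alpha * loss_grad(theta, phi_t, target_t)

# Exact change in loss at the probed sample (positive means interference).
exact = loss(theta_next, phi_sa, target_sa) - loss(theta, phi_sa, target_sa)

# First-order approximation: minus step-size times the gradient dot product.
alignment = loss_grad(theta, phi_sa, target_sa) @ loss_grad(theta, phi_t, target_t)
approx = -alpha * alignment

print(exact, approx, alignment)
```

Here the two gradients point in opposing directions (negative dot product), so the update on one sample raises the loss on the other, and the first-order term matches the exact change up to a term of order $\alpha^2$.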
In all the above, interference is measured relative to a chosen performance objective. This performance objective could even be different than the objective directly optimized by the agent. For example, the agent could optimize the MSPBE, as is done by TD-learning, and performance measured with MSVE. We could also have chosen to define the interference using the MSPBE as the performance objective. This is all to say that defining interference is relative to many givens: we need to clearly specify our performance objective, the update for the weights and what samples are used in that update. The same nuance arises in the control setting, which we discuss next.
4.2 Interference in Control
Given a value estimate $Q_\theta$, let $\pi_\theta$ be the policy with respect to the current estimate $Q_\theta$. For example, $\pi_\theta$ can be the greedy policy w.r.t. $Q_\theta$. A previously proposed measure (farahmand2011regularization; williams1993tight) for the quality of a policy is the distance between the action-values for that policy and the optimal action-values
$$\begin{aligned} \text{OR}(\theta) &= \sum_{s,a} d(s, a) \left| Q^*(s, a) - Q^{\pi_\theta}(s, a) \right| \\ &= \sum_{s,a} d(s, a) \left( Q^*(s, a) - Q^{\pi_\theta}(s, a) \right). \end{aligned}$$
We call this the Optimality Residual (OR). The distribution $d$ specifies the importance of a state-action pair in the OR. Often, it corresponds to the sampling distribution. For example, $d(s, a) = d_0(s) \pi_b(a \mid s)$, where $d_0$ is a start-state distribution and $\pi_b$ is a behavior policy. Notice that the absolute value is not included in the second line, because $Q^*(s, a) \ge Q^\pi(s, a)$ for all policies $\pi$. This objective is one appropriate choice, because the target $Q^*$ does not change as the policy changes.
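Once we have this objective, the definition for expected interference (EI) parallels the prediction setting
$$\text{EI}(\theta_t, B_t) = \text{OR}(\theta_{t+1}) - \text{OR}(\theta_t),$$
where $B_t$ is the mini-batch of data used to update $\theta_t$ and $\theta_{t+1}$ is the result of that update.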
When running experiments in reinforcement learning, where we have a simulator, it is in fact possible to estimate this quantity. One of the primary motivations for measuring interference is to facilitate investigation by researchers. The OR can be estimated simply by using rollouts from a given $(s, a)$. The policy $\pi_\theta$ can be started from $(s, a)$ multiple times, generating multiple trajectories. These can be used to get a sample average estimate of the expected return from $(s, a)$ under $\pi_\theta$. This can then be repeated for the updated policy $\pi_{\theta_{t+1}}$. The EI is the average change in OR across the sampled $(s, a)$. In general, though, estimating the EI can be very expensive, because a large number of rollouts may be needed to get accurate estimates (sajed2018high). In RL experiments without simulators, it is generally not feasible. In Section 6, we discuss a more practical approach to approximate the EI. First, though, we validate the utility of this true EI.
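A rollout-based estimate of the EI can be sketched as follows. Because the optimal action-values do not depend on the parameters, they cancel when the OR after an update is subtracted from the OR before it, so only the returns of the two policies are needed. The code is an illustration under stated assumptions: the three-state chain environment, the two fixed policies, and the uniform weighting over the sampled state-action pairs are all hypothetical stand-ins.

```python
import numpy as np

def chain_step(s, a, rng):
    # Hypothetical 3-state chain: action 1 moves right, action 0 stays put.
    # Reaching state 2 gives reward 1 and ends the episode.
    s_next = s + 1 if a == 1 else s
    done = s_next == 2
    return s_next, (1.0 if done else 0.0), done

def rollout_return(env_step, policy, s, a, gamma, horizon, rng):
    # Discounted return from taking a in s and following policy afterwards.
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        s, r, done = env_step(s, a, rng)
        g += discount * r
        discount *= gamma
        if done:
            break
        a = policy(s)
    return g

def estimate_q(env_step, policy, pairs, gamma, n_rollouts, horizon, rng):
    # Sample-average return for each (s, a) pair.
    return np.array([
        np.mean([rollout_return(env_step, policy, s, a, gamma, horizon, rng)
                 for _ in range(n_rollouts)])
        for s, a in pairs
    ])

def estimate_ei(env_step, policy_before, policy_after, pairs,
                gamma=0.9, n_rollouts=20, horizon=50, seed=0):
    rng = np.random.default_rng(seed)
    q_before = estimate_q(env_step, policy_before, pairs, gamma, n_rollouts, horizon, rng)
    q_after = estimate_q(env_step, policy_after, pairs, gamma, n_rollouts, horizon, rng)
    # OR(after) - OR(before) reduces to the mean of (q_before - q_after).
    return float(np.mean(q_before - q_after))

pairs = [(0, 1), (1, 1)]
go_right = lambda s: 1
stay = lambda s: 0
ei_worse = estimate_ei(chain_step, go_right, stay, pairs)   # update hurt the policy
ei_better = estimate_ei(chain_step, stay, go_right, pairs)  # update helped the policy
print(ei_worse, ei_better)
```

A positive value indicates the update moved the policy away from optimal on the sampled pairs, and a negative value indicates improvement. In a stochastic environment the number of rollouts would need to be much larger for the estimate to be reliable.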
4.3 Summarizing Interference over Time
To determine the impact of interference on agent performance, we need to provide summary statistics of interference over time. The above are instantaneous interference measures, which can tell us how much interference occurred after an update. However, this interference might have long-range impacts, and so performance changes on this step might be impacted by interference many steps ago.
A simple choice is to use an average EI over the last window of time. Unfortunately, this choice is problematic because the EI is signed. A negative EI actually indicates improvement—good generalization. An agent could oscillate between positive and negative EIs, with the average appearing to be near zero. The mean of skewed, potentially multi-modal distributions is not a particularly suitable choice, and we can consider other statistics.
To be more systematic about the choice, let $X$ be the random variable corresponding to EI over the desired window of time. For example, if the agent has been learning for 1000 steps, and the desired window of time is all learning, then $X$ is a scalar RV with a density over the possible instantaneous EIs over this window of 1000 steps. The empirical distribution is the 1000 values of EI.
We consider two statistics, one to measure if the agent had large interference values and the other if interference was highly variable. Catastrophic interference may occur even with only a few steps of very large interference; when reported as an average over time, these large values might be dominated by many small ones. Instead, we can look at the average of the top 10% of interference values (the largest interference) over the window of time. If it is large, then at least 10% of the time the agent had large interference. This type of measure has been used to measure risk, and termed Conditional Value at Risk or sometimes Expected Tail Loss. Correspondingly, we call this the Expected Tail Interference (ETI), defined as
$$\text{ETI}_\beta(X) = \mathbb{E}\left[X \mid X \ge \text{Percentile}_{1-\beta}(X)\right],$$
where $\text{Percentile}_{1-\beta}(X)$ is the $(1-\beta)$-percentile of the distribution of $X$. In our experiments, we set $\beta = 0.1$.
Finally, we can also provide a more accurate measure of variance by considering the interquartile range: the difference between the 75th and 25th percentiles. We call this the Interference Dispersion (ID)
$$\text{ID}(X) = \text{Percentile}_{0.75}(X) - \text{Percentile}_{0.25}(X).$$
Previous work (chan2020measuring) has also considered using conditional value at risk and interquartile range to measure the reliability of reinforcement learning algorithms.
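Both summary statistics are simple to compute from the recorded per-step EI values. The following is a minimal sketch using NumPy's default linear-interpolation percentiles; the function names and the `tail_fraction` parameter are hypothetical, and the interpolation rule is an assumption since the text does not specify one.

```python
import numpy as np

def expected_tail_interference(ei_values, tail_fraction=0.1):
    # Mean of the EI values at or above the (1 - tail_fraction) percentile.
    x = np.asarray(ei_values, dtype=float)
    cutoff = np.percentile(x, 100.0 * (1.0 - tail_fraction))
    return float(x[x >= cutoff].mean())

def interference_dispersion(ei_values):
    # Interquartile range of the EI values.
    q75, q25 = np.percentile(np.asarray(ei_values, dtype=float), [75, 25])
    return float(q75 - q25)

# EI values 1, 2, ..., 100: the top 10% are 91..100.
ei = np.arange(1, 101)
eti = expected_tail_interference(ei)
dispersion = interference_dispersion(ei)

# A window whose mean EI is zero but which contains one large spike.
spiky = np.array([-1.0] * 9 + [9.0])
spiky_mean = float(spiky.mean())
spiky_eti = expected_tail_interference(spiky)
print(eti, dispersion, spiky_mean, spiky_eti)
```

The second example illustrates why the mean is a poor summary: the average EI is zero even though one update caused a large degradation, which the tail statistic picks up.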
5 Empirical Evaluation: Correlation between Interference and Performance
In this section, we evaluate the utility of the interference measures by computing the correlation with several performance measures, including efficiency, stability and episodic return. The goal is both to validate the utility of these measures of interference (they would not be useful if uncorrelated with performance) and to investigate the impact of common deep RL techniques on interference and control performance.
Environments We use Two-Rooms, designed to induce interference across the rooms, and Cart-pole, in which interference has previously been shown to be problematic (goodrich2015neuron). Two-Rooms is designed so that the agent has sufficient information to learn optimal policies for each room, but the overlap in inputs for the two rooms is likely to cause interference for standard neural network architectures. Cart-pole involves balancing a pole (barto1983neuronlike). Though a simple environment, deep RL agents fail in this domain, or learn unstable policies, as we show in our experiments, and so it provides a useful setting to understand the role of interference on performance. The agent is run for a maximum number of steps: 90k for Two-Rooms and 20k for Cart-pole. We run for a fixed number of steps, rather than episodes, because otherwise some agents get more environment interactions if they have long episodes. All experiments are averaged over 10 runs.
We investigate well-known deep RL techniques for improving learning, including experience replay, mini-batch updating, Adam optimization (particularly the addition of momentum), and target networks. We consider networks of two hidden layers, with various numbers of nodes in each layer, batch sizes, buffer sizes and target network update delays. The set of values for each hyperparameter and other experiment details are in Appendix B.
Performance Metrics We consider four performance measures: average episodic return (AER), consecutive stable performance, stable AER and sample efficiency. The AER reflects accumulated reward by the agent, across all steps of learning. It is computed as follows. For each step $t$ during learning, the agent has an associated expected return $G_t$: how much reward it currently gets within an episode, in expectation. This can be estimated using multiple runs, or using a recent window of returns, to get estimate $\hat{G}_t$. The AER is the average of these across the last 50% of steps: $\text{AER} = \frac{2}{T} \sum_{t=T/2+1}^{T} \hat{G}_t$, where $T$ is the total number of steps. The AER reflects the agent's performance, on average, across the second half of its lifespan. We use the second half to gauge performance, because we are interested in assessing the impact of interference in what the agent has learned.
The AER can be measured using online or offline return. An offline $\hat{G}_t$ is an estimate of the expected return for the policy at time step $t$, measured by averaging over Monte Carlo rollouts. It asks how well the agent would perform if it freezes its policy, and no longer performs updates. An online $\hat{G}_t$ is an average over the most recent episodic returns obtained by the agent online, computed using an exponential average with weighting 0.1.
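The online variant of the AER can be sketched as below. Several details are assumptions made only for illustration: the weighting 0.1 is applied to the newest episodic return, the estimate is taken to be zero before the first episode finishes, and episode end steps are given in increasing order. The function names are hypothetical.

```python
import numpy as np

def online_return_estimates(episode_returns, episode_end_steps, total_steps, weight=0.1):
    # Per-step exponential average of the episodic returns observed so far.
    estimates = np.zeros(total_steps)
    avg, next_ep = None, 0
    for t in range(total_steps):
        # Fold in every episode that has finished by step t.
        while next_ep < len(episode_end_steps) and episode_end_steps[next_ep] <= t:
            g = episode_returns[next_ep]
            avg = g if avg is None else (1.0 - weight) * avg + weight * g
            next_ep += 1
        estimates[t] = 0.0 if avg is None else avg
    return estimates

def average_episodic_return(estimates):
    # Average of the per-step estimates over the second half of learning.
    half = len(estimates) // 2
    return float(np.mean(estimates[half:]))

# Two episodes: return 10 ending at step 2, return 20 ending at step 6.
estimates = online_return_estimates([10.0, 20.0], [2, 6], total_steps=8)
aer = average_episodic_return(estimates)
print(estimates, aer)
```

For the offline variant, the per-step estimates would instead come from Monte Carlo rollouts of the frozen policy at each evaluation step, and `average_episodic_return` would be applied in the same way.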
We define consecutive stable performance as the maximum number of consecutive steps above a performance threshold (60 step for Two-Rooms, 200 for Cart-pole), divided by the total number of steps. If that number is 1, the agent’s threshold performance is maximally stable; if it is zero, it is maximally unstable. Sample complexity corresponds to the first step that the agent reaches a performance threshold for consecutive steps (we use ), divided by the total number of steps. Sample efficiency is sample complexity. If the agent has less interference, we expect the agents to learn a good policy faster, though an agent that generalizes aggressively—and has high interference—might have good efficiency, but may not stably remain at this performance. Finally, stable AER is defined as where represents the risk profile of the algorithm designer. If the agent has high AER but is unstable, then it will have lower stable AER under a small risk-tolerance .
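Consecutive stable performance, as described above, is the longest run of steps at or above the threshold divided by the total number of steps. A minimal sketch follows; treating a step exactly at the threshold as stable is an assumption, and the function name is hypothetical.

```python
def consecutive_stable_performance(returns, threshold):
    # Longest run of consecutive steps with return >= threshold, as a fraction
    # of all steps. Returns 1.0 if every step is stable and 0.0 if none is.
    best = run = 0
    for g in returns:
        run = run + 1 if g >= threshold else 0
        best = max(best, run)
    return best / len(returns)

per_step_returns = [0, 5, 5, 0, 5, 5, 5, 0]
stability = consecutive_stable_performance(per_step_returns, threshold=5)
print(stability)  # longest stable run is 3 of 8 steps
```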
We measure Kendall’s Rank-Correlation Coefficient, as in (jiang2019fantastic), which reflects if two different measures rank agents similarly. It is agnostic to magnitude or precise numbers: if the interference and performance measure both say agent 1 is better than agent 2, then they are reporting similar outcomes. See Appendix B.2 for the formula.
Results We show the correlation coefficients between the two interference measures, ETI and Interference Dispersion, and the above four performance measures, in Figure 2. We expect negative correlations, since high interference should correspond to low performance. The overall conclusion is that ETI and Interference Dispersion are both negatively correlated with all performance measures, providing some evidence for the validity of these interference measures.
Next, we look at correlations between performance and interference, at a more fine-grained algorithmic level. To do so, we use a scatter plot for each agent, labeled based on the choice of mini-batch size, buffer size and target network update frequency. The y-axis is performance, and the x-axis interference, allowing a visual inspection of correlation between the two as well as general trends for each algorithm choice. We create one scatter plot per environment, per performance measure, and per interference measure; we include only a subset in Figure 3 and the remainder in Appendix C.1. We draw several conclusions. 1) The batch size, buffer size and network size did not seem to have a large impact on either interference or performance; instead, target network frequency was the dominating factor. 2) The target network frequency had opposite effects in the two environments: it increased interference in Cart-pole and reduced it in Two-Rooms. In Two-Rooms, target networks improved stable performance at the cost of reduced efficiency.
Besides optimization, another important component of deep reinforcement learning is the function approximator. Therefore, we conduct an experiment to measure interference within a network, in Appendix C.2. We find that updates on the last layer result in significantly higher interference than updates in the internal layers. The result motivates future research directions to mitigate interference: (1) strategies to mitigate interference in the last layer, and (2) algorithms to learn representations such that updating the last layer on top of these representations is robust to interference.
6 Approximating the Expected Interference with TD Errors
It can be impractical to compute the EI, and instead we will need to approximate it. One obvious strategy is simply to estimate $Q^{\pi_{\theta_t}}$ from sampled data, and use estimates from a set of sampled states, such as sampled start states. The estimate could be used to initialize the estimate of $Q^{\pi_{\theta_{t+1}}}$, so that fewer updates are needed, as likely $Q^{\pi_{\theta_t}}$ and $Q^{\pi_{\theta_{t+1}}}$ are not too different. Unfortunately, such a simple strategy, and ideas related to directly estimating this difference, perform poorly (see Appendix D.1). The issue is that approximation of EI with these estimates seems highly sensitive to accuracy, and it is expensive (or impossible if there is insufficient data) to get highly accurate estimates.
Instead, we want a proxy measure that is more likely to maintain the same sign as EI: reflect performance improvements if the agent got better, and performance degradation otherwise. A natural proxy measure is the Bellman error. The Bellman error reflects if the agent has gotten closer to a fixed point; if it reduced between steps, then this suggests the agent is closer to the fixed point and likely that there is a performance improvement. Fortunately, there is quite a lot of theory relating the Bellman error to the OR. We extend previous results, namely Lemma 4.3 and Theorem 5.3 in munos2007performance, to the action-value setting. Though relatively straightforward, modifications were needed to allow for differences in distribution over action selection from start states, particularly in the redefinition of concentration coefficients used below. We first present a lemma that upper bounds the OR in terms of the Bellman error. All proofs are in Appendix A.
Let and be a greedy policy with respect to . Then
where , with
This bound tells us that we can sample the state-action pairs proportionally to to upper bound the OR. Sampling according to , however, is typically infeasible and here again we need some approximation. We can usually only expect to have a sampled set of transitions, under some behavior policy, resulting in states in each transition sampled according to some . We can additionally bound this sampling error, by using concentration coefficients. Assume , where implicitly actions are sample uniformly from . We show the result for any and any policies with non-zero support on all actions in Theorem 1, with the informal result written here with and uniform policies for simplicity.
Theorem 1. [Informal] Let and be probability measures on . For greedy w.r.t.
The concentration coefficient reflects differences in state visitation, starting from versus , defined precisely in Appendix A. We test three practical choices of , with .
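If this approximation is relatively good, then the OR is approximately proportional to the Bellman error. Recall that EI is the change in OR after an update, $\text{OR}(\theta_{t+1}) - \text{OR}(\theta_t)$. Therefore, a potentially reasonable approximation of EI using the Bellman error is the corresponding change in Bellman error between $\theta_t$ and $\theta_{t+1}$. Even this approximation remains difficult to sample, due to the double sampling problem for Bellman error. Fortunately, we only need to approximate the difference rather than each term. This can be reasonably well approximated using differences in TD error. Let $\delta_\theta(s, a, r, s') = r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a)$. By the bias-variance decomposition (antos2008learning), we can show that
$$\mathbb{E}\left[\delta_\theta(s, a, r, s')^2 \mid s, a\right] = \left((\mathcal{T}Q_\theta)(s, a) - Q_\theta(s, a)\right)^2 + \text{Var}\left(r + \gamma \max_{a'} Q_\theta(s', a') \mid s, a\right).$$
The first term is the desired Bellman error, and the second term the variance of the targets. If the environment is deterministic, then this variance is zero. More generally, the Approximate EI (AEI), using TD errors, satisfies
$$\mathbb{E}\left[\delta_{\theta_{t+1}}^2 - \delta_{\theta_t}^2\right] = \mathbb{E}\left[\left(\mathcal{T}Q_{\theta_{t+1}} - Q_{\theta_{t+1}}\right)^2 - \left(\mathcal{T}Q_{\theta_t} - Q_{\theta_t}\right)^2\right] + \mathbb{E}\left[\text{Var}\left(r + \gamma \max_{a'} Q_{\theta_{t+1}}(s', a') \mid s, a\right) - \text{Var}\left(r + \gamma \max_{a'} Q_{\theta_t}(s', a') \mid s, a\right)\right].$$
The second expectation is likely to be small, because the two parameters likely have similar variances.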
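The bias-variance decomposition of the squared TD error can be verified numerically for a single state-action pair. The sketch below uses a hypothetical transition with two possible next states; the probabilities, rewards and next-state values are made-up numbers chosen for easy checking.

```python
import numpy as np

# One (s, a) pair with two possible next states (hypothetical numbers).
probs = np.array([0.3, 0.7])         # P(s' | s, a)
rewards = np.array([1.0, 0.0])       # reward for each outcome
next_max_q = np.array([2.0, 5.0])    # max over a' of Q(s', a') for each s'
gamma, q_sa = 0.9, 3.0               # discount and current estimate Q(s, a)

targets = rewards + gamma * next_max_q                 # bootstrap targets
expected_sq_td = float(np.sum(probs * (targets - q_sa) ** 2))

expected_target = float(np.sum(probs * targets))       # (T Q)(s, a)
bellman_error = expected_target - q_sa
target_variance = float(np.sum(probs * (targets - expected_target) ** 2))

print(expected_sq_td, bellman_error ** 2 + target_variance)
```

Both printed values agree, which is the identity stated above: the expected squared TD error equals the squared Bellman error plus the variance of the targets. With a deterministic transition (a single outcome with probability 1) the variance term vanishes.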
6.1 Choosing a Measure to Approximate the Expected Interference
The quality of the approximation depends heavily on the sampling distribution. Ideally, we want a measure such that the concentration coefficient is small, though this is difficult to ascertain. We only have a stream of observations of the agent interacting with the environment, and further can likely only keep a subset of those in a buffer. Sampling from such a buffer is implicitly sampling from a measure, where the data acts like a non-parametric sampling distribution. We can consider multiple strategies for adjusting this sampling distribution, both by choosing what to store in the buffer and by re-weighting samples obtained from the buffer, similarly to importance sampling.
We consider three practical choices. The first, which we call buffer, involves simply sampling from the most recent transitions. The AEI is then approximated by averaging the differences in TD errors from uniformly sampled transitions from this buffer. The second strategy, which we call reservoir, approximates uniform sampling from all the past transitions, by maintaining a reservoir buffer. The third strategy, which we call discounted, involves re-weighting transitions in the reservoir buffer. To approximately sample from the discounted future state distribution, we re-weight each transition by $\gamma^t$, where $t$ is the number of steps in that episode. We use re-weighting instead of sampling since we would like the measure to have smaller variance.
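The reservoir and discounted strategies can be sketched as follows. This is an illustration under assumptions rather than the implementation used in the experiments: the reservoir uses the standard Algorithm R, the AEI is computed from differences in squared TD errors, and the discounted weights are normalized to sum to one. The class and function names are hypothetical.

```python
import numpy as np

class ReservoirBuffer:
    # Keeps a uniform random subset of all items seen so far (Algorithm R).
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = np.random.default_rng(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            j = int(self.rng.integers(self.seen))  # uniform in [0, seen)
            if j < self.capacity:
                self.items[j] = item

def approximate_ei(sq_td_before, sq_td_after, steps_in_episode=None, gamma=0.9):
    # Average change in squared TD error over sampled transitions. If episode
    # step counts are given, each transition is weighted by gamma ** t.
    diff = np.asarray(sq_td_after, dtype=float) - np.asarray(sq_td_before, dtype=float)
    if steps_in_episode is None:
        return float(diff.mean())
    w = gamma ** np.asarray(steps_in_episode, dtype=float)
    return float(np.sum(w * diff) / np.sum(w))

buf = ReservoirBuffer(capacity=3)
for transition_id in range(10):
    buf.add(transition_id)

uniform_aei = approximate_ei([1.0, 1.0], [2.0, 4.0])
discounted_aei = approximate_ei([1.0, 1.0], [2.0, 4.0], steps_in_episode=[0, 1], gamma=0.5)
print(buf.items, uniform_aei, discounted_aei)
```

The recency-based buffer strategy corresponds to replacing `ReservoirBuffer` with a fixed-length queue of the latest transitions and calling `approximate_ei` without weights.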
6.2 Empirical Correlation between EI and AEI
We empirically demonstrate that the approximations of interference are correlated with true interference in Two-Rooms and Cart-pole. We sample 1000 transitions, which is a relatively small number compared to the state space, and so more reflective of realistic limitations. We measure Pearson correlation in Figure 4, between EI and AEI per step as well as ETI and Approximate ETI, for two agents. We provide the details in Appendix D.
Though there are several approximation steps above, we find that AEI correlates highly with EI, most clearly in Cart-pole but also in Two-Rooms. The sampling strategies are similarly effective, though reservoir sampling seems to be most effective. We also conduct the same experiments for AEI as in Section 5; these are in Appendix D, with similar conclusions, though with slightly reduced correlations to performance measures.
In this paper, we propose a definition of interference for control in RL, and provide a practical approximation using TD errors. We validate the utility of the interference measures by computing the correlation with several performance metrics. Using the proposed measures, we provide some insights into interference in deep reinforcement learning algorithms. We highlighted the role of the target network, which we found significantly increased interference and decreased performance in a setting where it was not needed. In another setting, however, the lack of a target network resulted in fast but unstable learning, and we found the opposite conclusion. In both cases, the correlation to interference was clear, for both the true and approximate measures.
This is one of the first papers specifically attempting to define interference for control, and naturally has limitations. One important next step is to expand the set of environments and agents. In this first small-scale study, we developed a methodology for such experiments, which can be leveraged to extend to new settings. Another important step is to further explore approximations to the true interference, as well as to find clearer theoretical reasons why the change in TD errors performs so well as a proxy. Finally, this paper focuses on deterministic, greedy policies with learned action-values. There is some evidence that a mixture of policies might be more robust to interference (kakade2002approximately; vieillard2019deep). Stochastic policies naturally fit in our definition of EI, but our approximation may not be as suitable.
This work focuses on characterizing and understanding an RL agent’s behavior. It is unlikely to have a direct impact on society although it may guide future research with such an impact. For example, future research following this work may involve developing stable and practical RL algorithms applied to real world problems.
Appendix A Proofs and Technical Details
where and all other components are zeros. This notation of , first introduced in wang2007dual, is convenient to use since gives the state to state transition and gives the state-action to state-action transition.
Given an action-value function , we define the Bellman operator w.r.t. a policy by.
where is the expected immediate reward from state after taking action . Let denote the greedy policy w.r.t. , the Bellman optimality operator is defined by
Since is the greedy policy, we can show that, for any policy ,
Here denotes the component-wise inequality. Moreover, it is known that the is the fixed point of the operator, that is,
Let and be a greedy policy with respect to . Then
where , with a stochastic matrix.
Proof of Lemma 1.
Using the fact that , and , we can show
Note that is invertible, so we have
Moreover, we can derive a component-wise equality between the Bellman residual and :
Let be a policy such that has full support over the action space for all states, be a sequence of policies, and and be two measures on . For any integer , we define
Let and if is not absolutely continuous w.r.t. . We define the discounted future state distribution concentration coefficients as
In practice, we could choose the behavior policy as a uniform random policy or an $\epsilon$-greedy policy.
Let , be a greedy policy with respect to , be an uniform policy and be a behavior policy. Let and be two probability measures on , and . Then,
Proof of Theorem 1.
We can write
where and is a stochastic matrix. Then,
The second inequality follows from Jensen’s inequality. The third inequality follows from . ∎
Appendix B Experimental Details
B.1 Experiment set-up
We experiment with two environments: (1) Two-Rooms: we set the maximum steps per episode to 200, and the number of training steps to 90k, and (2) Cart-pole from OpenAI gym (https://gym.openai.com/): We set the maximum steps per episode to 500, and the number of training steps to 20k. We use a discounting factor in both environments.
The environment Two-Rooms consists of two rooms with different start and goal states. In the first room the agent should navigate up and to the right, and in the second room down and to the left. The input state contains the xy position of the agent, which is in for both rooms, and which room the agent is in, which is in .
For all experiments, we use a two-layer neural network with ReLU activation, and use He initialization to initialize the neural networks.
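A minimal sketch of such a network in plain Python follows. It is an illustration under stated assumptions, not the authors' implementation: "two-layer" is read here as two hidden ReLU layers followed by a linear output, He initialization is taken to be the zero-mean normal variant with standard deviation sqrt(2 / fan_in), and the class name `QNetwork` and all layer sizes are hypothetical.

```python
import math
import random


def he_normal(fan_in, fan_out, rng):
    """He (normal) initialization: zero-mean Gaussian with std sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def linear(W, b, x):
    """Affine map W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, z) for z in v]


class QNetwork:
    """Two hidden ReLU layers and a linear output with one value per action (assumed layout)."""

    def __init__(self, n_inputs, n_hidden, n_actions, seed=0):
        rng = random.Random(seed)
        self.W1, self.b1 = he_normal(n_inputs, n_hidden, rng), [0.0] * n_hidden
        self.W2, self.b2 = he_normal(n_hidden, n_hidden, rng), [0.0] * n_hidden
        self.W3, self.b3 = he_normal(n_hidden, n_actions, rng), [0.0] * n_actions

    def forward(self, x):
        h = relu(linear(self.W1, self.b1, x))
        h = relu(linear(self.W2, self.b2, h))
        return linear(self.W3, self.b3, h)  # no activation on the last layer
```

For example, `QNetwork(3, 8, 2).forward([0.1, 0.2, 1.0])` returns a list of two action values.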
For the experiments in Section 5, we generate a set of hyper-parameter configurations by choosing each parameter from the following sets:
Target network update frequency where zero means no target network is used
For tile coding, we use 4 tiles and 16 tilings with a constant step size. The step size is searched in the set and selected by the best online AER. For SR-NN in Appendix C.2, we fix and use a grid search for the key parameter: . For Section 6.2 and Appendix D.1, we choose a standard neural network with a hidden size of 128, a batch size of 64, a buffer size of 1000 and no target network.
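To make the tile-coding configuration concrete, here is a small sketch of a tile coder over a two-dimensional input. The details are assumptions for illustration only: the input is taken to lie in the unit square, "4 tiles" is read as 4 tiles per dimension, each tiling is shifted diagonally by an equal fraction of a tile width, and the function name `active_tiles` is hypothetical.

```python
def active_tiles(x, y, n_tiles=4, n_tilings=16):
    """Return one active tile index per tiling for a point (x, y) in [0, 1]^2."""
    width = 1.0 / n_tiles
    side = n_tiles + 1  # one extra cell per dimension so shifted points stay in range
    indices = []
    for t in range(n_tilings):
        offset = t * width / n_tilings  # each tiling is shifted by a fraction of a tile
        col = min(int((x + offset) / width), n_tiles)
        row = min(int((y + offset) / width), n_tiles)
        indices.append(t * side * side + row * side + col)
    return indices
```

With this layout, exactly `n_tilings` binary features are active for any input, and nearby inputs share most of their active tiles, which is what gives tile coding its local generalization.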
B.2 Kendall’s rank-correlation coefficient
Inspired by [jiang2019fantastic], we use Kendall’s rank-correlation coefficient [kendall1938new] to check the correlation between a performance metric and a statistic of our interference measures. Let be a set of hyperparameters and where is a statistic of our interference measures and is a performance measure corresponding to a hyperparameter configuration . Kendall’s rank coefficient is defined as
The coefficient varies between -1 and 1.
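The computation can be sketched as follows, assuming the common pairwise-sign form of Kendall's coefficient (the average over ordered pairs of distinct configurations of the product of the signs of the two differences). The function and argument names are illustrative and not taken from the paper.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def kendall_tau(stats, perfs):
    """Kendall's rank coefficient between interference statistics and performance values.

    stats[i] and perfs[i] are assumed to belong to the same hyperparameter configuration.
    """
    n = len(stats)
    total = 0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += sign(stats[i] - stats[j]) * sign(perfs[i] - perfs[j])
    return total / (n * (n - 1))
```

A value near -1 means that configurations with higher interference consistently rank lower in performance, while a value near 1 means the two rankings agree.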
Appendix C Additional Experiments of Section 5
C.1 Correlation between interference and performance
We show the scatter plots for Cart-pole in Figures 7 and 8, and for Two-Rooms in Figure 9. In Two-Rooms, we are interested in the performance when the agent has trained on room 2 for a long time. Therefore, we measure interference and performance for the second half of training on room 2.
The results show a relatively consistent correlation between ETI and the performance measures. The notable exception is in Two-Rooms when there are no target networks: these agents have high interference but also high AER, for all three variants of AER. The consecutive stable performance and sample efficiency plots shed some light on why this occurs. Target networks slow learning in this environment, but then maintain stable performance above the threshold for consecutive stable performance. These same agents, though, look worse in terms of AER than the No Target Network agents, which oscillate more but manage to reach higher performance. The plots are skewed by the fact that, with target networks, learning is not quite finished when we start measuring interference in the second half of room 2. Consequently, though the agent is above the threshold of acceptable performance, its performance is still rising. For this reason, the lower 10% of the returns is much lower for some of the agents with target networks than for those without. If we allowed the agents to learn for even longer, the points that drop low on the AER plots would likely move up, and we would see a clear trend from the cluster of points with near-zero interference and high performance to the cluster of points with high interference (those without target networks).
C.2 Measuring interference within a network
In deep reinforcement learning, neural networks are used as the function approximator. We want to understand how much interference is due to the internal layers and how much to the last layer. The last layer typically does not have an activation; hence we can view a value function as a two-part approximation with a representation function and a linear weight , where are the weights in the last layer and is the representation learned by the network with weights , composed of all the hidden layers in the network. The function corresponds to the last layer in the network, with the weights of the network.
To study interference separately within the network, we use stochastic block coordinate descent (SBCD) Q-learning to update and separately:
Representation Learning Network (RLN) updates:
Value Learning Network (VLN) updates:
where is the mini-batch size, and and are learning rates.
We measure interference for RLN and VLN updates separately at every step, and report the ETI for each in Table 1. We observe that VLN has much higher ETI than RLN. This suggests that updates to the last layer result in significantly higher interference than updates to the internal layers, even when we decrease the learning rate for VLN.
We include a baseline SR-NN [liu2019utility], which learns a sparse representation , to see how representation learning can reduce interference in VLN. SR-NN uses distributional regularizers to learn sparse representations in neural networks:
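The block-wise RLN and VLN updates described above can be sketched as below. This is a simplified illustration under explicit assumptions, not the paper's exact update rule: the representation is a single ReLU layer with weights `theta`, the last layer `w` holds one linear weight vector per action, both updates use the semi-gradient of the squared TD error averaged over the mini-batch, and all names are hypothetical.

```python
def features(theta, x):
    """Representation phi_theta(x): one ReLU layer (assumed form)."""
    return [max(0.0, sum(t * xi for t, xi in zip(row, x))) for row in theta]


def q_values(theta, w, x):
    """Linear last layer on top of the representation: one value per action."""
    phi = features(theta, x)
    return [sum(wk * p for wk, p in zip(wa, phi)) for wa in w]


def td_error(theta, w, transition, gamma):
    x, a, r, x_next, done = transition
    target = r if done else r + gamma * max(q_values(theta, w, x_next))
    return target - q_values(theta, w, x)[a]


def rln_update(theta, w, batch, gamma, lr):
    """Update only the representation weights theta; the last layer w stays fixed."""
    grads = [[0.0] * len(row) for row in theta]
    for tr in batch:
        x, a = tr[0], tr[1]
        delta = td_error(theta, w, tr, gamma)
        phi = features(theta, x)
        for k in range(len(theta)):
            if phi[k] > 0.0:  # the ReLU passes gradient only for active units
                for i, xi in enumerate(x):
                    grads[k][i] += delta * w[a][k] * xi
    return [[t + lr * g / len(batch) for t, g in zip(row, grow)]
            for row, grow in zip(theta, grads)]


def vln_update(theta, w, batch, gamma, lr):
    """Update only the last-layer weights w; the representation theta stays fixed."""
    new_w = [list(wa) for wa in w]
    for tr in batch:
        x, a = tr[0], tr[1]
        delta = td_error(theta, w, tr, gamma)
        for k, p in enumerate(features(theta, x)):
            new_w[a][k] += lr * delta * p / len(batch)
    return new_w
```

Keeping the two updates in separate functions is what allows interference to be measured for each block in isolation: the change in TD error can be evaluated immediately after an RLN step and again after a VLN step.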
where is a regularization on the expected activation, i.e., and denote the -th component of . Table 1 shows that SBCD with SR-NN has a lower ETI for VLN.
| Method | ETI for RLN | ETI for VLN | Control Performance |
|---|---|---|---|
| SBCD Q-learning | 5.05 ± 0.27 | 18.23 ± 3.58 | 84.45 ± 0.76 |
| SBCD Q-learning (smaller VLN learning rate) | 3.83 ± 0.26 | 14.05 ± 1.59 | 86.85 ± 0.39 |
| SBCD with SR-NN | 3.48 ± 0.21 | 5.09 ± 0.32 | 89.52 ± 0.41 |
Appendix D Additional Experiments of Section 6
D.1 Empirical comparison of approximation strategies
Besides approximating EI using TD errors, we test two approximation baselines. First, we could in fact directly approximate the change in Bellman error using recent insights on Kernel Bellman Errors [feng2019kernel], though the approximation is still quite expensive to compute. For example, if we use transitions to evaluate TD errors, which requires computation, evaluating the Kernel loss requires computation. Hence, we use only 100 transitions (from a reservoir buffer) to evaluate the approximation. Second, we can estimate from sampled data in the buffer using off-policy policy evaluation (OPE), and directly approximate EI from a set of sampled state-action pairs from . At each evaluation step, we run the off-policy SARSA algorithm for 10 epochs over the data stored in a vanilla replay buffer. We call the first baseline Kernel and the second baseline OPE.
During training, we compute and the true measure every steps. We collect all data points from the second half of training, over runs, and report the Pearson correlation coefficient between AEI and EI. Formally, the Pearson correlation coefficient between two sets of measures and is defined as
We show the results in Figure 5. The results suggest that the change in TD errors has a higher correlation coefficient with EI than the other approximation baselines.
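For reference, the Pearson correlation coefficient used in this comparison can be computed as in the sketch below, which follows the standard textbook definition (covariance divided by the product of the standard deviations). The function and variable names are illustrative only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)
```

Here `xs` would hold the approximate measure and `ys` the true measure collected at the same evaluation steps. The coefficient is undefined when either sequence is constant, in which case this sketch raises a `ZeroDivisionError`.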
D.2 Results using AEI for Deep RL
In this section, we present the same experiments as in Section 5 and Appendix C.2, with the Approximate EI. We can draw similar conclusions, though with slightly reduced correlations to performance measures. Figure 6 shows that Approximate ETI and ID are negatively correlated with several performance measures. Table 2 shows that VLN has higher Approximate ETI than RLN.
| Method | ETI for RLN | ETI for VLN | Performance |
|---|---|---|---|
| SBCD Q-learning | 0.22 ± 0.02 | 1.02 ± 0.29 | 84.45 ± 0.76 |
| SBCD Q-learning (smaller VLN learning rate) | 0.08 ± 0.01 | 0.29 ± 0.04 | 86.85 ± 0.39 |
| SBCD with SR-NN | 0.12 ± 0.01 | 0.08 ± 0.01 | 89.52 ± 0.41 |