During the last several years, Deep Reinforcement Learning (DRL) proved to be a fruitful approach to many artificial intelligence tasks across diverse domains. Breakthrough achievements include reaching human-level performance in such complex games as Go, the multiplayer game Dota 2, and the real-time strategy StarCraft II. The generality of the DRL framework allows its application in both discrete and continuous domains to solve tasks in robotics and simulated environments.
Reinforcement Learning (RL) is usually viewed as a general formalization of the decision-making task and is deeply connected to dynamic programming, optimal control, and game theory.
Yet its problem setting makes almost no assumptions about the world model or its structure, and usually supposes that the environment is given to the agent in the form of a black box. This allows RL to be applied in practically all settings and forces the designed algorithms to be adaptive to many kinds of challenges. The latest RL algorithms are usually reported to be transferable from one task to another with no task-specific changes and little to no hyperparameter tuning.
As the object of interest is a strategy, i.e. a function mapping the agent's observations to possible actions, reinforcement learning is considered to be a subfield of machine learning. But instead of learning from data, as is established in classical supervised and unsupervised learning problems, the agent learns from the experience of interacting with the environment. Being a more "natural" model of learning, this setting causes new challenges peculiar to reinforcement learning alone, such as the necessity of incorporating exploration and the problem of delayed and sparse rewards. The full setup and essential notation are introduced in section 2.
Classical Reinforcement Learning research in the last third of the twentieth century developed an extensive theoretical core for modern algorithms to ground on. Several algorithms have been known ever since and are able to solve small-scale problems when either the environment states can be enumerated (and stored in memory) or the optimal policy can be searched for in the space of linear or quadratic functions of state representation features. Although these restrictions are extremely limiting, the foundations of classical RL theory underlie modern approaches. These theoretical fundamentals are discussed in sections 3.1 and 5.1–5.2.
Combining this framework with Deep Learning was popularized by the Deep Q-Learning algorithm, which was able to play any of 57 Atari console games without tweaking the network architecture or algorithm hyperparameters. This novel approach was extensively researched and significantly improved in the following years. The principles of the value-based direction in deep reinforcement learning are presented in section 3.
One of the key ideas in recent value-based DRL research is the distributional approach. Further extending classical theoretical foundations and coming with practical DRL algorithms, it gave birth to the distributional reinforcement learning paradigm, whose potential is now being actively investigated. Its ideas are described in section 4.
The second main direction of DRL research is policy gradient methods, which attempt to directly optimize the objective function explicitly present in the problem setup. Their application to neural networks involves a series of particular obstacles, which required specialized optimization techniques. Today they represent a competitive and scalable approach in deep reinforcement learning due to their enormous parallelization potential and applicability to continuous domains. Policy gradient methods are discussed in section 5.
Despite the wide range of successes, current state-of-the-art DRL methods still face a number of significant drawbacks. As training of neural networks requires huge amounts of data, DRL demonstrates unsatisfying results in settings where data generation is expensive. Even in cases where interaction is nearly free (e.g. in simulated environments), DRL algorithms tend to require excessive amounts of iterations, which raises their computational and wall-clock time cost. Furthermore, DRL suffers from sensitivity to random initialization and hyperparameters, and its optimization process is known to be uncomfortably unstable. An especially embarrassing consequence of these features turned out to be the low reproducibility of empirical observations across different research groups. In section 6, we attempt to launch state-of-the-art DRL algorithms on several standard testbed environments and discuss the practical nuances of their application.
2 Reinforcement Learning problem setup
2.1 Assumptions of RL setting
Informally, the process of sequential decision-making proceeds as follows. The agent is provided with some initial observation of the environment and is required to choose some action from the given set of possibilities. The environment responds by transitioning to another state and generating a reward signal (a scalar number), which is considered to be a ground-truth estimation of the agent's performance. The process continues repeatedly, with the agent making choices of actions based on observations and the environment responding with next states and reward signals. The agent's only goal is to maximize the cumulative reward.
This description of the learning process model already introduces several key assumptions. Firstly, time is considered to be discrete, as the agent interacts with the environment sequentially. Secondly, it is assumed that the provided environment incorporates some reward function as a supervised indicator of success. This is an embodiment of the reward hypothesis, also referred to as the Reinforcement Learning hypothesis:
(Reward Hypothesis) 
<<All of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward).>>
Exploitation of this hypothesis draws a line between reinforcement learning and the classical machine learning settings of supervised and unsupervised learning. Unlike unsupervised learning, RL assumes supervision which, similar to labels in data for supervised learning, has a stochastic nature and represents a key source of knowledge. At the same time, no data or <<ground-truth answers>> are given to the agent in advance: all knowledge must be extracted through interaction.
For practical applications it is also natural to assume that the agent's observations can be represented by feature vectors, i.e. elements of $\mathbb{R}^d$. The set of possible actions in most practical applications is usually uncomplicated and is either discrete (the number of possible actions is finite) or can be represented as a subset of $\mathbb{R}^m$ (or can be reduced to this case). This set is considered to be the same for all states of the environment without any loss of generality: if the agent chooses an invalid action, the world may remain in the same state with zero or negative reward signal, or stochastically select some valid action for it. RL algorithms are usually restricted to these two cases, but a mix of the two (when the agent is required to choose both discrete and continuous quantities) can also be considered.
The final assumption of RL paradigm is a Markovian property:
Transitions depend solely on the previous state and the last chosen action, and are independent of all earlier interaction history.
Although this assumption may seem overly strong, it actually formalizes the fact that the world modeled by the considered environment obeys some general laws. Given that the agent knows the current state of the world and the laws, it is assumed to be able to predict the consequences of its actions up to the internal stochasticity of these laws. In practice, both the laws and the complete state representation are unavailable to the agent, which limits its forecasting capability.
In the sequel we will work within a setting with one more assumption, that of full observability. This simplification supposes that the agent can observe the complete world state, while in many real-life tasks only a part of it is actually available. This restriction of RL theory can be removed by considering Partially Observable Markov Decision Processes (POMDPs), which basically forces learning algorithms to have some kind of memory mechanism to store previously received observations. Further on we will stick to the fully observable case.
2.2 Environment model
Though the definition of a Markov Decision Process (MDP) varies from source to source, its essential meaning remains the same. The definition below utilizes several simplifications without loss of generality: the reward function is often introduced as stochastic and dependent on the action, i.e. $r(s, a)$, while instead of a fixed $s_0$ a distribution over starting states is given. Both extensions can be taken into account in terms of the presented definition by extending the state space and incorporating all the uncertainty into the transition probability.
A Markov Decision Process (MDP) is a tuple $(\mathcal{S}, \mathcal{A}, \mathbb{P}, r, s_0)$, where:
$\mathcal{S}$ — an arbitrary set, called the state space.
$\mathcal{A}$ — a set, called the action space, either
discrete: $|\mathcal{A}| < \infty$, or
a continuous domain: $\mathcal{A} \subseteq \mathbb{R}^m$.
$\mathbb{P}$ — transition probability $p(s' \mid s, a)$, where $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$.
$r\colon \mathcal{S} \to \mathbb{R}$ — reward function.
$s_0 \in \mathcal{S}$ — starting state.
It is important to notice that in the most general case the only things available to an RL algorithm beforehand are $d$ (the dimension of the state space) and the action space $\mathcal{A}$. The only possible way for the agent to collect more information is to interact with the provided environment and observe $s_0$. The first choice of action $a_0$ will therefore probably be random. While the environment responds by sampling $s_1 \sim p(s' \mid s_0, a_0)$, this distribution, defined in $\mathbb{P}$ and considered to be a part of the MDP, may be unavailable to the agent's learning procedure. What the agent does observe is $s_1$ and the reward signal $r(s_1)$, and this is the key information gathered by the agent from interaction experience.
The tuple $(s_t, a_t, s_{t+1})$ is called a transition. Several sequential transitions are usually referred to as a roll-out. The full track of observed quantities

$$s_0, a_0, s_1, a_1, s_2, a_2, \dots$$

is called a trajectory.
In the general case the trajectory is infinite, which means that the interaction process is never-ending. However, in most practical cases the episodic property holds, which basically means that the interaction will eventually come to some sort of an end (natural examples include the end of the game or the agent's failure/success in completing some task). Formally, it can be simulated by the environment getting stuck in the last state with zero probability of transitioning to any other state and zero reward signal. Then it is convenient to reset the environment back to $s_0$ to initiate a new interaction. One such interaction cycle, from $s_0$ till reset, spawning one trajectory of some finite length $T$, is called an episode. Without loss of generality, it can be considered that there exists a set of terminal states $\mathcal{S}^+ \subset \mathcal{S}$ which mark the ends of interactions. By convention, transitions are accompanied by a binary flag $\operatorname{done} \in \{0, 1\}$ indicating whether $s'$ belongs to $\mathcal{S}^+$. As the timestep at which a transition was gathered is usually of no importance, transitions are often denoted as $(s, a, r', s', \operatorname{done})$, with primes marking the <<next timestep>>.
Note that the length of an episode may vary between different interactions, but the episodic property holds if the interaction is guaranteed to end after some finite time. If this is not the case, the task is called continuing.
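For concreteness, the episodic interaction loop described above can be sketched in Python. Everything here (the toy `ChainEnv`, the Gym-style `reset`/`step` interface, the transition layout) is an illustrative assumption rather than part of the formal setup:

```python
class ChainEnv:
    """Toy episodic environment: states 0..3, the single useful action moves
    right, the episode ends upon reaching the terminal state 3."""
    def reset(self):
        self.s = 0              # back to the starting state s0
        return self.s
    def step(self, a):
        self.s = min(self.s + 1, 3)
        done = (self.s == 3)    # terminal state reached?
        reward = 1.0 if done else 0.0
        return self.s, reward, done

def collect_episode(env, policy, max_steps=100):
    """Run one episode, returning the list of transitions (s, a, r, s', done)."""
    transitions = []
    s = env.reset()
    for _ in range(max_steps):
        a = policy(s)                    # agent chooses an action
        s_next, r, done = env.step(a)    # environment responds
        transitions.append((s, a, r, s_next, done))
        if done:
            break
        s = s_next
    return transitions

episode = collect_episode(ChainEnv(), policy=lambda s: 1)
print(len(episode))  # 3 transitions: 0->1, 1->2, 2->3
```

The `done` flag stored in each transition is exactly the convention described above for marking entries into $\mathcal{S}^+$.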
In reinforcement learning, the agent's goal is to maximize the cumulative reward. In the episodic case, this reward can be expressed as the sum of all reward signals received during one episode and is called the return (denoting $r_{t+1} = r(s_{t+1})$):

$$R = \sum_{t \ge 0} r_{t+1}$$
Note that this quantity is formally a random variable which depends on the agent's choices and the outcomes of environment transitions. As this stochasticity is an inevitable part of the interaction process, the underlying distribution from which $R$ is sampled must be properly introduced to rigorously set the task of return maximization.
The agent's algorithm for choosing an action $a$ given the current state $s$, which in general can be viewed as a distribution $\pi(a \mid s)$ on the domain $\mathcal{A}$, is called a policy (strategy).
A deterministic policy, when the policy is represented by a deterministic function $\pi(s)\colon \mathcal{S} \to \mathcal{A}$, can be viewed as a particular case of a stochastic policy with the degenerate distribution $\pi(a \mid s) = \mathbb{I}[a = \pi(s)]$, when the agent's output is still a distribution, but with zero probability of choosing an action other than $\pi(s)$. In both cases it is considered that the agent sends to the environment a sample $a \sim \pi(a \mid s)$.
Note that given some policy $\pi$ and transition probabilities $p(s' \mid s, a)$, the complete interaction process becomes defined from the probabilistic point of view:
For a given MDP and policy $\pi$, the probability of observing the trajectory

$$\tau = (s_0, a_0, s_1, a_1, s_2, \dots)$$

is called the trajectory distribution and is denoted as $\mathcal{T}_\pi$:

$$\mathcal{T}_\pi(\tau) = \prod_{t \ge 0} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
It is always substantial to keep track of which policy was used to collect particular transitions (roll-outs and episodes) during the learning procedure, as they are essentially samples from the corresponding trajectory distribution. If the policy is modified in any way, the trajectory distribution changes as well.
Now that a policy induces a trajectory distribution, it is possible to formulate the task of expected reward maximization:

$$\mathbb{E}_{\tau \sim \mathcal{T}_\pi} R \to \max_\pi$$
To ensure the finiteness of this expectation and avoid the case when the agent is allowed to gather infinite reward, a limit on the absolute value of the reward can be assumed:

$$|r(s)| \le R_{\max} \quad \text{for all } s \in \mathcal{S}$$

Together with the limit on episode length, this restriction guarantees the finiteness of the optimal (maximal) expected reward.
To extend this intuition to continuing tasks, the reward for each next interaction step is multiplied by a discount coefficient $\gamma \in (0, 1)$, which is often introduced as part of the MDP. This corresponds to the logic that with probability $1 - \gamma$ the agent <<dies>> at each step and does not gain any additional reward, which models the paradigm <<better now than later>>. In practice, this discount factor is set very close to 1.
For a given MDP and policy $\pi$, the discounted expected reward is defined as

$$J(\pi) = \mathbb{E}_{\mathcal{T}_\pi} \sum_{t \ge 0} \gamma^t r_{t+1}$$

The reinforcement learning task is to find an optimal policy $\pi^*$, which maximizes the discounted expected reward:

$$\pi^* = \operatorname*{argmax}_\pi J(\pi) \quad (2)$$
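As a minimal numeric illustration, the discounted return of a single sampled trajectory (one Monte-Carlo sample of the expectation inside $J(\pi)$) can be computed with a backward pass over the rewards; the sketch below assumes the rewards are given as a plain list:

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_{t+1} over one trajectory of rewards."""
    g = 0.0
    for r in reversed(rewards):  # backward pass: g accumulates r + gamma * g
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

The backward pass avoids computing powers of $\gamma$ explicitly and is the standard way returns are accumulated in practice.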
2.4 Value functions
Solving the reinforcement learning task (2) usually leads to a policy that maximizes the expected reward not only for the starting state $s_0$, but for any state $s$. This follows from the Markov property: the reward yet to be collected from some step onwards does not depend on previous history, and for an agent staying at state $s$ the task of behaving optimally is equivalent to maximization of the expected reward with $s$ as a starting state. This is the particular reason why many reinforcement learning algorithms seek not only an optimal policy, but also additional information about the usefulness of each state.
For a given MDP and policy $\pi$, the value function under policy $\pi$ is defined as

$$V^\pi(s) = \mathbb{E}_{\mathcal{T}_\pi \mid s_0 = s} \sum_{t \ge 0} \gamma^t r_{t+1}$$

This value function estimates how good it is for an agent utilizing strategy $\pi$ to visit state $s$, and generalizes the notion of discounted expected reward: $J(\pi) = V^\pi(s_0)$.
As a value function can be induced by any policy, the value function under an optimal policy can also be considered. By convention, it is denoted as $V^*$ and is called the optimal value function. (Though the optimal policy may not be unique, the value functions under any optimal policies that behave optimally from any given state (not only $s_0$) coincide. Yet an optimal policy may not know the optimal behaviour for some states if it knows how to avoid them with probability 1.)
Obtaining the optimal value function alone does not provide enough information to reconstruct some optimal policy due to the unknown world dynamics, i.e. transition probabilities. In other words, being blind to which state may be the environment's response to a certain action in a given state makes knowing the optimal value function unhelpful. This intuition suggests introducing a similar notion comprising more information:
For a given MDP and policy $\pi$, the quality function (Q-function) under policy $\pi$ is defined as

$$Q^\pi(s, a) = \mathbb{E}_{\mathcal{T}_\pi \mid s_0 = s,\, a_0 = a} \sum_{t \ge 0} \gamma^t r_{t+1}$$
It directly follows from the definitions that these two functions are deeply interconnected:

$$V^\pi(s) = \mathbb{E}_{a \sim \pi(a \mid s)} Q^\pi(s, a), \qquad Q^\pi(s, a) = \mathbb{E}_{s' \sim p(s' \mid s, a)} \left[ r(s') + \gamma V^\pi(s') \right] \quad (3)$$
The notion of an optimal Q-function $Q^* = Q^{\pi^*}$ can be introduced analogously. But, unlike the value function, obtaining $Q^*$ actually means solving the reinforcement learning task: indeed,

If $Q^*$ is a quality function under some optimal policy, then

$$\pi(s) = \operatorname*{argmax}_a Q^*(s, a)$$

is an optimal policy.
This result implies that instead of searching for an optimal policy $\pi^*$, an agent can search for an optimal Q-function and derive the policy from it.
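The derivation of a greedy policy from a Q-function is straightforward to express in code. The sketch below assumes the Q-function is stored as a table (a dictionary keyed by state-action pairs, a representation chosen purely for illustration):

```python
def greedy_policy(Q, n_actions):
    """Derive the deterministic policy s -> argmax_a Q[(s, a)] from a Q-table."""
    def policy(s):
        # pick the action with the highest Q-value in state s
        return max(range(n_actions), key=lambda a: Q[(s, a)])
    return policy

# toy Q-table over 2 states and 2 actions
Q = {(0, 0): 1.0, (0, 1): 2.5, (1, 0): 0.3, (1, 1): -1.0}
pi = greedy_policy(Q, n_actions=2)
print(pi(0), pi(1))  # 1 0
```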
For any MDP, the existence of an optimal policy leads to the existence of a deterministic optimal policy.
2.5 Classes of algorithms
Reinforcement learning algorithms are presented in the form of computational procedures specifying a strategy of collecting interaction experience and obtaining a policy with $J(\pi)$ as high as possible. They rarely include a stopping criterion like classic optimization methods do, as the stochasticity of the given setting prevents any reasonable verification of optimality; usually the number of iterations to perform is determined by the amount of computational resources.
All reinforcement learning algorithms can be roughly divided into four classes (in many sources evolutionary algorithms are bypassed in discussion, as they do not utilize the structure of the RL task in any way):
meta-heuristics: this class of algorithms treats the task as black-box optimization with a zeroth-order oracle. They usually generate a set of policies and launch several episodes of interaction for each to determine the best and worst policies according to average return. After that, they try to construct better policies using evolutionary or advanced random search techniques.
policy gradient: these algorithms directly optimize (2), trying to obtain $\pi^*$ using approximate estimations of the gradient with respect to policy parameters and no additional information about the MDP. They consider the RL task as optimization with a stochastic first-order oracle and make use of the interaction structure to lower the variance of gradient estimations. They will be discussed in sec. 5.
value-based algorithms construct an optimal policy implicitly by obtaining an approximation of the optimal Q-function using dynamic programming. In DRL, the Q-function is represented with a neural network, and approximate dynamic programming is performed using a reduction to supervised learning. This framework will be discussed in sec. 3 and 4.
model-based algorithms exploit learned or given world dynamics, i.e. the distributions $p(s' \mid s, a)$ from $\mathbb{P}$. The class of algorithms to work with when the model is explicitly provided is represented by such algorithms as Monte-Carlo Tree Search; if it is not, it is possible to imitate the world dynamics by learning the outputs of the black box from interaction experience.
2.6 Measurements of performance
The achieved performance (score) in terms of average cumulative reward is not the only measure of RL algorithm quality. When speaking of real-life robots, the required amount of interaction is always the biggest concern. It is usually measured in terms of interaction steps (where a step is one transition performed by the environment) and is referred to as sample efficiency.
When simulation is more or less cheap, RL algorithms can be viewed as a special kind of optimization procedure. In this case, the final performance of the found policy is opposed to the required computational resources, measured by wall-clock time. In most cases RL algorithms can be expected to find a better policy after more iterations, but the amount of these iterations tends to be unjustifiably large.
The ratio between the amount of interactions and the wall-clock time required for one update of the policy varies significantly across algorithms. It is well known that model-based algorithms tend to have the greatest sample efficiency at the cost of expensive update iterations, while evolutionary algorithms require excessive amounts of interactions while providing massive opportunities for parallelization and reduction of wall-clock time. Value-based and policy gradient algorithms, which will be the focus of our further discussion, are known to lie somewhere in between.
3 Value-based algorithms
3.1 Temporal Difference learning
In this section we consider the temporal difference learning algorithm [21, Chapter 6], a classical Reinforcement Learning method at the base of the modern value-based approach in DRL.
The first idea behind this algorithm is to search for the optimal Q-function by solving a system of recursive equations, which can be derived by recalling the interconnection between the Q-function and the value function (3):

$$Q^\pi(s, a) = \mathbb{E}_{s'} \left[ r(s') + \gamma \mathbb{E}_{a' \sim \pi(a' \mid s')} Q^\pi(s', a') \right]$$
This equation, named the Bellman equation, remains true for value functions under any policy, including an optimal policy $\pi^*$:

(Bellman optimality equation)

$$Q^*(s, a) = \mathbb{E}_{s'} \left[ r(s') + \gamma \max_{a'} Q^*(s', a') \right] \quad (6)$$
The straightforward utilization of this result is as follows. Consider the tabular case, when both the state space and the action space are finite (and small enough to be listed in computer memory). Let us also assume for now that the transition probabilities are available to the training procedure. Then $Q^*$ can be represented as a finite table with $|\mathcal{S}| \times |\mathcal{A}|$ numbers. In this case, (6) just gives a set of equations for this table to satisfy.
Addressing the values of the table as unknown variables, this system of equations can be solved using the basic point iteration method: let $Q^*_0$ be arbitrary initial values of the table (with the only exception that for terminal states $s \in \mathcal{S}^+$, if any, $Q^*_0(s, a) = 0$ for all actions $a$). On each iteration the table is updated by substituting the current values of the table into the right-hand side of the equation until the process converges:

$$Q^*_{t+1}(s, a) = \mathbb{E}_{s'} \left[ r(s') + \gamma \max_{a'} Q^*_t(s', a') \right] \quad (7)$$
This straightforward approach to learning the optimal Q-function, named Q-learning, has been extensively studied in classical Reinforcement Learning. One of the central results is presented in the following convergence theorem:
Let $\mathcal{B}$ denote the operator updating $Q^*_t$ as in (7):

$$[\mathcal{B}Q](s, a) = \mathbb{E}_{s'} \left[ r(s') + \gamma \max_{a'} Q(s', a') \right]$$

for all state-action pairs $(s, a)$.

Then $\mathcal{B}$ is a contraction mapping, i.e. for any two tables $Q_1, Q_2$

$$\left\| \mathcal{B}Q_1 - \mathcal{B}Q_2 \right\|_\infty \le \gamma \left\| Q_1 - Q_2 \right\|_\infty$$

Therefore, there is a unique fixed point of the system of equations (7), and the point iteration method converges to it.
The contraction mapping property is actually of high importance. It demonstrates that the point iteration algorithm converges with exponential speed and requires a small number of iterations. As the true $Q^*$ is a fixed point of (6), the algorithm is guaranteed to yield a correct answer. The catch is that each iteration demands a full pass across all state-action pairs and exact computation of expectations over transition probabilities.
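When the transition probabilities are available, the point iteration above can be implemented directly. The following sketch assumes a toy tabular MDP encoded as explicit Python lists (`P[s][a]` is a list of `(probability, next_state)` pairs and `r[s]` is the reward for entering state `s`; this encoding is an assumption made for illustration):

```python
def q_iteration(P, r, gamma=0.9, n_iters=200):
    """Point iteration for the Bellman optimality equation on a tabular MDP."""
    n_states, n_actions = len(P), len(P[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iters):
        # apply the Bellman operator B to the whole table at once
        Q = [[sum(p * (r[s2] + gamma * max(Q[s2])) for p, s2 in P[s][a])
              for a in range(n_actions)]
             for s in range(n_states)]
    return Q

# two-state MDP: from state 0, action 0 stays and action 1 moves to state 1;
# state 1 is absorbing and yields reward 1 on every step
P = [[[(1.0, 0)], [(1.0, 1)]],
     [[(1.0, 1)], [(1.0, 1)]]]
r = [0.0, 1.0]
Q = q_iteration(P, r)
print(Q)  # converges to [[9.0, 10.0], [10.0, 10.0]]
```

Each sweep contracts the error by a factor of $\gamma$, so 200 iterations bring the table within numerical precision of the fixed point.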
In the general case, these expectations cannot be explicitly computed. Instead, the agent is restricted to samples from the transition probabilities gained during some interaction experience. The Temporal Difference (TD) algorithm (also known as TD(0) due to theoretical generalizations) proposes to collect this data and, after each gathered transition, update only one cell of the table:

$$Q^*(s, a) \leftarrow (1 - \alpha) Q^*(s, a) + \alpha \left( r(s') + \gamma \max_{a'} Q^*(s', a') \right) \quad (8)$$

where $\alpha$ plays the role of an exponential smoothing parameter for estimating the expectation from samples.
Two key ideas are introduced in the update formula (8): exponential smoothing instead of exact expectation computation, and cell-by-cell updates instead of updating the full table at once. Both are required to adapt the Q-learning algorithm for online application.
As the set of terminal states is usually unknown beforehand in the online setting, a slight modification of update (8) is used. If the observed next state $s'$ turns out to be terminal (recall the convention to denote this by the flag $\operatorname{done}$), its value function is known to be equal to zero:

$$V^*(s') = 0 \quad \text{for } s' \in \mathcal{S}^+$$

This knowledge is embedded in the update rule (8) by multiplying $\max_{a'} Q^*(s', a')$ by $(1 - \operatorname{done})$. For the sake of brevity, this factor is often omitted, but it should always be present in implementations.
A second important note about formula (8) is that it can be rewritten in the following equivalent way:

$$Q^*(s, a) \leftarrow Q^*(s, a) + \alpha \left( r(s') + \gamma \max_{a'} Q^*(s', a') - Q^*(s, a) \right) \quad (9)$$

The expression in the brackets, referred to as the temporal difference, represents the difference between the Q-value $Q^*(s, a)$ and its one-step approximation $r(s') + \gamma \max_{a'} Q^*(s', a')$, which must be zero in expectation for the true optimal Q-function.
The idea of exponential smoothing allows us to formulate the first practical algorithm, which can work in the tabular case with unknown world dynamics:
Temporal Difference algorithm

Hyperparameters: $\alpha$ — exponential smoothing parameter.

On each interaction step:
1. choose an action $a$ to perform (the choice must ensure sufficient exploration);
2. observe the transition $(s, a, r', s', \operatorname{done})$;
3. update one cell of the table according to (9).
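A tabular sketch of this procedure in Python (the toy chain environment, the constant smoothing parameter, and the $\varepsilon$-greedy exploration used here are illustrative choices, not prescriptions):

```python
import random

def td_q_learning(env_step, n_states, n_actions, terminal,
                  episodes=500, alpha=0.5, gamma=0.9, eps=0.3, seed=0):
    """Tabular TD (Q-learning): epsilon-greedy interaction plus update (9)."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(n_states) for a in range(n_actions)}
    for _ in range(episodes):
        s = 0
        while s not in terminal:
            # epsilon-greedy action selection for exploration
            if rng.random() < eps:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda b: Q[(s, b)])
            s_next, r = env_step(s, a)
            done = s_next in terminal
            # temporal difference update; terminal states have zero value
            target = r if done else r + gamma * max(Q[(s_next, b)] for b in range(n_actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q

# deterministic 3-state chain: action 1 advances, action 0 stays;
# reward 1 is received upon entering the terminal state 2
def chain_step(s, a):
    s_next = min(s + 1, 2) if a == 1 else s
    return s_next, (1.0 if s_next == 2 else 0.0)

Q = td_q_learning(chain_step, n_states=3, n_actions=2, terminal={2})
```

On this deterministic chain the table converges to the true values $Q^*(1, 1) = 1$ and $Q^*(0, 1) = \gamma$.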
It turns out that under several assumptions on state visitation during the interaction process, this procedure holds similar properties in terms of convergence guarantees, which are stated by the following theorem:
Let $\alpha_t(s, a)$ denote the smoothing parameter used at step $t$ for the pair $(s, a)$, with $\alpha_t(s, a) = 0$ if this pair was not updated at step $t$.

Then if for every state-action pair $(s, a)$

$$\sum_t \alpha_t(s, a) = \infty, \qquad \sum_t \alpha_t^2(s, a) < \infty,$$

the Temporal Difference algorithm converges to the optimal $Q^*$ with probability 1.
This theorem states that the basic point iteration method can actually be applied online in the way proposed by the TD algorithm, but it demands <<enough exploration>> from the strategy of interacting with the MDP during training. Satisfying this demand remains a unique and common problem of reinforcement learning.
The widespread kludge is the $\varepsilon$-greedy strategy, which suggests choosing a random action instead of $\operatorname{argmax}_a Q^*(s, a)$ with probability $\varepsilon$. The probability $\varepsilon$ is usually set close to 1 during the first interaction iterations and scheduled to decrease to a constant close to 0. This heuristic makes the agent visit all states with non-zero probability, independent of what the current approximation $Q^*$ suggests.
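A minimal sketch of $\varepsilon$-greedy selection together with a linear decay schedule (the particular start/final values and decay horizon are arbitrary illustrative choices):

```python
import random

def eps_greedy(q_values, eps, rng=random):
    """With probability eps pick a uniformly random action, otherwise the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def eps_schedule(step, eps_start=1.0, eps_final=0.05, decay_steps=10_000):
    """Linear decay from eps_start to eps_final over decay_steps interaction steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_final - eps_start)

print(eps_schedule(0))  # 1.0: fully random behaviour at the start of training
```

Note that a schedule that decays too quickly can violate the visitation conditions of the convergence theorem above, which is why $\varepsilon$ is usually kept bounded away from zero.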
The main practical issue with the Temporal Difference algorithm is that it requires the table $Q^*(s, a)$ to be explicitly stored in memory, which is impossible for MDPs with high state space complexity. This limitation substantially restricted its applicability until its combination with deep neural networks was proposed.
3.2 Deep Q-learning (DQN)
Utilization of neural nets to model either a policy or a Q-function frees us from constructing task-specific features and opens possibilities of applying RL algorithms to complex tasks, e.g. tasks with images as input. Video games are a classical example of such tasks, where raw pixels of the screen are provided as the state representation and, correspondingly, as input to either the policy or the Q-function.
The main idea of Deep Q-learning is to adapt the Temporal Difference algorithm so that the update formula (9) becomes equivalent to a gradient descent step for training a neural network to solve a certain regression task. Indeed, it can be noticed that the exponential smoothing parameter $\alpha$ resembles the learning rate of first-order gradient optimization procedures, while the exploration conditions from theorem 3.1 look identical to the restrictions on the learning rate of stochastic gradient descent.
The key hint is that (9) is actually a gradient descent step in the parameter space of the table functions family:

$$Q_\theta(s, a) = \theta_{s,a},$$

where all $\theta_{s,a}$ form a vector of parameters $\theta$.
To unravel this fact, it is convenient to introduce some notation from regression tasks. First, let us denote by $y$ the target of our regression task, i.e. the quantity that our model is trying to predict:

$$y(s, a) = r(s') + \gamma \max_{a'} Q^*(s', a') \quad (10)$$

where $s'$ is a sample from $p(s' \mid s, a)$ and the pair $(s, a)$ is the input data. In this notation, (9) is equivalent to:

$$\theta \leftarrow \theta + \alpha \left( y(s, a) - Q_\theta(s, a) \right) e_{s,a}$$
where we multiplied the scalar temporal difference $\left( y(s, a) - Q_\theta(s, a) \right)$ by the vector

$$e_{s,a} = (0, \dots, 0, 1, 0, \dots, 0)^T,$$

the one-hot encoding of the pair $(s, a)$, to formulate an update of only one component of $\theta$ in vector form. By this we transitioned to an update in the parameter space of $Q_\theta$. Remark that for the table functions family, the derivative of $Q_\theta$ with respect to $\theta$ for a given input $(s, a)$ is exactly this one-hot encoding:

$$\nabla_\theta Q_\theta(s, a) = e_{s,a}$$
The statement now is that this formula is a gradient descent update for a regression task with input $(s, a)$, target $y$, and the MSE loss function:

$$\operatorname{Loss}(\theta) = \left( Q_\theta(s, a) - y \right)^2$$

The obtained result is evidently a gradient descent step formula to minimize the MSE loss function with target (10):

$$\theta \leftarrow \theta + \alpha \left( y - Q_\theta(s, a) \right) \nabla_\theta Q_\theta(s, a) \quad (13)$$

(the factor 2 arising from differentiation of the square is absorbed into the learning rate $\alpha$).
It is important that the dependence of $y$ on $\theta$ is ignored during gradient computation (otherwise the chain rule application, with $y$ being dependent on $\theta$, would be incorrect). On each step of the temporal difference algorithm a new target is constructed using the current Q-function approximation, and a new regression task with this target is set. For this fixed target, one MSE optimization step is done according to (13), and on the next step a new regression task is defined. Though during each step the target is considered to represent some ground truth, as in supervised learning, here it merely provides a direction of optimization and for this reason is sometimes called a guess.
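The equivalence of the gradient step (13) to the TD update (9) for the table family can be checked numerically. In the sketch below the target is computed first and then treated as a constant, mirroring the fixed-target convention described above; as noted, the factor 2 from differentiating the square is absorbed into the learning rate:

```python
def mse_grad_step(theta, s, a, target, lr):
    """One gradient step on (Q_theta(s,a) - y)^2 for the table family
    Q_theta(s, a) = theta[s][a].

    The gradient w.r.t. theta is one-hot, so only the (s, a) cell changes,
    which is exactly the single-cell TD update."""
    theta[s][a] += lr * (target - theta[s][a])
    return theta

theta = [[0.0, 0.0], [0.0, 0.0]]  # 2 states x 2 actions, initialized to zero
gamma, r_next = 0.9, 1.0
y = r_next + gamma * max(theta[1])   # target built first, then held fixed
mse_grad_step(theta, s=0, a=1, target=y, lr=0.5)
print(theta[0][1])  # 0.5: moved halfway from 0 towards the target 1.0
```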
Notice that representation (13) is equivalent to the standard TD update (9), with all theoretical results remaining valid, as long as the parametric family is the table functions family. At the same time, (13) can be formally applied to any parametric function family, including neural networks. It must be taken into account that this transition is not rigorous, and all theoretical guarantees provided by theorem 3.1 are lost at this moment.
Further on we assume that the optimal Q-function is approximated with a neural network $Q_\theta(s, a)$ with parameters $\theta$. Note that for the discrete action space case this network may take only $s$ as input and output $|\mathcal{A}|$ numbers representing $Q_\theta(s, a)$ for each action, which allows finding an optimal action in a given state with a single forward pass through the net. Therefore, the target for a given transition can be computed with one forward pass, and the optimization step can be performed in one more forward pass and one backward pass (in implementations it is possible to combine $s$ and $s'$ in one batch and perform these two forward passes <<at once>>).
A small issue with this straightforward approach is that, of course, it is impractical to train neural networks with batches of size 1. It was proposed to use an experience replay to store all collected transitions as data samples and on each iteration sample a batch of a size standard for neural network training. As usual, the loss function is assumed to be an average of losses for each transition from the batch. This utilization of previously experienced transitions is legitimate because the TD algorithm is known to be an off-policy algorithm, which means it can work with arbitrary transitions gathered by any agent's interaction experience. One more important benefit of experience replay is sample decorrelation, as consecutive transitions from interaction are often similar to each other, since the agent usually locates at a particular part of the MDP.
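A minimal experience replay can be sketched with a bounded deque; the capacity value and the transition layout here are illustrative assumptions:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience replay: stores transitions and samples
    uniform batches for decorrelated gradient updates."""
    def __init__(self, capacity=100_000):
        # old transitions are evicted automatically once capacity is reached
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.storage), batch_size)

buf = ReplayBuffer(capacity=3)
for t in range(5):
    buf.add((t, 0, 0.0, t + 1, False))  # (s, a, r, s', done)
batch = buf.sample(2)
print(len(buf.storage))  # 3: only the most recent transitions are kept
```

Uniform sampling over a large buffer is what provides the decorrelation mentioned above; bounding the capacity additionally evicts stale transitions collected by long-outdated policies.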
Though the empirical results of the described algorithm turned out to be promising, the behaviour of $Q_\theta$ values indicated the instability of the learning process. Reconstruction of the target after each optimization step led to a so-called compound error, when approximation error propagated from the close-to-terminal states to the starting ones in an avalanche manner and could lead to the guess being many times bigger than the true value. To address this problem, a kludge known as the target network was introduced, whose basic idea is to solve a fixed regression problem for $K$ steps, i.e. recompute the target every $K$-th step instead of every step.
To avoid target recomputation for the whole experience replay, a copy of the neural network, called the target network, is stored. Its architecture is the same, while its weights $\theta^-$ are a copy of $\theta$ from the moment of the last target recomputation (an alternative, but more computationally expensive, option is to update the target network weights on each step using exponential smoothing), and its main purpose is to generate targets for the given current batch.
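Both variants of keeping the target network up to date can be sketched as follows (plain parameter lists stand in for real network weights; the update period and smoothing coefficient are illustrative choices):

```python
def hard_update(target_params, online_params, step, k):
    """Copy the online weights into the target network every k-th step."""
    if step % k == 0:
        target_params[:] = online_params[:]

def soft_update(target_params, online_params, tau=0.5):
    """Exponential-smoothing alternative: target slowly tracks the online net."""
    for i, w in enumerate(online_params):
        target_params[i] = (1 - tau) * target_params[i] + tau * w

online = [1.0, 2.0]
target = [0.0, 0.0]
hard_update(target, online, step=0, k=100)
print(target)  # [1.0, 2.0]: full copy performed on a sync step
```

In real implementations the soft variant typically uses a much smaller smoothing coefficient, so the target network lags smoothly behind the online one instead of jumping every $K$ steps.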
Combining all things together and adding the $\varepsilon$-greedy strategy to facilitate exploration, we obtain the classic DQN algorithm:
Deep Q-learning (DQN)

Hyperparameters: $B$ — batch size, $K$ — target network update frequency, $\varepsilon$ — greedy exploration parameter, $Q_\theta$ — neural network, SGD optimizer.

Initialize the weights $\theta$ arbitrarily, set $\theta^- \leftarrow \theta$

On each interaction step:
1. select $a$ randomly with probability $\varepsilon$, else $a = \operatorname{argmax}_a Q_\theta(s, a)$
2. add the observed transition $(s, a, r', s', \operatorname{done})$ to the experience replay
3. sample a batch of size $B$ from the experience replay
4. for each transition $(s, a, r', s', \operatorname{done})$ from the batch, compute the target:
$$y = r' + \gamma (1 - \operatorname{done}) \max_{a'} Q_{\theta^-}(s', a')$$
5. make a step of gradient descent using
$$\nabla_\theta \frac{1}{B} \sum_{\text{batch}} \left( Q_\theta(s, a) - y \right)^2$$
6. every $K$-th step: $\theta^- \leftarrow \theta$
3.3 Double DQN
Although the target network successfully prevented $Q_\theta$ from unbounded growth and empirically stabilized the learning process, the values of $Q_\theta$ on many domains were evidently prone to overestimation. The problem is presumed to reside in the max operation in the target construction formula (10):

$$y = r' + \gamma \max_{a'} Q_{\theta^-}(s', a')$$

Here the max operation shifts the Q-value estimation towards either those actions that led to high reward due to luck, or the actions with overestimating approximation error.
The solution is based on the idea of separating action selection and action evaluation, carrying out each of these operations using its own approximation of $Q^*$:

$$y = r' + \gamma Q_1\left(s', \operatorname*{argmax}_{a'} Q_2(s', a')\right)$$

The simplest, but expensive, implementation of this idea is to run two independent DQN algorithms (<<Twin DQN>>) and use the twin network to evaluate actions:

$$y_1 = r' + \gamma Q_{\theta_2}\left(s', \operatorname*{argmax}_{a'} Q_{\theta_1}(s', a')\right), \qquad y_2 = r' + \gamma Q_{\theta_1}\left(s', \operatorname*{argmax}_{a'} Q_{\theta_2}(s', a')\right)$$
Intuitively, each Q-function here may prefer lucky or overestimated actions, but the other Q-function judges them according to its own luck and approximation error, which may be underestimating as well as overestimating. Ideally, these two DQNs should not share interaction experience to achieve that, which makes such an algorithm twice as expensive both in terms of computational cost and sample efficiency.
Double DQN  is a compromise option which suggests using the current weights $\theta$ of the network for action selection and the target network weights $\theta^-$ for action evaluation, assuming that when the target network update frequency $K$ is large enough, these two networks are sufficiently different:
$$y(T) = r + \gamma Q_{\theta^-}\bigl(s', \operatorname{argmax}_{a'} Q_{\theta}(s', a')\bigr)$$
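The decoupling of selection and evaluation amounts to a one-line change relative to the vanilla DQN target. A minimal numpy sketch (function name and toy values are ours):

```python
import numpy as np

def double_dqn_targets(rewards, next_q_online, next_q_target, dones, gamma=0.99):
    """y = r + gamma * Q_target(s', argmax_a' Q_online(s', a')).

    The online network selects the action; the target network evaluates it.
    """
    best = next_q_online.argmax(axis=1)                    # selection
    evaluated = next_q_target[np.arange(len(best)), best]  # evaluation
    return rewards + gamma * (1.0 - dones) * evaluated

# toy batch: 2 transitions, 2 actions
r = np.array([0.0, 1.0])
done = np.zeros(2)
q_online = np.array([[1.0, 0.0], [0.0, 2.0]])
q_tgt = np.array([[5.0, 7.0], [3.0, 4.0]])
y = double_dqn_targets(r, q_online, q_tgt, done, gamma=0.9)
```

Compare with plain DQN on the same batch: the max over `q_tgt` would bootstrap from 7.0 and 4.0, while here the online network's argmax picks columns 0 and 1, so the bootstrapped values are 5.0 and 4.0.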
3.4 Dueling DQN
Another issue with the DQN algorithm LABEL:DQNalgorithm emerges when a huge part of the considered MDP consists of states with low optimal value $V^*(s)$, which is often the case. The problem is that when the agent visits an unpromising state $s$, instead of lowering its value estimate it remembers only the low pay-off for performing some particular action $a$ in it by updating $Q_\theta(s, a)$. This leads to regular returns to this state during future interactions until all actions prove to be unpromising and all $Q_\theta(s, a)$ are updated. The problem gets worse when the cardinality of the action space is high or there are many similar actions in the action space.
One benefit of deep reinforcement learning is that we are able to facilitate generalization across actions by specifying the architecture of the neural network. To do so, we need to encourage the learning of $V^*(s)$ from updates of $Q^*(s, a)$. The idea of the dueling architecture  is to incorporate an approximation of $V^*(s)$ explicitly in the computational graph. For that purpose we need the definition of the advantage function:
For a given MDP and policy $\pi$, the advantage function under policy $\pi$ is defined as
$$A^\pi(s, a) := Q^\pi(s, a) - V^\pi(s)$$
The advantage function is evidently interconnected with the Q-function and value function and actually shows the relative advantage of selecting action $a$ compared to the average performance of the policy. If $A^\pi(s, a) > 0$ for some state $s$, then modifying $\pi$ to select $a$ more often in this particular state will lead to a better policy, as its average return will become bigger than the initial $V^\pi(s)$. This follows from the following property of an arbitrary advantage function:
$$\mathbb{E}_{a \sim \pi(a \mid s)} A^\pi(s, a) = \mathbb{E}_{a \sim \pi(a \mid s)} Q^\pi(s, a) - V^\pi(s) = 0$$
The definition of the optimal advantage function $A^*$ is analogous and allows us to reformulate $Q^*$ in terms of $V^*$ and $A^*$:
$$Q^*(s, a) = V^*(s) + A^*(s, a)$$
The straightforward utilization of this decomposition is the following: after several feature-extracting layers the network is split into two heads, one outputting a single scalar $V(s)$ and one outputting $|\mathcal{A}|$ numbers $A(s, a)$, like it was done in DQN for the Q-function. After that, this scalar value estimate is added to all components of $A(s, a)$ in order to obtain $Q(s, a)$ according to (16). The problem with this naive approach is that due to (15) the advantage function can not be arbitrary and must satisfy property (15) for $V^*(s)$ to be identifiable.
This restriction (15) on the advantage function can be simplified for the case when the optimal policy is induced by the optimal Q-function:
$$\max_a A^*(s, a) = 0$$
This condition can be easily satisfied in the computational graph by subtracting $\max_a A(s, a)$ from the advantage head. This is equivalent to the following formula of dueling DQN:
$$Q(s, a) = V(s) + A(s, a) - \max_{a'} A(s, a')$$
An interesting nuance of this improvement is that after evaluation on Atari-57 the authors discovered that substituting the max operation in (17) with averaging across actions led to better results (while usage of the unidentifiable formula (16) led to poor performance). Although gradients can be backpropagated through both operations and formula (17) seems theoretically justified, in practical implementations averaging instead of maximum is widespread.
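The aggregation step of the two heads can be sketched as follows; the head outputs are given as plain arrays here, and the function name and `mode` switch between the max-based formula (17) and the averaging variant are our own illustration.

```python
import numpy as np

def dueling_q(value, advantages, mode="mean"):
    """Combine the V(s) and A(s, .) heads into Q(s, .).

    value: shape (B, 1); advantages: shape (B, num_actions).
    'max' follows the theoretically justified formula; 'mean' is the
    variant commonly used in practice.
    """
    if mode == "max":
        return value + advantages - advantages.max(axis=1, keepdims=True)
    return value + advantages - advantages.mean(axis=1, keepdims=True)

v = np.array([[2.0]])
adv = np.array([[1.0, 3.0]])
q_max = dueling_q(v, adv, mode="max")    # greedy action gets exactly V(s)
q_mean = dueling_q(v, adv, mode="mean")  # advantages are zero-centered
```

With the max variant, the greedy action's Q-value equals the scalar head output exactly; with the mean variant it does not, but the identifiability constraint (zero-centered advantages) still holds.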
3.5 Noisy DQN
By default, the DQN algorithm does not address the exploration problem and is always augmented with an $\varepsilon$-greedy strategy to force the agent to discover new states. This baseline exploration strategy suffers from being extremely hyperparameter-sensitive: an early decrease of $\varepsilon$ to close-to-zero values may lead to getting stuck in local optima, when the agent is unable to explore new options due to imperfect $Q_\theta$, while high values of $\varepsilon$ force the agent to behave randomly for an excessive number of episodes, which slows down learning. In other words, the $\varepsilon$-greedy strategy transfers the responsibility for solving the exploration-exploitation trade-off onto the engineer.
The key reason why the $\varepsilon$-greedy exploration strategy is relatively primitive is that exploration priority does not depend on the current state. Intuitively, the choice whether to exploit knowledge by selecting an approximately optimal action or to explore the MDP by selecting some other depends on how well explored the current state is. Discovering a new part of the state space after any amount of interaction probably indicates that random actions are worth trying there, while close-to-initial states will probably be sufficiently explored after the first several episodes.
In the $\varepsilon$-greedy strategy, the agent selects an action using deterministic $Q_\theta$ and only afterwards injects state-independent noise in the form of a probability of choosing a random action. Noisy networks  were proposed as a simple extension of DQN to provide state-dependent and parameter-free exploration by injecting noise of trainable volume into all (or most; usually it is not injected into the very first layers responsible for feature extraction, such as convolutional layers in networks with images as input) nodes of the computational graph.
Let a linear layer with $m$ inputs and $n$ outputs in the q-network perform the following computation:
$$y = W x + b$$
where $x \in \mathbb{R}^m$ is input, $W \in \mathbb{R}^{n \times m}$ — weights matrix, $b \in \mathbb{R}^n$ — bias. In noisy layers it is proposed to substitute these deterministic parameters with samples from $\mathcal{N}(\mu, \sigma^2)$, where $\mu, \sigma$ are trained with gradient descent (using the standard reparametrization trick). On the forward pass through the noisy layer we sample $\varepsilon_W, \varepsilon_b \sim \mathcal{N}(0, I)$ and then compute
$$y = (\mu_W + \sigma_W \odot \varepsilon_W) x + \mu_b + \sigma_b \odot \varepsilon_b$$
where $\odot$ denotes element-wise multiplication, $\mu_W, \sigma_W, \mu_b, \sigma_b$ — trainable parameters of the layer. Note that the number of parameters in such layers is doubled compared to ordinary layers.
As the output of the q-network now becomes a random variable, the loss value becomes a random variable too. As in similar models for supervised learning, on each step an expectation of the loss function over noise is minimized:
$$\mathbb{E}_{\varepsilon} \operatorname{Loss}(\theta, \varepsilon) \to \min_\theta$$
The gradient in this setting can be estimated using Monte-Carlo:
$$\nabla_\theta \mathbb{E}_{\varepsilon} \operatorname{Loss}(\theta, \varepsilon) = \mathbb{E}_{\varepsilon} \nabla_\theta \operatorname{Loss}(\theta, \varepsilon) \approx \nabla_\theta \operatorname{Loss}(\theta, \varepsilon), \quad \varepsilon \sim \mathcal{N}(0, I)$$
It can be seen that the amount of noise actually affecting the output of the network may vary for different inputs, i.e. for different states. There are no guarantees that this amount will reduce as the interaction proceeds; the behaviour of the average magnitude of noise injected into the network over time is reported to be extremely sensitive to the initialization of $\sigma$ and to vary from MDP to MDP.
One technical issue with noisy layers is that on each pass an excessive amount (of the order of the number of network parameters) of noise samples is required. This may substantially reduce the computational efficiency of a forward pass through the network. For optimization purposes it is proposed to obtain noise for weight matrices in the following way: sample just $m + n$ noise samples $\varepsilon^{in} \in \mathbb{R}^m$, $\varepsilon^{out} \in \mathbb{R}^n$ and acquire matrix noise in a factorized form:
$$\varepsilon_{ij} = f(\varepsilon^{in}_i)\, f(\varepsilon^{out}_j)$$
where $f$ is a scaling function, e.g. $f(x) = \operatorname{sign}(x)\sqrt{|x|}$. The benefit of this procedure is that it requires $m + n$ samples instead of $m n$, but it sacrifices the independence of the noise entries.
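A forward pass of a factorized noisy layer can be sketched in numpy as below. This is an illustration under our own conventions (weights stored as an $(m, n)$ matrix applied as `x @ w`; the bias reuses the output-side noise, as is common in factorized implementations), not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Scaling function f(x) = sign(x) * sqrt(|x|)."""
    return np.sign(x) * np.sqrt(np.abs(x))

def noisy_linear(x, mu_w, sigma_w, mu_b, sigma_b):
    """Forward pass of a factorized noisy linear layer.

    mu_* and sigma_* are the trainable parameters; fresh noise is drawn on
    every pass, and only m + n scalar samples are needed instead of m * n.
    """
    m, n = mu_w.shape
    eps_in = f(rng.standard_normal(m))
    eps_out = f(rng.standard_normal(n))
    eps_w = np.outer(eps_in, eps_out)  # factorized weight noise, shape (m, n)
    w = mu_w + sigma_w * eps_w
    b = mu_b + sigma_b * eps_out       # bias reuses the output-side noise
    return x @ w + b

# with sigma = 0 the layer degenerates to an ordinary deterministic one
x = np.ones(3)
mu_w = np.arange(12.0).reshape(3, 4)
out = noisy_linear(x, mu_w, np.zeros_like(mu_w), np.zeros(4), np.zeros(4))
```

Setting all $\sigma$ parameters to zero recovers the ordinary linear layer, which is a convenient sanity check for an implementation.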
3.6 Prioritized experience replay
In DQN, each batch of transitions is sampled from experience replay using a uniform distribution, treating collected data as equally prioritized. In such a scheme, states for each update come from the same distribution as in interaction experience (except that they become decorrelated), which agrees with the TD algorithm as the foundation of DQN.
Intuitively, observed transitions vary in their importance. At the beginning of training, most guesses tend to be more or less random as they rely on an arbitrarily initialized $Q_\theta$, and the only source of trusted information are transitions with non-zero received reward, especially near terminal states, where the value of the next state is known to be equal to 0. In the midway of training, most of experience replay is filled with the memory of interaction within a well-learned part of the MDP, while the most crucial information is contained in transitions where the agent explored new promising areas and gained novel reward yet to be propagated through the Bellman equation. All these significant transitions are drowned in collected data and rarely appear in sampled batches.
The central idea of prioritized experience replay  is that the priority of a transition $T = (s, a, r, s')$ is proportional to its temporal difference:
$$\rho(T) = \left| r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a) \right|$$
Using these priorities as a proxy of transition importance, sampling from experience replay proceeds using the following probabilities:
$$P(T) \propto \rho(T)^{\alpha}$$
where the hyperparameter $\alpha \ge 0$ controls the degree to which the sampling weights are sharpened: the case $\alpha = 0$ corresponds to the uniform sampling distribution, while $\alpha \to \infty$ is equivalent to greedy sampling of the transitions with the highest priority.
The problem with claim (18) is that each transition's priority changes after each network update. As it is impractical to recalculate the loss for the whole data after each step, some simplifications must be accepted. The straightforward option is to update priorities only for the transitions sampled in the current batch. New transitions can be added to experience replay with the highest priority, i.e. $\max_T \rho(T)$ (which can be computed online with $O(1)$ complexity).
The second debatable issue of prioritized replay is that it actually substitutes the loss function of DQN updates, which assumed uniform sampling of visited states to ensure they come from the state visitation distribution:
$$\mathbb{E}_{T \sim \text{uniform}} \left( Q_\theta(s, a) - y(T) \right)^2 \to \min_\theta$$
While it is not clear what distribution is better to sample from to ensure the exploration restrictions of theorem 3.1, prioritized experience replay changes this distribution in an uncontrollable way. Despite its fruitfulness at the beginning and midway of the training process, this distribution shift may destabilize learning close to the end and make the algorithm stuck with a locally optimal policy. Since formally this issue is about estimating an expectation over one probability distribution while preferring to sample from another one, the standard technique called importance sampling can be used as a countermeasure:
$$w(T) = \frac{1}{N \cdot P(T)}$$
where $N$ is the number of transitions stored in experience replay memory. Importance sampling implies that we can avoid the distribution shift that introduces undesired bias by making smaller gradient updates for significant transitions, which now appear in batches with higher frequency. The price for bias elimination is that importance sampling weights lower the prioritization effect by slowing down the learning of highlighted new information.
This duality resembles the bias-variance trade-off, but the important moment here is that the distribution shift does not cause any seeming issues at the beginning of training, when the agent behaves close to randomly and does not produce a valid state visitation distribution anyway. The idea proposed in , based on this intuition, is to anneal the importance sampling weights so that they correct the bias properly only towards the end of the training procedure:
$$w(T) = \left( \frac{1}{N \cdot P(T)} \right)^{\beta}$$
where $\beta \in [0, 1]$ approaches 1 (often it is initialized by a constant close to 0 and is linearly increased until it reaches 1) as more interaction steps are executed. If $\beta$ is set to 0, no bias correction is held, while $\beta = 1$ corresponds to the unbiased loss function, i.e. equivalent to sampling from the uniform distribution.
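The sampling probabilities and importance sampling weights can be sketched together in a few lines. This is a simplified illustration (a real implementation would use a sum-tree for $O(\log N)$ sampling); the function name is ours, and normalizing the weights by their maximum is a common stabilization trick rather than part of the formulas above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prioritized(priorities, batch_size, alpha=0.6, beta=0.4):
    """Sample indices with P(T) ~ p^alpha and return IS weights (N*P)^-beta.

    Weights are normalized by their maximum so updates are only scaled down.
    """
    p = np.asarray(priorities, dtype=float) ** alpha
    probs = p / p.sum()
    idx = rng.choice(len(probs), size=batch_size, p=probs)
    weights = (len(probs) * probs[idx]) ** (-beta)
    return idx, weights / weights.max()

# with alpha = 0 sampling is uniform and all IS weights collapse to 1
idx, w = sample_prioritized([1.0, 2.0, 3.0, 4.0], batch_size=3,
                            alpha=0.0, beta=1.0)
```

The sanity check in the last lines mirrors the text: $\alpha = 0$ recovers uniform replay, in which case no bias correction is needed and every weight equals 1.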
The most significant and obvious drawback of the prioritized experience replay approach is that it introduces additional hyperparameters. Although $\alpha$ is a single number, the algorithm's behaviour may turn out to be sensitive to its choice, and $\beta$ must be designed by the engineer as some scheduled motion from something near 0 to 1, whose well-tuned selection may require inaccessible knowledge about how many steps it will take for the algorithm to <<warm up>>.
3.7 Multi-step DQN
One more widespread modification of Q-learning in the RL community is substituting the one-step approximation present in the Bellman optimality equation (6) with an $N$-step one:
($N$-step Bellman optimality equation)
$$Q^*(s, a) = \mathbb{E} \left[ \sum_{t=0}^{N-1} \gamma^t r_t + \gamma^N \max_{a'} Q^*(s_{+N}, a') \right]$$
Indeed, the definition of $Q^*$ consists of the average return and can be viewed as making infinitely many steps from state $s$ after selecting action $a$, while the vanilla Bellman optimality equation represents $Q^*$ as the reward from one next step in the environment plus a recursive estimate of the rest of the trajectory reward. The $N$-step Bellman equation (19) generalizes these two opposites.
All the same reasoning as for DQN can be applied to the $N$-step Bellman equation to obtain the $N$-step DQN algorithm, whose only modification appears in the target computation:
$$y(T) = \sum_{t=0}^{N-1} \gamma^t r_t + \gamma^N \max_{a'} Q_{\theta^-}(s_{+N}, a')$$
To perform this computation, we are required to obtain, for a given state $s$, not only one next step, but $N$ steps. To do so, instead of transitions, $N$-step roll-outs are stored, which can be done by precomputing the following tuples:
$$\left( s,\; a,\; \sum_{t=0}^{N-1} \gamma^t r_t,\; s_{+N},\; \text{done} \right)$$
where $\sum_{t=0}^{N-1} \gamma^t r_t$ is the discounted reward received during the $N$ steps after visitation of the considered state $s$, $s_{+N}$ is the state visited in $N$ steps, and done is a flag indicating whether the episode ended during the $N$-step roll-out (all $N$-step roll-outs must be considered, including those terminated at the $n$-th step for $n < N$). All other aspects of the algorithm remain the same in practical implementations, and the case $N = 1$ corresponds to standard DQN.
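Precomputing these roll-out tuples from an episode segment can be sketched as follows; this is a simplified illustration (our own function name and return format, reporting the bootstrap state by its offset in the segment) rather than a production replay buffer.

```python
def nstep_rollouts(rewards, dones, n, gamma=0.99):
    """For each step t, compute sum_{k<n} gamma^k * r_{t+k}, the offset of
    the bootstrap state, and a terminal flag, truncating at episode end.

    rewards, dones: per-step lists for one episode segment.
    Returns a list of (n_step_return, bootstrap_offset, done) tuples.
    """
    out = []
    total = len(rewards)
    for t in range(total):
        g, done, steps = 0.0, False, 0
        for k in range(n):
            if t + k >= total:
                break
            g += (gamma ** k) * rewards[t + k]
            steps = k + 1
            if dones[t + k]:
                done = True  # roll-out terminated before n steps
                break
        out.append((g, t + steps, done))
    return out

# 3-step episode ending in a terminal state, n = 2, gamma = 0.5
outs = nstep_rollouts([1.0, 1.0, 1.0], [False, False, True], n=2, gamma=0.5)
```

Note how the last roll-outs are truncated at the episode boundary and marked as terminal, matching the footnote above about roll-outs terminated at the $n$-th step for $n < N$.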
The goal of using $N > 1$ is to accelerate the propagation of reward from terminal states backwards through visited states, as fewer update steps will be required to take into account freshly observed reward and to optimize behaviour at the beginning of episodes. The price is that formula (20) includes an important subtlety: to calculate such a target, the actions on the second (and following) steps must be sampled from $\pi^*$ for the Bellman equation (19) to remain true. In other words, the application of $N$-step Q-learning is theoretically improper when the behaviour policy differs from $\pi^*$. Note that we do not face this problem in the case $N = 1$, in which we are required to sample only from the transition probability for a given state-action pair.
Even considering $\pi_\theta \approx \pi^*$, where $Q_\theta$ is our current approximation of $Q^*$, makes $N$-step DQN an on-policy algorithm, in which for every state-action pair it is preferable to sample the target using the closest available approximation of $\pi^*$. This questions the usage of experience replay, or at the very least encourages limiting its capacity to store only the newest transitions, with $N$ kept relatively small.
To see the negative effect of $N$-step DQN, consider the following toy example. Suppose the agent makes a mistake on the second step after $(s, a)$ and ends the episode with a huge negative reward. Then in the case $N > 1$, each time the roll-out starting with this $(s, a)$ is sampled in the batch, the value of $Q_\theta(s, a)$ will be updated with this received negative reward, even if $Q_\theta$ has already learned not to repeat this mistake.
Yet empirical results in many domains demonstrate that raising $N$ from 1 to 2-3 may result in substantial acceleration of training and positively affect the final performance. At the same time, the theoretical groundlessness of this approach explains its negative effects when $N$ is set too large.
4 Distributional approach for value-based methods
4.1 Theoretical foundations
The setting of an RL task inherently carries internal stochasticity over which the agent has no substantial control. Sometimes intelligent behaviour implies taking risks with a severe chance of low episode return. All this information resides in the distribution of the return (1) as a random variable.
While value-based methods aim at learning the expectation of this random variable, as it is the quantity we actually care about, in the distributional approach  it is proposed to learn the whole distribution of returns. This further extends the information gathered by the algorithm about the MDP towards the model-based case, in which the whole MDP is imitated by learning both the reward function and the transitions, but it still restricts itself only to the reward and does not intend to learn a world model.
In this section we discuss some theoretical extensions of temporal difference ideas in the case when expectations on both sides of Bellman equation (5) and Bellman optimality equation (6) are taken away.
The central object of study in Q-learning was the Q-function, which for a given state and action returns the expected return. To rewrite the Bellman equation not in terms of expectations but in terms of whole distributions, we require corresponding notation.
For a given MDP and policy $\pi$, the value distribution of policy $\pi$ is a random variable defined as
$$Z^\pi(s, a) := \sum_{t=0}^{\infty} \gamma^t r_t, \quad s_0 = s,\; a_0 = a,\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim p(\cdot \mid s_t, a_t)$$
Note that $Z^\pi$ just represents the random variable whose expectation is taken in the definition of the $Q$-function:
$$Q^\pi(s, a) = \mathbb{E}\, Z^\pi(s, a)$$
Using this definition of value distribution, the Bellman equation can be rewritten to extend the recursive connection between adjacent states from expectations of returns to the whole distributions of returns: (Distributional Bellman Equation)
$$Z^\pi(s, a) \stackrel{D}{=} r(s, a) + \gamma Z^\pi(s', a'), \quad s' \sim p(\cdot \mid s, a),\; a' \sim \pi(\cdot \mid s')$$
Here we used some auxiliary notation: by
$$X \stackrel{D}{=} Y$$
we mean that the cumulative distribution functions of the random variables on the left and right are equal almost everywhere. Such equations are called recursive distributional equations and are well known in probability theory.
While the space of Q-functions is finite-dimensional, the space of value distributions is a space of mappings from state-action pairs to continuous distributions:
$$\mathcal{Z} = \{ Z \colon \mathcal{S} \times \mathcal{A} \to \mathcal{P}(\mathbb{R}) \}$$
and it is important to notice that even in the tabular case, when state and action spaces are finite, the space of value distributions is essentially infinite-dimensional. A crucial moment for us will be that convergence properties now depend on the chosen metric (in finite-dimensional spaces, convergence in one metric guarantees convergence to the same point in any other metric).
The choice of metric in $\mathcal{Z}$ represents the same issue as in the space of continuous random variables: if we choose a metric in the latter, we can construct one in the former. If $d$ is a metric in the space of random variables, then
$$\bar{d}(Z_1, Z_2) := \sup_{s, a} d\bigl(Z_1(s, a), Z_2(s, a)\bigr)$$
is a metric in the space $\mathcal{Z}$.
A particularly interesting example of a metric for us will be the Wasserstein metric, which concerns only random variables with bounded moments, so we will additionally assume that for all state-action pairs the moments
$$\mathbb{E} \left| Z(s, a) \right|^p$$
are finite for $p \ge 1$.
For two random variables $X, Y$ on a continuous domain with bounded $p$-th moments and cumulative distribution functions $F_X$ and $F_Y$ correspondingly, a Wasserstein distance for $p \in [1, \infty)$ is defined as
$$W_p(X, Y) := \left( \int_0^1 \left| F_X^{-1}(\omega) - F_Y^{-1}(\omega) \right|^p d\omega \right)^{1/p}$$
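For empirical distributions of equal sample size, the inverse CDFs in this quantile formulation reduce to sorted samples, so the distance can be computed in a few lines. A minimal numpy sketch under that equal-size assumption (the function name is ours):

```python
import numpy as np

def wasserstein_p(x, y, p=1):
    """W_p distance between two empirical distributions of equal size.

    For samples, the inverse CDFs reduce to order statistics, so W_p^p is
    the mean p-th power gap between sorted values.
    """
    xs, ys = np.sort(np.asarray(x)), np.sort(np.asarray(y))
    return float(np.mean(np.abs(xs - ys) ** p)) ** (1.0 / p)

# shifting a distribution by a constant c moves it by exactly |c| in W_1
d = wasserstein_p([0.0, 1.0], [2.0, 3.0], p=1)
```

The shift example reflects a basic property of the metric: translating a distribution by a constant changes $W_1$ by exactly the absolute value of that constant.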