Introduction
The mean of the return has long been the focus of reinforcement learning (RL), and many methods have been developed to learn this mean quantity (Sutton 1988; Watkins and Dayan 1992; Mnih et al. 2015). Thanks to advances in distributional RL (Jaquette 1973; Bellemare, Dabney, and Munos 2017), we are now able to learn the full distribution of a state-action value, not only its mean. In particular, Dabney et al. (2017) used a set of quantiles to approximate this value distribution. However, decision making in prevailing distributional RL methods is still based on the mean (Bellemare, Dabney, and Munos 2017; Dabney et al. 2017; Barth-Maron et al. 2018; Qu et al. 2018). The main motivation of this paper is to answer two questions: how to make decisions based on the full distribution, and whether an agent can thereby benefit from better exploration. In this paper, we propose the Quantile Option Architecture (QUOTA) for control. In QUOTA, decision making is based on all quantiles of a state-action value distribution, not only the mean.
In both traditional RL and recent distributional RL, an agent selects an action greedily with respect to the mean of the action values. In QUOTA, we instead propose to select an action greedily with respect to a certain quantile of the action value distribution. A high quantile represents an optimistic estimation of the action value, and action selection based on a high quantile corresponds to an optimistic exploration strategy. A low quantile represents a pessimistic estimation of the action value, and action selection based on a low quantile corresponds to a pessimistic exploration strategy. (Both exploration strategies are related to risk-sensitive RL, which will be discussed later.) We first compare different exploration strategies in two Markov chains, where naive mean-based RL algorithms fail to explore efficiently because they cannot exploit the distribution information acquired during training, which is crucial for efficient exploration. In the first chain, faster exploration comes from a high quantile (i.e., an optimistic exploration strategy). In the second chain, however, exploration benefits from a low quantile (i.e., a pessimistic exploration strategy). Different tasks need different exploration strategies. Even within a single task, an agent may need different exploration strategies at different stages of learning. To address this issue, we use the option framework (Sutton, Precup, and Singh 1999). We learn a high-level policy to decide which quantile to use for action selection. In this way, different quantiles function as different options, and we name this special kind of option the quantile option. QUOTA adaptively switches between pessimistic and optimistic exploration strategies, resulting in consistently improved exploration across different tasks. We make two main contributions in this paper:

First, we propose QUOTA for control in discrete-action problems, combining distributional RL with options. Action selection in QUOTA is based on certain quantiles of the state-action value distribution instead of its mean, and QUOTA learns a high-level policy to decide which quantile to use for decision making.

Second, we extend QUOTA to continuous-action problems. In a continuous-action space, quantile-based action selection is not straightforward to apply. To address this issue, we introduce quantile actors. Each quantile actor is responsible for proposing an action that maximizes one specific quantile of the state-action value distribution.
We show empirically that QUOTA improves the exploration of RL agents, resulting in a performance boost in both challenging video games (Atari games) and physical robot simulators (Roboschool tasks).
In the rest of this paper, we first present some preliminaries of RL. We then show two Markov chains where naive mean-based RL algorithms fail to explore efficiently. Next, we present QUOTA for both discrete- and continuous-action problems, followed by empirical results. Finally, we give an overview of related work and closing remarks.
Preliminaries
We consider a Markov Decision Process (MDP) of a state space $\mathcal{S}$, an action space $\mathcal{A}$, a reward "function" $R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, which we treat as a random variable in this paper, a transition kernel $p: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0, 1]$, and a discount ratio $\gamma \in [0, 1)$. We use $\pi: \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]$ to denote a stochastic policy. We use $Z^\pi(s, a)$ to denote the random variable of the sum of the discounted rewards in the future, following the policy $\pi$ and starting from the state $s$ and the action $a$. We have $Z^\pi(s, a) \doteq \sum_{t=0}^{\infty} \gamma^t R(S_t, A_t)$, where $S_0 = s$, $A_0 = a$, $S_{t+1} \sim p(\cdot \mid S_t, A_t)$, and $A_{t+1} \sim \pi(\cdot \mid S_{t+1})$. The expectation of the random variable $Z^\pi(s, a)$ is $Q^\pi(s, a) \doteq \mathbb{E}_{\pi, p, R}[Z^\pi(s, a)]$, which is usually called the state-action value function. We have the Bellman equation
$$Q^\pi(s, a) = \mathbb{E}[R(s, a)] + \gamma \, \mathbb{E}_{s' \sim p(\cdot \mid s, a), \, a' \sim \pi(\cdot \mid s')}\big[Q^\pi(s', a')\big].$$
In an RL problem, we are usually interested in finding an optimal policy $\pi^*$ such that $Q^{\pi^*}(s, a) \geq Q^\pi(s, a)$ for all $(\pi, s, a)$. All possible optimal policies share the same (optimal) state-action value function $Q^*$. This $Q^*$ is the unique fixed point of the Bellman optimality operator $\mathcal{T}$ (Bellman 2013):
$$Q(s, a) = \mathcal{T} Q(s, a) \doteq \mathbb{E}[R(s, a)] + \gamma \, \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\big[\max_{a'} Q(s', a')\big].$$
With a tabular representation, we can use Q-learning (Watkins and Dayan 1992) to estimate $Q^*$. The incremental update per step is
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big( r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big), \qquad (1)$$
where $\alpha$ is a step size and the quadruple $(s_t, a_t, r_{t+1}, s_{t+1})$ is a transition. There is a large body of research extending Q-learning to linear function approximation (Sutton and Barto 2018; Szepesvári 2010). In this paper, we focus on Q-learning with neural networks.
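As a concrete illustration, here is a minimal sketch of the tabular update in Equation (1), written in Python/NumPy. The environment interface (`env.reset`, `env.step`) and the constants (number of states and actions, step size, exploration rate) are illustrative assumptions, not part of the paper.

```python
import numpy as np

def q_learning(env, num_states, num_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy (Eq. 1)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection w.r.t. the mean value estimate Q
            if rng.random() < epsilon:
                a = rng.integers(num_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)          # assumed environment API
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])  # incremental update (1)
            s = s_next
    return Q
```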
Mnih et al. (2015) proposed the Deep-Q-Network (DQN), where a deep convolutional neural network $Q(s, a; \theta)$ is used to parameterize $Q^*$. At every time step, DQN performs stochastic gradient descent to minimize
$$\frac{1}{2}\big( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \big)^2,$$
where the quadruple $(s, a, r, s')$ is a transition sampled from the replay buffer (Lin 1992) and $\theta^-$ are the parameters of the target network (Mnih et al. 2015), a copy of $\theta$ that is synchronized with $\theta$ periodically.
To speed up training and reduce the memory required by DQN, Mnih et al. (2016) further proposed $n$-step asynchronous Q-learning with multiple workers (detailed in the Supplementary Material), where the loss function at time step $t$ is
$$\frac{1}{2}\Big( \sum_{i=1}^{n} \gamma^{i-1} r_{t+i} + \gamma^{n} \max_{a'} Q(s_{t+n}, a'; \theta^-) - Q(s_t, a_t; \theta) \Big)^2.$$
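The following snippet sketches how the $n$-step target in this loss could be computed for a single worker; the reward buffer and the bootstrap values are placeholders used only for illustration.

```python
import numpy as np

def n_step_target(rewards, bootstrap_q, gamma=0.99):
    """sum_{i=1..n} gamma^(i-1) r_{t+i} + gamma^n * max_a' Q(s_{t+n}, a'; theta^-).

    rewards:      list [r_{t+1}, ..., r_{t+n}] collected by a worker
    bootstrap_q:  array of target-network values Q(s_{t+n}, .; theta^-)
    """
    n = len(rewards)
    discounted = sum(gamma ** i * r for i, r in enumerate(rewards))
    return discounted + gamma ** n * np.max(bootstrap_q)

# example: 3-step target with two actions at the bootstrap state
print(n_step_target([0.0, 1.0, 0.0], np.array([0.5, 2.0])))
```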
Distributional RL

Analogous to the Bellman equation of $Q^\pi$, Bellemare, Dabney, and Munos (2017) proposed the distributional Bellman equation for the state-action value distribution $Z^\pi$ given a policy $\pi$ in the policy evaluation setting,
$$Z^\pi(s, a) \overset{D}{=} R(s, a) + \gamma Z^\pi(s', a'), \quad s' \sim p(\cdot \mid s, a), \ a' \sim \pi(\cdot \mid s'),$$
where $X \overset{D}{=} Y$ means the two random variables $X$ and $Y$ are distributed according to the same law. Bellemare, Dabney, and Munos (2017) also proposed a distributional Bellman optimality operator $\mathcal{T}$ for control,
$$\mathcal{T} Z(s, a) \overset{D}{=} R(s, a) + \gamma Z\big(s', \arg\max_{a'} \mathbb{E}[Z(s', a')]\big), \quad s' \sim p(\cdot \mid s, a).$$
When making decisions, the action selection is still based on the expected state-action value (i.e., $Q$). Given the optimality operator, we now need a representation for $Z$. Dabney et al. (2017) proposed to approximate $Z(s, a)$ by a set of quantiles. The distribution of $Z$ is represented by a uniform mix of $N$ supporting quantiles:
$$Z_\theta(s, a) \doteq \frac{1}{N} \sum_{i=1}^{N} \delta_{q_i(s, a; \theta)},$$
where $\delta_x$ denotes a Dirac at $x \in \mathbb{R}$, and each $q_i$ is an estimation of the quantile corresponding to the quantile level (a.k.a. quantile index) $\hat{\tau}_i \doteq \frac{\tau_{i-1} + \tau_i}{2}$ with $\tau_i \doteq \frac{i}{N}$ for $0 \leq i \leq N$. The state-action value $Q(s, a)$ is then approximated by $\frac{1}{N} \sum_{i=1}^{N} q_i(s, a)$. Such an approximation of a distribution is referred to as quantile approximation. The quantile estimations (i.e., $\{q_i\}$) are trained via the Huber quantile regression loss (Huber 1964). To be more specific, at time step $t$ the loss is
$$\sum_{i=1}^{N} \frac{1}{N} \sum_{i'=1}^{N} \Big[ \rho^{\kappa}_{\hat{\tau}_i} \big( y_{t, i'} - q_i(s_t, a_t) \big) \Big],$$
where $y_{t, i'} \doteq r_t + \gamma q_{i'}\big(s_{t+1}, \arg\max_{a'} \frac{1}{N}\sum_{i} q_i(s_{t+1}, a')\big)$ and $\rho^{\kappa}_{\hat{\tau}}(u) \doteq |\hat{\tau} - \mathbb{I}\{u < 0\}| L_\kappa(u)$, where $\mathbb{I}$ is the indicator function and $L_\kappa$ is the Huber loss,
$$L_\kappa(u) \doteq \begin{cases} \frac{1}{2} u^2 & \text{if } |u| \leq \kappa \\ \kappa \big( |u| - \frac{1}{2}\kappa \big) & \text{otherwise.} \end{cases}$$
The resulting algorithm is the Quantile Regression DQN (QR-DQN). QR-DQN also uses experience replay and a target network, similar to DQN. Dabney et al. (2017) showed that quantile approximation has better empirical performance than the earlier categorical approximation (Bellemare, Dabney, and Munos 2017). More recently, Dabney et al. (2018) approximated the distribution by learning a quantile function directly with the Implicit Quantile Network, resulting in a further performance boost. Distributional RL has enjoyed great success in various domains (Bellemare, Dabney, and Munos 2017; Dabney et al. 2017; Hessel et al. 2017; Barth-Maron et al. 2018; Dabney et al. 2018).
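To make the quantile approximation and its loss concrete, below is a minimal NumPy sketch of the Huber quantile regression loss for a single transition, with quantile estimates stored in plain arrays rather than a network; the shapes and variable names are illustrative assumptions.

```python
import numpy as np

def huber(u, kappa=1.0):
    """Huber loss L_kappa(u)."""
    return np.where(np.abs(u) <= kappa,
                    0.5 * u ** 2,
                    kappa * (np.abs(u) - 0.5 * kappa))

def quantile_huber_loss(q_pred, r, q_next, gamma=0.99, kappa=1.0):
    """QR-DQN loss for one transition.

    q_pred: (N,)   quantile estimates q_i(s_t, a_t)
    q_next: (A, N) target-network quantile estimates q_i(s_{t+1}, a) per action
    """
    N = q_pred.shape[0]
    tau_hat = (np.arange(N) + 0.5) / N               # quantile levels tau_hat_i
    a_star = np.argmax(q_next.mean(axis=1))          # greedy w.r.t. the mean
    y = r + gamma * q_next[a_star]                   # targets y_{t,i'}, shape (N,)
    u = y[None, :] - q_pred[:, None]                 # pairwise TD errors, (N, N)
    rho = np.abs(tau_hat[:, None] - (u < 0)) * huber(u, kappa)
    return rho.mean(axis=1).sum()                    # sum_i (1/N) sum_i' rho

# example with N = 5 quantiles and 2 actions
rng = np.random.default_rng(0)
print(quantile_huber_loss(rng.normal(size=5), 1.0, rng.normal(size=(2, 5))))
```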
Deterministic Policy
Silver et al. (2014) used a deterministic policy for continuous control problems with linear function approximation, and Lillicrap et al. (2015) extended it with deep networks, resulting in the Deep Deterministic Policy Gradient (DDPG) algorithm. DDPG is an off-policy algorithm. It has an actor $\mu$ and a critic $Q$, parameterized by $\theta^\mu$ and $\theta^Q$ respectively. At each time step, $\theta^Q$ is updated to minimize
$$\frac{1}{2}\big( r + \gamma Q(s', \mu(s')) - Q(s, a) \big)^2,$$
and the policy gradient for $\theta^\mu$ in DDPG is
$$\nabla_{\theta^\mu} Q\big(s, \mu(s; \theta^\mu)\big).$$
This gradient update follows from the chain rule of gradient ascent w.r.t. $Q(s, \mu(s))$, where $\mu(s)$ is interpreted as an approximation to $\arg\max_a Q(s, a)$. Silver et al. (2014) provided policy improvement guarantees for this gradient.
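The snippet below is a minimal, self-contained sketch of one DDPG update step for a single transition, written with PyTorch so that the chain rule above is handled by automatic differentiation; the tiny network sizes, dimensions, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

state_dim, action_dim, gamma = 3, 1, 0.99

# small actor mu(s) and critic Q(s, a), used only for illustration
actor = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(),
                      nn.Linear(32, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 32), nn.ReLU(),
                       nn.Linear(32, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next):
    # critic: minimize 0.5 * (r + gamma * Q(s', mu(s')) - Q(s, a))^2
    with torch.no_grad():
        target = r + gamma * critic(torch.cat([s_next, actor(s_next)], dim=-1))
    critic_loss = 0.5 * (target - critic(torch.cat([s, a], dim=-1))).pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # actor: gradient ascent on Q(s, mu(s)) w.r.t. the actor parameters
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

# one illustrative update with a fabricated transition
s = torch.randn(1, state_dim)
a = torch.randn(1, action_dim)
ddpg_update(s, a, torch.tensor([[1.0]]), torch.randn(1, state_dim))
```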
Option

An option (Sutton, Precup, and Singh 1999) is a temporal abstraction of action. Each option $\omega \in \Omega$ is a triple $(\mathcal{I}_\omega, \pi_\omega, \beta_\omega)$, where $\Omega$ is the option set. We use $\mathcal{I}_\omega$ to denote the initiation set for the option $\omega$, describing where the option can be initiated. We use $\pi_\omega$ to denote the intra-option policy of $\omega$. Once the agent has committed to the option $\omega$, it chooses actions based on $\pi_\omega$. We use $\beta_\omega$ to denote the termination function for the option $\omega$. At each time step $t$, the option $\omega$ terminates with probability $\beta_\omega(S_t)$. In this paper, we consider the call-and-return option execution model (Sutton, Precup, and Singh 1999), where an agent commits to an option $\omega$ until $\omega$ terminates according to $\beta_\omega$. The option value function $Q(s, \omega)$ is used to describe the utility of an option $\omega$ at state $s$, and we can learn this function via intra-option Q-learning (Sutton, Precup, and Singh 1999). The update is
$$Q(s_t, \omega_t) \leftarrow Q(s_t, \omega_t) + \alpha \Big( r_{t+1} + \gamma \big( (1 - \beta_{\omega_t}(s_{t+1})) Q(s_{t+1}, \omega_t) + \beta_{\omega_t}(s_{t+1}) \max_{\omega'} Q(s_{t+1}, \omega') \big) - Q(s_t, \omega_t) \Big),$$
where $\alpha$ is a step size and $(s_t, \omega_t, r_{t+1}, s_{t+1})$ is a transition in the cycle during which the agent is committed to the option $\omega_t$.
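Below is a minimal NumPy sketch of this intra-option Q-learning update for a single transition under the call-and-return model; the tabular option-value array and the termination probabilities are illustrative assumptions.

```python
import numpy as np

def intra_option_q_update(Q, s, omega, r, s_next, beta, alpha=0.1, gamma=0.99):
    """One intra-option Q-learning update.

    Q:    (num_states, num_options) option-value table Q(s, omega)
    beta: (num_states, num_options) termination probabilities beta_omega(s)
    """
    continue_value = (1.0 - beta[s_next, omega]) * Q[s_next, omega]
    switch_value = beta[s_next, omega] * np.max(Q[s_next])
    target = r + gamma * (continue_value + switch_value)
    Q[s, omega] += alpha * (target - Q[s, omega])
    return Q

# example with 4 states and 2 options
Q = np.zeros((4, 2))
beta = np.full((4, 2), 0.25)
intra_option_q_update(Q, s=0, omega=1, r=1.0, s_next=2, beta=beta)
print(Q)
```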
A Failure of Mean
Figure 1: (a) Two Markov chains illustrating the inefficiency of mean-based decision making. (b), (c) The steps required to find the optimal policy for each algorithm vs. the chain length, for Chain 1 and Chain 2 respectively. The required steps are averaged over 10 trials, and standard errors are plotted as error bars. Each trial is capped at a fixed maximum number of steps.

We now present two simple Markov chains (Figure 1) to illustrate that mean-based RL algorithms can fail to explore efficiently.
Chain 1 has $N$ non-terminal states and two actions {LEFT, UP}. The agent starts at state 1 in each episode. The action UP ends the episode immediately with reward 0. The action LEFT leads to the next state, with a reward sampled from a zero-mean normal distribution. Once the agent reaches the terminal state G, the episode ends with a positive reward. There is no discounting. The optimal policy is to always move left.

We first consider tabular Q-learning with $\epsilon$-greedy exploration. To learn the optimal policy, the agent has to reach the state G first. Unfortunately, this is particularly difficult for Q-learning. The difficulty comes from two aspects. First, due to the $\epsilon$-greedy mechanism, the agent sometimes selects UP at random. The episode then ends immediately, and the agent has to wait for the next episode. Second, before the agent reaches the state G, the expected return of either LEFT or UP at any state is 0, so the agent cannot distinguish between the two actions under the mean criterion because the expected returns are the same. As a result, the agent cannot benefit from the value estimation, a mean, at all.
Suppose now that the agent learns the distribution of the returns of LEFT and UP. Before reaching the state G, the learned action-value distribution of LEFT is a normal distribution with mean 0. A high quantile of this distribution is greater than 0, which is an optimistic estimation. If the agent behaves according to this optimistic estimation, it can quickly reach the state G and find the optimal policy.
Chain 2 has the same state space and action space as Chain 1. However, the reward for LEFT is now a small negative constant, except that reaching the state G gives a positive reward. The reward for UP is sampled from a zero-mean normal distribution. There is no discounting. When the penalty for LEFT is small in magnitude, the optimal policy is still to always move left. Before reaching G, the estimated expected return of LEFT at any non-terminal state is less than 0, which means a Q-learning agent would prefer UP. This preference is harmful in this chain, as UP ends the episode immediately, which prevents further exploration.
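For concreteness, here is a small NumPy sketch of the two chains as environments with the interface assumed by the earlier Q-learning snippet; the chain length, the goal reward of +10, the per-step penalty of -0.1, and the unit-variance reward noise are illustrative assumptions, since the exact constants are not specified above.

```python
import numpy as np

LEFT, UP = 0, 1

class Chain:
    """Chains 1 and 2: states 0..n-1 are non-terminal, state n is the goal G."""

    def __init__(self, n=10, variant=1, seed=0):
        self.n, self.variant = n, variant
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        if a == UP:  # episode ends immediately
            r = 0.0 if self.variant == 1 else self.rng.normal(0.0, 1.0)
            return self.s, r, True
        # LEFT: move one state closer to G
        self.s += 1
        if self.s == self.n:                          # reached G
            return self.s, 10.0, True                 # assumed goal reward
        r = self.rng.normal(0.0, 1.0) if self.variant == 1 else -0.1
        return self.s, r, False
```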
We now present experimental results of four algorithms in the two chains: Q-learning, quantile regression Q-learning (QR, the tabular version of QR-DQN), optimistic quantile regression Q-learning (OQR), and pessimistic quantile regression Q-learning (PQR). OQR / PQR is the same as QR except that the behavior policy is always derived from a high / low quantile, rather than the mean, of the state-action value distribution. We used $\epsilon$-greedy behavior policies for all of the above algorithms. We measured the number of steps each algorithm needs to find the optimal policy. The agent is said to have found the optimal policy at time step $t$ if and only if the policy derived from the $Q$ estimation (for Q-learning) or the mean of the quantile estimation (for the other algorithms) at time step $t$ is to move left at all non-terminal states.
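The following sketch shows the only difference between QR, OQR, and PQR in this setting: which statistic of the learned quantile estimates the $\epsilon$-greedy behavior policy is derived from. The tabular quantile array and its shape are assumptions for illustration.

```python
import numpy as np

def select_action(quantiles, s, mode="mean", epsilon=0.1,
                  rng=np.random.default_rng()):
    """Epsilon-greedy action selection from tabular quantile estimates.

    quantiles: (num_states, num_actions, N) quantile estimates q_i(s, a)
    mode:      "mean" (QR), "optimistic" (OQR, highest quantile),
               or "pessimistic" (PQR, lowest quantile)
    """
    num_actions = quantiles.shape[1]
    if rng.random() < epsilon:
        return rng.integers(num_actions)
    if mode == "mean":
        scores = quantiles[s].mean(axis=-1)      # mean of the value distribution
    elif mode == "optimistic":
        scores = quantiles[s, :, -1]             # highest quantile estimate
    else:
        scores = quantiles[s, :, 0]              # lowest quantile estimate
    return int(np.argmax(scores))

# example: 3 quantiles per action, two actions (LEFT = 0, UP = 1)
q = np.array([[[-1.0, 0.0, 1.0], [0.0, 0.0, 0.0]]])  # state 0
print(select_action(q, 0, mode="optimistic", epsilon=0.0))  # prefers LEFT (action 0)
```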
All the algorithms were implemented with a tabular state representation, and the exploration parameter $\epsilon$ was fixed throughout training. For quantile regression, we used 3 quantiles. We varied the chain length and tracked the number of steps each algorithm needed to find the optimal policy. Figures 1b and 1c show the results in Chain 1 and Chain 2 respectively. Figure 1b shows that the best algorithm for Chain 1 was OQR, where a high quantile is used to derive the behavior policy, indicating an optimistic exploration strategy; PQR performed poorly as the chain length increased. Figure 1c shows that the best algorithm for Chain 2 was PQR, where a low quantile is used to derive the behavior policy, indicating a pessimistic exploration strategy; OQR performed poorly as the chain length increased. The mean-based algorithms (Q-learning and QR) performed poorly in both chains. Although QR did learn the distribution information, it did not use that information properly for exploration.
The results in the two chains show that quantiles influence exploration efficiency, and that quantile-based action selection can improve exploration if the quantile is properly chosen. The results also demonstrate that different tasks need different quantiles; no quantile is universally the best. As a result, a high-level policy for quantile selection is necessary.
The Quantile Option Architecture
We now introduce QUOTA for discrete-action problems. We have $N$ quantile estimations $\{q_i\}_{i=1,\dots,N}$ for the quantile levels $\{\hat{\tau}_i\}_{i=1,\dots,N}$. We construct $M$ options $\{\omega_j\}_{j=1,\dots,M}$. For simplicity, in this paper all the options share the same initiation set $\mathcal{S}$ and the same termination function $\beta$, which is a constant function. We use $\pi_{\omega_j}$ to denote the intra-option policy of an option $\omega_j$. $\pi_{\omega_j}$ proposes actions based on the mean of the $j$-th window of quantiles, where $j \in \{1, \dots, M\}$. (We assume $N$ is divisible by $M$ for simplicity.) Here $K \doteq N / M$ represents a window size, and we have $M$ windows in total. To be more specific, in order to compose the $j$-th option, we first define a state-action value function $Q_j$ by averaging a local window of $K$ quantiles:
$$Q_j(s, a) \doteq \frac{1}{K} \sum_{i=(j-1)K + 1}^{jK} q_i(s, a).$$
We then define the intra-option policy of the $j$-th option to be the $\epsilon$-greedy policy derived from $Q_j$.
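A minimal sketch of this option construction follows, assuming the quantile estimates for a state are available as an array of shape (num_actions, N); the window averaging and the $\epsilon$-greedy intra-option policy follow the definitions above, while the function and variable names are illustrative.

```python
import numpy as np

def option_values(quantiles, M):
    """Q_j(s, a) = mean of the j-th window of K = N / M quantiles.

    quantiles: (num_actions, N) quantile estimates q_i(s, a) for one state
    returns:   (M, num_actions) array of window-averaged values Q_j(s, .)
    """
    num_actions, N = quantiles.shape
    K = N // M                                   # window size (N divisible by M)
    windows = quantiles.reshape(num_actions, M, K)
    return windows.mean(axis=-1).T               # (M, num_actions)

def intra_option_action(quantiles, j, M, epsilon=0.1,
                        rng=np.random.default_rng()):
    """Epsilon-greedy intra-option policy of option omega_j, derived from Q_j."""
    Q = option_values(quantiles, M)
    if rng.random() < epsilon:
        return rng.integers(quantiles.shape[0])
    return int(np.argmax(Q[j - 1]))              # options indexed 1..M

# example: 2 actions, N = 6 quantiles, M = 3 options (window size K = 2)
q = np.array([[-2.0, -1.0, 0.0, 1.0, 2.0, 3.0],
              [ 0.0,  0.0, 0.0, 0.0, 0.0, 0.0]])
print(option_values(q, M=3))                     # low / middle / high windows
print(intra_option_action(q, j=3, M=3, epsilon=0.0))  # optimistic option picks action 0
```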