A3C style Option-Critic with deliberation cost
Recent work has shown that temporally extended actions (options) can be learned fully end-to-end as opposed to being specified in advance. While the problem of "how" to learn options is increasingly well understood, the question of "what" good options should be has remained elusive. We formulate our answer to what "good" options should be in the bounded rationality framework (Simon, 1957) through the notion of deliberation cost. We then derive practical gradient-based learning algorithms to implement this objective. Our results in the Arcade Learning Environment (ALE) show increased performance and interpretability.
Temporal abstraction has a rich history in AI [Minsky1961, Fikes et al.1972, Kuipers1979, Korf1983, Iba1989, Drescher1991, Dayan and Hinton1992, Kaelbling1993, Thrun and Schwartz1995, Parr and Russell1998, Dietterich1998]
and has been presented as a useful mechanism for a variety of problems that affect AI systems in many settings, including the ability to generate shorter plans, speed up planning, improve generalization, yield better exploration, and increase robustness against model mis-specification or partial observability. In reinforcement learning, options [Sutton et al.1999b] provide a framework to represent, learn and plan with temporally extended actions. Interest in temporal abstraction in reinforcement learning has increased substantially in the last couple of years, due to increasing success in constructing such abstractions automatically from data, e.g. [Bacon et al.2017, Kulkarni et al.2016, Daniel et al.2016, Mankowitz et al.2016, Machado et al.2017]. However, defining what constitutes a good set of options remains an open problem.
In this paper, we aim to leverage the bounded rationality framework [Simon1957]
in order to explain what would make good temporal abstractions for an RL system. A lot of existing reinforcement learning work has focused on Markov Decision Processes, where optimal policies can be obtained under certain assumptions. However, optimality does not take into account possible resource limitations of the agent, which is assumed to have access to a lot of data and computation time. Indeed, options help agents overcome such limitations, by allowing policies to be computed faster [Dietterich1998, Precup2000]. However, from the point of view of absolute optimality, temporal abstractions are not necessary: the optimal policy is achievable with primitive actions alone. Therefore, it has been difficult to formalize in what precise theoretical sense temporally abstract actions are helpful.
Bounded rationality is a very important framework for understanding rationality in both natural and artificial systems. In this paper, we propose bounded rationality as a lens through which we can describe the desiderata for constructing temporal abstractions, as their goal is mainly to help agents which are restricted in terms of computation time. This perspective helps us to formulate more precisely which objective criteria should be fulfilled during option construction. We propose that good options are those which allow an agent to learn and plan faster, and provide an optimization objective for learning options based on this idea. We implement the optimization using the option-critic framework [Bacon et al.2017] and illustrate its usefulness with experiments in Atari games.
A finite discounted Markov Decision Process is a tuple $(\mathcal{S}, \mathcal{A}, r, P, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state and action sets respectively, and $\gamma \in [0, 1)$ is a discount factor. The reward function $r$ is often assumed to be a deterministic function of the state and actions, but can also map to a distribution, $r : \mathcal{S} \times \mathcal{A} \to \mathrm{Dist}(\mathbb{R})$ (a perspective which we use in our formulation). The transition matrix $P(s' \mid s, a)$ is a conditional distribution over next states given that an action $a$ is taken in a certain state $s$. The interaction of a randomized stationary policy $\pi(a \mid s)$ or a deterministic policy $\pi : \mathcal{S} \to \mathcal{A}$ with an MDP induces a Markov process over states, actions and rewards, over which is defined the expected discounted return $\mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t r(S_t, A_t)\big]$. The value function $V_\pi$ of a policy satisfies the Bellman equations:

$$V_\pi(s) = \sum_a \pi(a \mid s)\Big(r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V_\pi(s')\Big).$$
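As a concrete illustration of the Bellman equations above, the following sketch runs iterative policy evaluation on a tiny two-state, two-action MDP. All transition probabilities, rewards and the policy are invented for the example; this is not a model from the paper.

```python
# Iterative policy evaluation on an invented 2-state, 2-action MDP.
GAMMA = 0.9

# P[s][a] = list of (next_state, probability); r[s][a] = expected reward
P = {0: {0: [(0, 0.8), (1, 0.2)], 1: [(1, 1.0)]},
     1: {0: [(0, 1.0)],           1: [(1, 0.5), (0, 0.5)]}}
r = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.5, 1: 0.0}}
pi = {0: {0: 0.5, 1: 0.5},   # randomized stationary policy pi(a|s)
      1: {0: 1.0, 1: 0.0}}

def evaluate(pi, n_iters=1000):
    # Repeatedly apply the Bellman operator until V is (numerically) a fixed point.
    V = {s: 0.0 for s in P}
    for _ in range(n_iters):
        V = {s: sum(pi[s][a] * (r[s][a] + GAMMA * sum(p * V[sp] for sp, p in P[s][a]))
                    for a in P[s])
             for s in P}
    return V

V = evaluate(pi)
```

Since the Bellman operator is a $\gamma$-contraction, the iteration converges to the unique solution of the evaluation equations.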
In the control problem, we are interested in finding an optimal policy for a given MDP. A policy $\pi^*$ is said to be optimal if $V_{\pi^*}(s) \geq V_\pi(s)$ for all policies $\pi$ and all states $s$.
An important class of control methods in reinforcement learning is based on the actor-critic architecture [Sutton1984]. In the same way that function approximation can be used for value functions, policies can also be approximated within a parameterized family which is searched over. In the policy gradient theorem, [Sutton et al.1999a] shows that the gradient of the expected discounted return with respect to the parameters $\theta$ of a policy $\pi_\theta$ is of the form:

$$\frac{\partial J}{\partial \theta} = \sum_s \mu_\pi(s) \sum_a \frac{\partial \pi_\theta(a \mid s)}{\partial \theta}\, Q_\pi(s, a),$$

where $\mu_\pi$ is a discounted weighting of states encountered from an initial state distribution. A locally optimal policy can then be found by stochastic gradient ascent over the policy parameters while simultaneously learning the action-value function (usually by TD).
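The likelihood-ratio form of this gradient can be sketched on the simplest possible case, a one-state MDP (a two-armed bandit) with a softmax policy. The action values, step size and iteration count below are invented, and the true values stand in for a learned critic:

```python
# Stochastic policy gradient on an invented two-armed bandit.
import math, random

random.seed(0)
Q = [1.0, 2.0]        # "critic": true action values (arm 1 is better)
theta = [0.0, 0.0]    # softmax policy parameters, one per action
ALPHA = 0.1

def pi(theta):
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [p / s for p in z]

for _ in range(2000):
    probs = pi(theta)
    a = random.choices([0, 1], weights=probs)[0]
    # grad of log pi(a) w.r.t. theta_k for a softmax is (1[k == a] - pi(k))
    for k in range(2):
        theta[k] += ALPHA * ((1.0 if k == a else 0.0) - probs[k]) * Q[a]
```

After training, the policy places most of its probability mass on the higher-valued arm, as gradient ascent on the expected return predicts.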
Options [Sutton et al.1999b] provide a framework for representing, planning and learning with temporally abstract actions. The options framework assumes the existence of a base MDP on which are overlaid temporally abstract actions called options. An option $\omega$ is defined as a triple $(\mathcal{I}_\omega, \pi_\omega, \beta_\omega)$ where $\mathcal{I}_\omega \subseteq \mathcal{S}$ is an initiation set, $\pi_\omega$ is the policy of the option (which can also be deterministic) and $\beta_\omega : \mathcal{S} \to [0, 1]$ is a termination condition. In the call-and-return execution model, a policy over options $\pi_\Omega$ (deterministic if desired) chooses an option among those which can be initiated in a given state and executes the policy of that option until termination. Once the chosen option has terminated, the policy over options chooses a new option and the process is repeated until the end of the episode.
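The call-and-return loop can be sketched in a few lines. The environment, option policies and termination probabilities below are all invented toys, chosen only to make the control flow of the execution model explicit:

```python
# Call-and-return execution with invented options on a toy 5-state chain.
import random

random.seed(1)

def run_episode(n_steps, policy_over_options, option_policies, terminations, s0=0):
    s, trace = s0, []
    omega = policy_over_options(s)            # initial deliberation
    for _ in range(n_steps):
        a = option_policies[omega](s)         # the option's internal policy acts
        trace.append((s, omega, a))
        s = (s + a) % 5                       # toy deterministic dynamics
        if random.random() < terminations[omega](s):
            omega = policy_over_options(s)    # option terminated: re-deliberate
    return trace

trace = run_episode(
    n_steps=20,
    policy_over_options=lambda s: random.choice([0, 1]),
    option_policies={0: lambda s: 1, 1: lambda s: 2},
    terminations={0: lambda s: 0.1, 1: lambda s: 0.5},
)
```

Note that the policy over options is only consulted when the current option terminates; in between, control stays inside the option.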
This execution model gives rise to a semi-Markov decision process (SMDP) in which the transition time between two decision points is a random variable. When considering the induced process only at the level of state-option pairs, usual dynamic programming results can be reused after a transformation to an equivalent MDP [Puterman1994]. To see this, we need to define two kinds of models for every option $\omega$: a reward model $b_\omega(s)$ and a transition model $F_\omega(s', s)$. If an option does not depend on the history since initiation, we can write its models either in closed form or as the solution to Bellman-like equations [Sutton et al.1999b]. The expected discounted return associated with a set of options and a policy over them is the solution to a set of Bellman equations:

$$V_\Omega(s) = \sum_\omega \pi_\Omega(\omega \mid s)\Big(b_\omega(s) + \sum_{s'} F_\omega(s', s)\, V_\Omega(s')\Big),$$
where $\Omega$ is a concatenation of the policy over options $\pi_\Omega$, the option policies $\pi_\omega$ and the termination conditions $\beta_\omega$.
In the case of Markov options, there exists another form for the Bellman equations, called the intra-option Bellman equations [Sutton et al.1999b], which are key for deriving gradient-based algorithms for learning options.
Let $(S_t, \Omega_t)$ be a random variable over state-option tuples. We call the space of state-option pairs the augmented state space. This augmentation is sufficient to provide the Markov property, which would otherwise be lost when considering the process at the flat level of state-action pairs [Sutton et al.1999b]. The transition matrix of the Markov process over the augmented state space [Bacon et al.2017] is given by:

$$P\big((s', \omega') \mid (s, \omega)\big) = \sum_a \pi_\omega(a \mid s)\, P(s' \mid s, a)\Big((1 - \beta_\omega(s'))\,\mathbf{1}_{\omega' = \omega} + \beta_\omega(s')\, \pi_\Omega(\omega' \mid s')\Big).$$
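To make the structure of this augmented transition matrix concrete, the sketch below builds it for an invented two-state, two-option toy (uniform dynamics, option policies, terminations and policy over options, all made up) and checks that each row sums to one, i.e. that the augmented process is a valid Markov chain:

```python
# Augmented-state transition matrix for an invented toy options model.
from itertools import product

S, OPTIONS, A = [0, 1], [0, 1], [0, 1]

P = {(s, a): {0: 0.5, 1: 0.5} for s, a in product(S, A)}            # base dynamics
pi_o = {(o, s): {0: 0.5, 1: 0.5} for o, s in product(OPTIONS, S)}   # option policies
beta = {(o, s): 0.3 for o, s in product(OPTIONS, S)}                # terminations
mu = {s: {0: 0.5, 1: 0.5} for s in S}                               # policy over options

def augmented_P(s, o, s2, o2):
    total = 0.0
    for a in A:
        # with prob (1 - beta) the option continues; with prob beta it
        # terminates and the policy over options picks the successor
        cont = (1 - beta[(o, s2)]) * (1.0 if o2 == o else 0.0)
        switch = beta[(o, s2)] * mu[s2][o2]
        total += pi_o[(o, s)][a] * P[(s, a)][s2] * (cont + switch)
    return total

row = sum(augmented_P(0, 0, s2, o2) for s2, o2 in product(S, OPTIONS))
```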
Using this chain structure, we can define the MDP whose associated value function is:

$$Q_\Omega(s, \omega) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t\, r(S_t, A_t) \,\middle|\, S_0 = s,\, \Omega_0 = \omega\right].$$
Since the rewards come from the base (primitive) MDP, we can simply write $r\big((s, \omega), a\big) = r(s, a)$ and, because the value function over the augmented chain satisfies the usual one-step Bellman equations, we get:

$$Q_\Omega(s, \omega) = \sum_a \pi_\omega(a \mid s)\Big(r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \sum_{\omega'} P\big(\omega' \mid (s, \omega), s'\big)\, Q_\Omega(s', \omega')\Big). \tag{1}$$
Hence, when taking the expectation in (1) over the next option, we obtain:

$$Q_\Omega(s, \omega) = \sum_a \pi_\omega(a \mid s)\Big(r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, U(\omega, s')\Big), \tag{2}$$

where $A_\Omega(s, \omega) = Q_\Omega(s, \omega) - V_\Omega(s)$ is the advantage function [Baird1993]. The equations in (2) correspond exactly to the intra-option Bellman equations [Sutton et al.1999b]. However, we chose to present them under an alternate – but more convenient – form highlighting a connection to the advantage function:

$$U(\omega, s') = (1 - \beta_\omega(s'))\, Q_\Omega(s', \omega) + \beta_\omega(s')\, V_\Omega(s') = Q_\Omega(s', \omega) - \beta_\omega(s')\, A_\Omega(s', \omega),$$

where $U(\omega, s')$ represents the utility of continuing with the same option or switching to a better one.
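The two forms of the utility term are algebraically identical; a minimal numerical check (with arbitrary invented values for $Q$, $V$ and $\beta$):

```python
# The utility of arriving in s' under option omega, written two ways:
# as a termination mixture, and via the advantage A = Q - V.

def utility_mixture(q, v, b):
    return (1 - b) * q + b * v        # (1 - beta) * Q + beta * V

def utility_advantage(q, v, b):
    return q - b * (q - v)            # Q - beta * A

q, v, b = 3.2, 2.5, 0.4               # invented values
```

The advantage form makes explicit that termination only matters to the extent that the current option is worse than re-deliberating.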
The option-critic architecture [Bacon et al.2017] is a gradient-based actor-critic architecture for learning options end-to-end. As in actor-critic methods, the idea is to parametrize the option policies and termination conditions and learn their parameters jointly by stochastic gradient ascent on the expected discounted return. [Bacon et al.2017] provided the form of the gradients for both the option policies and termination functions under the assumption that options are available everywhere. In the following, we further assume that the parameter vector is partitioned into disjoint sets of parameters for the policy over options, the option policies and the termination functions.
In the gradient theorem for option policies [Bacon et al.2017], the result maintains the same form as that of the original policy gradient theorem for MDPs [Sutton et al.1999a], but over the augmented state space. If $J$ is the expected discounted return for the set of options and the policy over them, then the gradient with respect to the option-policy parameters $\theta$ (which are independent from the terminations) is:

$$\frac{\partial J}{\partial \theta} = \sum_{s, \omega} \mu_\Omega(s, \omega) \sum_a \frac{\partial \pi_\omega(a \mid s)}{\partial \theta}\, Q_U(s, \omega, a),$$

where $\mu_\Omega$ is a discounted weighting of state-option pairs under an initial distribution over states and options, and $Q_U(s, \omega, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, U(\omega, s')$.
To obtain the gradient for the termination functions, let’s first take the derivative of the intra-option Bellman equations with respect to the termination parameters $\nu$:

$$\frac{\partial Q_\Omega(s, \omega)}{\partial \nu} = \sum_a \pi_\omega(a \mid s)\, \gamma \sum_{s'} P(s' \mid s, a)\, \frac{\partial U(\omega, s')}{\partial \nu}. \tag{3}$$

By noticing the similarity between (3) and (1), we can easily solve for the recursive form of the derivative. Indeed, it suffices to see that $-\frac{\partial \beta_\omega(s')}{\partial \nu} A_\Omega(s', \omega)$ plays the role of the “reward” term in the usual Bellman equations (see [Bacon et al.2017] for a detailed proof) and conclude that:

$$\frac{\partial J}{\partial \nu} = -\sum_{s', \omega} \mu_\Omega(s', \omega)\, \frac{\partial \beta_\omega(s')}{\partial \nu}\, A_\Omega(s', \omega). \tag{4}$$
Hence the termination gradient shows that if an option is advantageous, the probability of termination should be lowered, making that option longer. Conversely, if the value of an option is less than what could be achieved through a different choice of option at a given state, the termination gradient will make it more likely to terminate at this state. The termination gradient has the same structure as the interruption operator [Mann et al.2014] in the interruption execution model [Sutton et al.1999b]. Rather than executing the policy of an option irrevocably until termination, interruption execution consists in choosing a new option whenever $Q_\Omega(s, \omega) < V_\Omega(s)$. Moving the value function to the left-hand side, interruption execution can also be understood in terms of the advantage function: an option is interrupted whenever $A_\Omega(s, \omega) < 0$. As for the termination gradient, interruption execution leads to the termination of an option whenever there is no advantage (negative advantage) in maintaining it. Interestingly, [Mann et al.2014] also considered adding a scalar regularizer to the advantage function to favor longer options. From the more general perspective of bounded rationality, we also recover this regularizer, but within a larger family which follows from the notion of deliberation cost.
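The interruption rule with the scalar regularizer of [Mann et al.2014] reduces to a one-line test. A hedged sketch (the function name and Q-values are invented; `xi` denotes the regularizer):

```python
# Interruption execution with a scalar margin xi: the current option is
# interrupted only when its advantage falls below -xi, so small
# disadvantages are tolerated and options last longer.

def should_interrupt(q_s_omega, v_s, xi=0.0):
    advantage = q_s_omega - v_s
    return advantage < -xi

# xi = 0 recovers plain interruption: any negative advantage terminates.
```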
From a representation learning perspective, good options ought to allow an agent to learn and plan faster [Minsky1961]. Due to their temporal structure, options offer a mechanism through which an agent can make better use of its limited computational resources and act faster. Once an option has been chosen, we assume that the computational cost of executing that option is negligible or constant until termination. After deliberating on the choice of option, an agent can relax thanks to the fast – but perhaps imperfect – knowledge compiled within its policy.
This perspective on options is similar to fast and frugal heuristics [Gigerenzer and Selten2001], which form a decision repertoire for efficient decision making under limited resources. Our assumption on the cost structure is also consistent with models of the prefrontal areas [Botvinick et al.2009, Solway et al.2014] presenting decision making over options as a slower model-based planning process, as opposed to the fast and habitual learning taking place within an option. When planning with options (in computers), there is also a cost for deciding which option to choose next by making predictions based on their models. For example, option models could be given by deep networks, necessitating back-and-forth transfers to the GPU, or by a simulator with costly explicit rollouts [Guo et al.2014, Mann et al.2015].
Bounded rationality can also be useful to understand how efficient communication can take place between two agents over a limited channel [Neyman1985]. Options offer a mechanism for communicating intents to and from an agent [Branavan et al.2012, Andreas et al.2017] more efficiently, by compressing the information into a simpler form, sending only the identifier of the options and not the details themselves. Having longer options is a way to provide better interpretability and simplifies communication and understanding by compressing information.
Consider the cost model (fig. 1) in which executing an action within an option is free, but switching to a new option upon arriving in a new state incurs a cost $\eta$. To build some intuition, let’s further assume that the termination function of an option is constant over all states. If $\lambda$ is the continuation probability of that option, its expected duration is $\frac{1}{1 - \lambda}$. When a fixed cost $\eta$ is incurred upon termination, the average cost per step for that option is then $\eta (1 - \lambda)$. Hence, as the probability of continuation increases and options get longer, the cost rate decreases. Conversely, if an option only terminates after one step – a primitive option – $\lambda$ is $0$ and the cost rate is $\eta$. The fact that longer options lead to a better amortization of the deliberation cost is key to understanding their benefit in comparison to using only primitive actions.
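The amortization argument is simple enough to verify numerically. With a constant continuation probability, durations are geometric, so a fixed switching cost spreads over the expected duration (the cost value below is invented):

```python
# Amortized deliberation cost for an option with constant continuation
# probability lam: durations are geometric with mean 1 / (1 - lam), so a
# fixed switching cost eta amortizes to eta * (1 - lam) per step.

def expected_duration(lam):
    return 1.0 / (1.0 - lam)

def cost_rate(eta, lam):
    return eta / expected_duration(lam)   # = eta * (1 - lam)

ETA = 1.0
primitive_rate = cost_rate(ETA, 0.0)      # terminates every step: pays full cost
long_option_rate = cost_rate(ETA, 0.9)    # long option: a tenth of the cost
```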
In addition to the value function for the base MDP and options over them, we define an immediate cost function $c(s, \omega)$ and a corresponding deliberation cost function $D_\Omega$. The expected sum of discounted costs associated with a set of options and the policy over them is given by:

$$D_\Omega(s, \omega) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t\, c(S_t, \Omega_t) \,\middle|\, S_0 = s,\, \Omega_0 = \omega\right].$$
We first formulate our goal of maximizing the expected return while keeping the deliberation cost low as a constrained optimization problem:

$$\max_\theta\; \mathbb{E}_\alpha\big[Q_\Omega(S_0, \Omega_0)\big] \quad \text{subject to} \quad \mathbb{E}_\alpha\big[D_\Omega(S_0, \Omega_0)\big] \leq k,$$

where $\alpha$ is an initial distribution over state-option pairs. But in general, solving a problem of this form [Altman1999]
requires a Linear Programming (LP) formulation, which is both expensive to solve and incompatible with the model-free learning methods adopted in this work. Instead, we consider the unconstrained optimization problem arising from the Lagrangian formulation [Sennott1991, Altman1999]:

$$\max_\theta\; \mathbb{E}_\alpha\big[Q_\Omega(S_0, \Omega_0) - \eta\, D_\Omega(S_0, \Omega_0)\big], \tag{5}$$

where $\eta$ is a regularization coefficient. While (5) shows the option-value function and the deliberation cost function as separate entities, they can in fact be seen as a single MDP whose reward function is the difference of the base MDP reward and the cost function:

$$r'(s, \omega, a) = r(s, a) - \eta\, c(s, \omega). \tag{6}$$
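Folding the cost into the reward works because expectation is linear: the return of the transformed reward equals the base return minus $\eta$ times the discounted cost. A quick check on an invented trajectory of per-step rewards and switching costs:

```python
# Linearity check for the reward transformation r' = r - eta * c:
# discounting the combined reward equals discounting each part separately.
GAMMA, ETA = 0.9, 0.1

rewards = [1.0, 0.0, 2.0, 1.5]   # invented per-step base rewards
costs   = [1.0, 0.0, 0.0, 1.0]   # invented per-step deliberation costs

def discounted(xs, gamma=GAMMA):
    return sum(gamma ** t * x for t, x in enumerate(xs))

transformed = discounted([r - ETA * c for r, c in zip(rewards, costs)])
separate = discounted(rewards) - ETA * discounted(costs)
```

This is why, when the two discount factors coincide, a single critic over the transformed reward suffices.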
Therefore, there is a set of Bellman equations which the value function $Q'_\Omega$ over the transformed reward function satisfies:

$$Q'_\Omega(s, \omega) = \sum_a \pi_\omega(a \mid s)\Big(r(s, a) - \eta\, c(s, \omega) + \gamma \sum_{s'} P(s' \mid s, a)\, U'(\omega, s')\Big).$$
Similarly, there exist Bellman optimality equations in the sense of [Sutton et al.1999b] for the parameters $\theta_{\pi_\Omega}$ of the policy over options:

$$Q'^{*}_\Omega(s, \omega) = \max_{\theta_{\pi_\Omega}} Q'_\Omega(s, \omega;\, \theta_{\pi_\Omega}), \tag{7}$$

where the notation $Q'_\Omega(s, \omega;\, \theta_{\pi_\Omega})$ indicates that the parameters for the options are kept fixed and only $\theta_{\pi_\Omega}$ is allowed to change. A policy over options is $\eta$-optimal with respect to a set of options if it reaches the maximum in (7) for a given $\eta$. Clearly, when $\eta = 0$, the corresponding policy over options is also optimal in the base MDP and there is no loss of optimality in this regard.
One way to favor long options is via a cost function which penalizes frequent option switches. In the same way that the MDP formulation allows for randomized reward functions [Puterman1994], we can also capture the random event of switching through the immediate cost function. Since $\beta_\omega(s')$ is the mean of a Bernoulli random variable over the two possible outcomes, switching or continuing ($1$ or $0$), the cost function corresponding to the switching event is $c(s', \omega) = \gamma\, \beta_\omega(s')$ (where $\gamma$ was added for mathematical convenience).
When expanding the value function over the transformed reward (6) for this choice of $c$, we get:

$$Q'_\Omega(s, \omega) = \sum_a \pi_\omega(a \mid s)\Big(r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\big(Q'_\Omega(s', \omega) - \beta_\omega(s')\,(A'_\Omega(s', \omega) + \eta)\big)\Big),$$

with $\eta$ appearing along with the advantage function $A'_\Omega(s', \omega) = Q'_\Omega(s', \omega) - V'_\Omega(s')$: a term which would otherwise be absent from the intra-option Bellman equations over the base MDP (2). Therefore, adding the switching cost function to the base MDP reward contributes a scalar margin $\eta$ to the advantage function over the transformed reward. When learning termination functions in option-critic, the termination gradient for the unconstrained problem (5) is then of the form:

$$\frac{\partial J'}{\partial \nu} = -\sum_{s', \omega} \mu_\Omega(s', \omega)\, \frac{\partial \beta_\omega(s')}{\partial \nu}\,\big(A'_\Omega(s', \omega) + \eta\big). \tag{9}$$
Hence, $\eta$ sets a margin, or a baseline, for how good an option ought to be: a correction which might be due to approximation error or reflect some form of uncertainty in the value estimates. By increasing its value, we can reduce the gap in the advantage function, tilting the balance in favor of maintaining an option rather than terminating it.
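The regularized termination update implied by this gradient can be sketched for a single state with a sigmoid termination function. All constants (the learning rate, parameter value, advantage) are invented for illustration, and the update direction follows the $-\frac{\partial \beta}{\partial \nu}(A + \eta)$ form discussed above:

```python
# One gradient step on a sigmoid termination parameter nu, with the
# deliberation-cost margin eta added to the advantage.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def termination_update(nu, advantage, eta, lr=0.5):
    beta = sigmoid(nu)
    dbeta = beta * (1.0 - beta)           # d beta / d nu for a sigmoid
    # ascent on the return == descent along dbeta * (A + eta)
    return nu - lr * dbeta * (advantage + eta)

# With zero advantage and no margin, nothing changes; a positive margin
# lowers nu, hence lowers the termination probability (longer options).
```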
Due to the generality of our formulation, the discount factor of the deliberation cost function can be different from that of the value function over the base MDP reward. The unconstrained formulation (5) then becomes a function of two discount factors, $\gamma$ for the base MDP and $\gamma_c$ for the deliberation cost function:

$$\max_\theta\; \mathbb{E}_\alpha\big[Q^{\gamma}_\Omega(S_0, \Omega_0) - \eta\, D^{\gamma_c}_\Omega(S_0, \Omega_0)\big]. \tag{8}$$
Since, for the switching cost $c(s', \omega) = \gamma\, \beta_\omega(s')$, setting $\gamma_c = 0$ truncates the deliberation cost function to its immediate term, the derivative of the deliberation cost with respect to the termination parameters leaves only one term: $\gamma\, \frac{\partial \beta_\omega(s')}{\partial \nu}$. Hence, by linearity with (4), the derivative over the mixed objective is:

$$\frac{\partial}{\partial \nu}\big(J - \eta D\big) = -\sum_{s', \omega} \mu_\Omega(s', \omega)\, \frac{\partial \beta_\omega(s')}{\partial \nu}\,\big(A_\Omega(s', \omega) + \eta\big). \tag{10}$$
While similar to (9) in the sense that the margin $\eta$ also enters the advantage function, (10) differs fundamentally in the fact that it depends on $A_\Omega$ and not $A'_\Omega$, the advantage function over the transformed reward. We can also see that when $\gamma_c = \gamma$, we recover the same form for the derivative of the expected return in the transformed MDP from (9).
The discount factor $\gamma_c$ for the deliberation cost function provides a mechanism for truncating the sum of costs. Therefore, it plays a distinct role from the regularization coefficient $\eta$, which merely scales the deliberation cost function but does not affect the computational horizon. As opposed to the random horizon set by the discount factor $\gamma$ in the environment, $\gamma_c$ pertains to the internal environment of the agent: the cost of its own cognitive or computational processes. It is a parameter of an introspective process of self-prediction of how likely a sequence of internal costs is to be accumulated as a result of deliberating about courses of action in the outside environment. In accordance with more general results on discounting [Petrik and Scherrer2008, Jiang et al.2015], $\gamma_c$ should be aligned with the representational capacity of the system, since a larger $\gamma_c$ involves an increasingly more difficult prediction problem.
In that sense, $\gamma_c = 0$ indicates that only the immediate computational cost should be considered when learning options that also maximize the reward. When learning termination functions, the resulting shallow evaluation under small values of $\gamma_c$ might not take into account the possibility that the overall expected cost could be lowered in exchange for a less favorable immediate cost: it lacks foresight. Despite the fact that the full effect of a change in the options or the policy over them cannot be captured with $\gamma_c = 0$, the corresponding gradient (10) is still useful. It leads to both the regularization strategy proposed in [Bacon et al.2017] for gradient-based learning and that of [Mann et al.2014] in the dynamic programming case. Furthermore, since (10) does not depend on the transformed reward, values can be learned for the original reward function only, and do not require mixed or separate estimates.
Previous results [Bacon et al.2017] in the Arcade Learning Environment [Bellemare et al.2013] have shown that while learning options end-to-end is possible, frequent terminations can become an issue unless regularization is used. Hence, we chose to apply the idea of deliberation cost in combination with a novel option-critic implementation based on the Asynchronous Advantage Actor-Critic (A3C) architecture of [Mnih et al.2016]. More specifically, our experiments are meant to assess: the interpretability of the resulting options, whether degeneracies (frequent terminations) to single-step options can be controlled, and whether the deliberation cost can provide an inductive bias for learning faster.
The option-critic architecture [Bacon et al.2017] introduced a deep RL version of the algorithm, which allowed one to learn options in an end-to-end fashion, directly from pixels. However, it was built on top of the DQN algorithm [Mnih et al.2015], which is an off-line algorithm using samples from an experience replay buffer. Option-critic, on the other hand, is an on-line algorithm which uses every new sampled transition for its updates. Using on-line samples has been known to cause issues when training deep networks.
Recently, the asynchronous advantage actor-critic (A3C) algorithm [Mnih et al.2016] addressed this issue and led to stable on-line learning by running multiple parallel agents. The parallel agents allow the deep networks to see samples from very different states, which greatly stabilizes learning. This algorithm is also much more consistent with the spirit of option-critic, as they both use on-line policy gradients to train. We introduce the asynchronous advantage option-critic (A2OC), an algorithm (alg. 1) that learns options in a similar way to A3C but within the option-critic architecture. The source code is available at https://github.com/jeanharb/a2oc_delib.
The architecture used for A2OC was kept as consistent with A3C as possible. We use a convolutional neural network of the same size, which outputs a feature vector shared among 3 heads as in [Bacon et al.2017]: the option policies, the termination functions and the Q-value networks. The option policies are linear softmax functions, the termination functions use sigmoid activations to represent probabilities of terminating, and the Q-values are simply linear layers. During training, all gradients are summed together, and updating is performed in a single thread instance. A3C only needs to learn a value function for its policy, as opposed to Q-values for every action. Similarly, A2OC gets away without the action dimension through sampling [Bacon et al.2017], but needs to maintain state-option values because of the underlying augmented state space.
As for the hyperparameters, we use an ε-greedy policy over options. The preprocessing is the same as in A3C, with RGB pixels scaled down to grayscale images. The agent repeats each action for 4 consecutive frames and receives stacks of 4 frames as inputs. We used entropy regularization, which pushes option policies not to collapse to deterministic policies. The same learning rate was used in all experiments, and the agent was trained with several parallel threads.
We use Amidar, a game of the Atari 2600 suite, as an environment that allows us to analyze the option policies and terminations qualitatively. The game is grid-like and the task is to cover as much ground as possible, without running into enemies.
Without a deliberation cost, the options eventually learn to terminate at every step, as seen in figure 1(a), where every color in the figure represents a different option being executed at the time the agent was in that location. In contrast, figure 1(b) shows the effect of training an agent with a deliberation cost, which persists with an option over a long period of time. The temporally extended structure of the options, shown by color, does not result from simply terminating and re-picking the same option at every step, but truly represents a contiguous segment of the trajectory where that option was maintained in a call-and-return fashion. Only at certain intersections do the options terminate, allowing the agent to select an option which will lead it in a different direction. As opposed to the agent trained without a deliberation cost (fig. 1(a)), figure 1(b) shows that the options learned with the regularizer are specialized and only selected in specific scenarios. Figure 1(c) shows where the agent terminated options on its trajectory. The options clearly terminate at intersections, which represent key decision points.
We also trained agents with multiple levels of deliberation cost, over an evenly spaced range of increasing values. The range of values was chosen according to the general scale of the state values proper to these environments. As figure 4 shows, an increase in deliberation cost quickly decreases the average termination probabilities, as expected from the formulation (5). When no deliberation cost is used, the termination probabilities rise very quickly, meaning each option only lasts a single time-step. The decrease in probability is not the same in every environment; this is due to the difference in returns. The deliberation cost has an effect proportional to its ratio with the state values. Intuitively, environments with many high rewards would indeed require a larger deliberation cost to have substantial effects.
We presented the use of a deliberation cost as a way to incentivize the creation of options which persist for a longer period of time. Using this approach in the option-critic architecture yields both good performance and options which are intuitive and do not shrink over time. In doing so, we also outlined a connection between our more general notion of deliberation cost and previous notions of regularization from [Mann et al.2014] and [Bacon et al.2017].
The deliberation cost goes beyond the idea of penalizing lengthy computation. It can also be used to incorporate other forms of bounds intrinsic to an agent in its environment. One interesting direction for future work is to think of deliberation cost in terms of missed opportunity, opening the way for an implicit form of regularization when interacting asynchronously with an environment. Another interesting form of limitation inherent to reinforcement learning agents has to do with their representational capacity when estimating action values. Preliminary work seems to indicate that the error decomposition for the action values could also be expressed in the form of a deliberation cost.
[Bellemare et al.2013] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.