1 Introduction
Humans can reason about symbolic objects and solve algorithmic problems. After learning to count and then manipulate numbers via simple arithmetic, people eventually learn to invent new algorithms and even reason about their correctness and efficiency. The ability to invent new algorithms is fundamental to artificial intelligence (AI). Although symbolic reasoning has a long history in AI
(Russell et al., 2003), only recently have statistical machine learning and neural network approaches begun to make headway in automated algorithm discovery
(Reed & de Freitas, 2016; Kaiser & Sutskever, 2016; Neelakantan et al., 2016), which would constitute an important milestone on the path to AI. Nevertheless, most of the recent successes depend on the use of strong supervision to learn a mapping from a set of training inputs to outputs by maximizing a conditional log-likelihood, very much like neural machine translation systems
(Sutskever et al., 2014; Bahdanau et al., 2015). Such a dependence on strong supervision is a significant limitation that does not match the ability of people to invent new algorithmic procedures based solely on trial and error. By contrast, reinforcement learning (RL) methods (Sutton & Barto, 1998) hold the promise of searching over discrete objects such as symbolic representations of algorithms by considering much weaker feedback in the form of a simple verifier that tests the correctness of a program execution on a given problem instance. Despite the recent excitement around the use of RL to tackle Atari games (Mnih et al., 2015) and Go (Silver et al., 2016), standard RL methods are not yet able to consistently and reliably solve algorithmic tasks in all but the simplest cases (Zaremba & Sutskever, 2014). A key property of algorithmic problems that makes them challenging for RL is reward sparsity, i.e., a policy usually has to get a long action sequence exactly right to obtain a non-zero reward.
We believe one of the key factors limiting the effectiveness of current RL methods in a sparse reward setting is the use of undirected exploration strategies (Thrun, 1992), such as ε-greedy exploration and entropy regularization (Williams & Peng, 1991). For long action sequences with delayed sparse reward, it is hopeless to explore the space uniformly and blindly. Instead, we propose a formulation to encourage exploration of action sequences that are underappreciated by the current policy. Our formulation considers an action sequence to be underappreciated if the model's log-probability assigned to the action sequence underestimates the resulting reward. Exploring underappreciated states and actions encourages the policy to achieve a better calibration between its log-probabilities and observed reward values, even for action sequences with negligible rewards. This effectively increases exploration around neglected action sequences.
We term our proposed technique underappreciated reward exploration (UREX). We show that the objective given by UREX is a combination of a mode-seeking objective (standard REINFORCE) and a mean-seeking term, which provides a well-motivated trade-off between exploitation and exploration. To empirically evaluate our method, we take a set of algorithmic tasks such as sequence reversal, multi-digit addition, and binary search. We choose to focus on these tasks because, although simple, they present a difficult sparse reward setting which has limited the success of standard RL approaches. The experiments demonstrate that UREX significantly outperforms baseline RL methods, such as entropy-regularized REINFORCE and one-step Q-learning, especially on the more difficult tasks, such as multi-digit addition. Moreover, UREX is shown to be more robust to changes of hyperparameters, which makes hyperparameter tuning less tedious in practice. In addition to introducing a new variant of policy gradient with improved performance, our paper is the first to demonstrate strong results for an RL method on algorithmic tasks. To our knowledge, the addition task has not been solved by any model-free reinforcement learning approach. We observe that some of the policies learned by UREX can successfully generalize to long sequences; e.g., in several of the random restarts, the policy learned by UREX for the addition task correctly generalizes to adding numbers far longer than any seen during training, with no mistakes.
2 Neural Networks for Learning Algorithms
Although research on using neural networks to learn algorithms has witnessed a surge of recent interest, the problem of program induction from examples has a long history in many fields, including inductive logic programming
(Lavrac & Dzeroski, 1994), relational learning (Kemp et al., 2007) and regular language learning (Angluin, 1987). Rather than presenting a comprehensive survey of program induction here, we focus on neural network approaches to algorithmic tasks and highlight the relative simplicity of our neural network architecture. Most successful applications of neural networks to algorithmic tasks rely on strong supervision, where the inputs and target outputs are completely known
a priori. Given a dataset of examples, one learns the network parameters by maximizing the conditional likelihood of the outputs via backpropagation (
e.g., Reed & de Freitas (2016); Kaiser & Sutskever (2016); Vinyals et al. (2015)). However, target outputs may not be available for novel tasks, for which no existing algorithm is known. A more desirable approach to inducing algorithms, followed in this paper, advocates using self-driven learning strategies that only receive reinforcement based on the outputs produced. Hence, just by having access to a verifier for an algorithmic problem, one can aim to learn an algorithm. For example, if one does not know how to sort an array, but can check the extent to which an array is sorted, then one can provide the reward signal necessary for learning sorting algorithms. We formulate learning algorithms as an RL problem and make use of model-free policy gradient methods to optimize a set of parameters associated with the algorithm. In this setting, the goal is to learn a policy π_θ that, given an observed state s_t at step t, estimates a distribution over the next action a_t, denoted π_θ(a_t | s_t). Actions represent the commands within the algorithm and states represent the joint state of the algorithm and the environment. Previous work in this area has focused on augmenting a neural network with additional structure and increased capabilities (Zaremba & Sutskever, 2015; Graves et al., 2016)
. In contrast, we utilize a simple architecture based on a standard recurrent neural network (RNN) with LSTM cells
(Hochreiter & Schmidhuber, 1997) as depicted in Figure 1. At each episode, the environment is initialized with a latent state h, unknown to the agent, which determines the initial observation and the subsequent state transition and reward functions. Once the agent observes the initial state s_1 as the input to the RNN, the network outputs a distribution π_θ(a_1 | s_1), from which an action a_1 is sampled. This action is applied to the environment, and the agent receives a new state observation s_2. The state s_2 and the previous action a_1 are then fed into the RNN and the process repeats until the end of the episode. Upon termination, a reward signal is received.
3 Learning a Policy by Maximizing Expected Reward
We start by discussing the most common form of policy gradient, REINFORCE (Williams, 1992), and its entropy-regularized variant (Williams & Peng, 1991). REINFORCE has been applied to model-free policy-based learning with neural networks and algorithmic domains (Zaremba & Sutskever, 2015; Graves et al., 2016).
The goal is to learn a policy π_θ that, given an observed state s_t at step t, estimates a distribution over the next action a_t, denoted π_θ(a_t | s_t). The environment is initialized with a latent vector h, which determines the initial observed state s_1 and the transition function. Note that the use of non-deterministic transitions
as in Markov decision processes (MDPs) may be recovered by assuming that h
includes the random seed for any non-deterministic functions. Given a latent state h and the induced states s_{1:T}, the model probability of an action sequence a = (a_1, …, a_T) is expressed as,

π_θ(a | h) = ∏_{t=1}^{T} π_θ(a_t | s_t).

The environment provides a reward at the end of the episode, denoted r(a | h). For ease of readability, we drop the conditioning on h and simply write π_θ(a) and r(a).
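Because the policy factorizes over time steps, the log-probability of an entire action sequence is just the sum of the per-step log-probabilities. A minimal sketch of this bookkeeping (the function name is ours, not from the paper):

```python
import math

def sequence_log_prob(step_probs):
    """log pi(a_1..a_T | h): the probability of an action sequence under the
    policy factorizes over time steps, so its log is the sum of the per-step
    log-probabilities pi(a_t | s_t)."""
    return sum(math.log(p) for p in step_probs)
```

For example, two steps each taken with probability 0.5 give a sequence log-probability of log(0.25).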
The objective used to optimize the policy parameters θ consists of maximizing expected reward under actions drawn from the policy, plus an optional maximum entropy regularizer. Given a distribution p(h) over initial latent environment states, we express the regularized expected reward as,
O_RL(θ; τ) = E_{h∼p(h)} [ Σ_{a∈A} π_θ(a | h) r(a | h) + τ H( π_θ(· | h) ) ]  (1)
When π_θ is a non-linear function defined by a neural network, finding the global optimum of θ is challenging, and one often resorts to gradient-based methods to find a local optimum of O_RL. Given that ∇_θ π_θ(a | h) = π_θ(a | h) ∇_θ log π_θ(a | h) for any a such that π_θ(a | h) > 0, one can verify that,
∇_θ O_RL(θ; τ) = E_{h∼p(h)} [ Σ_{a∈A} π_θ(a | h) ∇_θ log π_θ(a | h) ( r(a | h) − τ log π_θ(a | h) − τ ) ]  (2)
Because the space of possible action sequences is large, enumerating over all of the actions to compute this gradient is infeasible. Williams (1992) proposed to compute the stochastic gradient of the expected reward by using Monte Carlo samples. One first draws N i.i.d. samples h^(n) from the distribution over latent environment states p(h), and then draws K i.i.d. action-sequence samples a^(n,k) from π_θ(a | h^(n)) to approximate the gradient of (1) by using (2) as,
∇_θ O_RL(θ; τ) ≈ (1/N) Σ_{n=1}^{N} (1/K) Σ_{k=1}^{K} ∇_θ log π_θ(a^(n,k) | h^(n)) ( r(a^(n,k) | h^(n)) − τ log π_θ(a^(n,k) | h^(n)) − τ )  (3)
This reparametrization of the gradients is the key to the REINFORCE algorithm. To reduce the variance of (3), one uses rewards that are shifted by some offset values,

∇_θ O_RL(θ; τ) ≈ (1/N) Σ_{n=1}^{N} (1/K) Σ_{k=1}^{K} ∇_θ log π_θ(a^(n,k) | h^(n)) ( r(a^(n,k) | h^(n)) − b(h^(n)) − τ log π_θ(a^(n,k) | h^(n)) − τ )  (4)
where b(h) is known as a baseline or sometimes called a critic. Note that subtracting any offset from the rewards in (1) simply results in shifting the objective by a constant.
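A concrete sketch of this Monte Carlo estimator for a single environment sample, with the entropy term omitted for simplicity (function and argument names are ours, not the paper's):

```python
def reinforce_gradient(log_prob_grads, rewards, baseline=0.0):
    """Monte Carlo REINFORCE estimate: average over K samples of the
    grad-log-probability vector scaled by the baseline-shifted reward (r - b).

    log_prob_grads: list of K gradient vectors (one per sampled sequence).
    rewards: list of K episodic rewards.
    baseline: scalar offset b; subtracting it leaves the expected gradient
    unchanged but reduces variance.
    """
    k = len(rewards)
    dim = len(log_prob_grads[0])
    grad = [0.0] * dim
    for g, r in zip(log_prob_grads, rewards):
        for d in range(dim):
            grad[d] += (r - baseline) * g[d] / k
    return grad
```

A common choice for the baseline is the mean reward of the minibatch, in which case samples with above-average reward have their log-probabilities increased and below-average samples have theirs decreased.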
Unfortunately, directly maximizing expected reward (i.e., (1) with τ = 0) is prone to getting trapped in a local optimum. To combat this tendency, Williams & Peng (1991) augmented the expected reward objective by including a maximum entropy regularizer (τ > 0) to promote greater exploration. We will refer to this variant of REINFORCE as MENT (maximum entropy exploration).
4 Underappreciated Reward Exploration (UREX)
To explain our novel form of policy gradient, we first note that the optimal policy π*_τ, which globally maximizes O_RL(θ; τ) in (1) for any τ > 0, can be expressed as,
π*_τ(a | h) = (1/Z(h)) exp( r(a | h) / τ )  (5)
where Z(h) is a normalization constant making π*_τ(· | h) a distribution over the space of action sequences. One can verify this by first acknowledging that,
O_RL(θ; τ) = E_{h∼p(h)} [ −τ D_KL( π_θ(· | h) ‖ π*_τ(· | h) ) + τ log Z(h) ]  (6)
Since D_KL( π_θ ‖ π*_τ ) is non-negative and zero iff π_θ = π*_τ, the policy π*_τ defined in (5) maximizes O_RL. That said, given a particular parametric form of π_θ, finding a θ that exactly characterizes π*_τ may not be feasible.
The divergence D_KL( π_θ ‖ π*_τ ) is known to be mode-seeking (Murphy, 2012, Section 21.2.2) even with entropy regularization (τ > 0). Learning a policy by optimizing this direction of the KL is prone to falling into a local optimum, resulting in a suboptimal policy that omits some of the modes of π*_τ. Although entropy regularization helps mitigate this issue, as confirmed in our experiments, it is not an effective exploration strategy, as it is undirected and requires a small regularization coefficient to avoid too much random exploration. Instead, we propose a directed exploration strategy that improves the mean-seeking behavior of policy gradient in a principled way.
We start by considering the alternate mean-seeking direction of the KL divergence, D_KL( π*_τ ‖ π_θ ). Norouzi et al. (2016) considered this direction of the KL to directly learn a policy by optimizing
O_RAML(θ; τ) = E_{h∼p(h)} [ τ Σ_{a∈A} π*_τ(a | h) log π_θ(a | h) ]  (7)
for structured prediction. This objective has the same optimal solution π*_τ as O_RL, since,
O_RAML(θ; τ) = E_{h∼p(h)} [ −τ D_KL( π*_τ(· | h) ‖ π_θ(· | h) ) − τ H( π*_τ(· | h) ) ]  (8)
Norouzi et al. (2016) argue that in some structured prediction problems, when one can draw samples from π*_τ, optimizing (7) is more effective than (1), since no sampling from a non-stationary policy is required. If π_θ is a log-linear model over a set of features, O_RAML is convex in θ whereas O_RL is not, even in the log-linear case. Unfortunately, in scenarios where the reward landscape is unknown or computing the normalization constant Z(h) is intractable, sampling from π*_τ is not straightforward.
In RL problems, the reward landscape is completely unknown, hence sampling from π*_τ is intractable. This paper proposes to approximate the expectation with respect to π*_τ by using self-normalized importance sampling (Owen, 2013), where the proposal distribution is π_θ and the reference distribution is π*_τ. For importance sampling, one draws K i.i.d. samples a^(k) from π_θ(a | h) and computes a set of normalized importance weights to approximate (7) as,
τ Σ_{a∈A} π*_τ(a | h) log π_θ(a | h) ≈ τ Σ_{k=1}^{K} w_k log π_θ(a^(k) | h)  (9)
where w_k denotes a normalized importance weight defined by,
w_k = exp( r(a^(k) | h)/τ − log π_θ(a^(k) | h) ) / Σ_{m=1}^{K} exp( r(a^(m) | h)/τ − log π_θ(a^(m) | h) )  (10)
One can view these importance weights as evaluating the discrepancy between the scaled rewards r(a | h)/τ and the policy's log-probabilities log π_θ(a | h). Among the K samples, a sample that is least appreciated by the model, i.e., has the largest value of r(a^(k) | h)/τ − log π_θ(a^(k) | h), receives the largest positive feedback in (9).
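A minimal sketch of this weight computation, using the usual max-subtraction trick to stabilize the exponentials (the helper name is ours):

```python
import math

def urex_weights(rewards, log_probs, tau=0.1):
    """Self-normalized importance weights: w_k proportional to
    exp(r_k / tau - log pi_k), normalized to sum to 1.

    A sample whose log-probability most underestimates its scaled reward
    (largest r/tau - log pi) is the most under-appreciated and receives the
    largest weight. The max score is subtracted before exponentiating for
    numerical stability; this leaves the normalized weights unchanged.
    """
    scores = [r / tau - lp for r, lp in zip(rewards, log_probs)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

With equal rewards, the sample the policy assigns lower probability to receives the larger weight, which is exactly the under-appreciation effect described above.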
In practice, we have found that just using the importance sampling RAML objective in (9) does not always yield promising solutions. Particularly, at the beginning of training, when π_θ is still far away from π*_τ, the variance of the importance weights is too large, and the self-normalized importance sampling procedure results in poor approximations. To stabilize early phases of training and ensure that the model distribution achieves a large expected reward, we combine the expected reward and RAML objectives to benefit from the best of their mode- and mean-seeking behaviors. Accordingly, we propose the following objective, which we call underappreciated reward exploration (UREX),
O_UREX(θ; τ) = E_{h∼p(h)} [ Σ_{a∈A} π_θ(a | h) r(a | h) + τ Σ_{a∈A} π*_τ(a | h) log π_θ(a | h) ]  (11)
which is the sum of the expected reward and RAML objectives. In our preliminary experiments, we considered a composite objective that also included the entropy regularizer, but we found that removing the entropy term is beneficial. Hence, the O_UREX objective does not include entropy regularization. Accordingly, the optimum policy for O_UREX is no longer π*_τ, as it was for O_RL and O_RAML. Appendix A derives the optimal policy for O_UREX as a function of the optimal policy for O_RL. We find that the optimal policy of UREX is more sharply concentrated on the high-reward regions of the action space, which may be an advantage for UREX, but we leave more analysis of this behavior to future work.
To compute the gradient of O_UREX, we use the self-normalized importance sampling estimate outlined in (9). We treat the importance weights w_k as constant, so they contribute no gradient terms themselves. To approximate the gradient, one draws N i.i.d. samples h^(n) from the latent environment states p(h), and then draws K i.i.d. samples a^(n,k) from π_θ(a | h^(n)) to obtain
∇_θ O_UREX(θ; τ) ≈ (1/N) Σ_{n=1}^{N} Σ_{k=1}^{K} [ (1/K) ( r(a^(n,k) | h^(n)) − b(h^(n)) ) + τ w_{n,k} ] ∇_θ log π_θ(a^(n,k) | h^(n))  (12)
As with REINFORCE, the rewards are shifted by an offset b(h^(n)). In this gradient, the model log-probability of a sampled action sequence is reinforced if the corresponding reward is large, or if the corresponding importance weight is large, meaning that the action sequence is underappreciated. The normalized importance weights w_{n,k} are computed using a softmax operator over the values r(a^(n,k) | h^(n))/τ − log π_θ(a^(n,k) | h^(n)).
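Putting the two terms together, the following sketch computes the per-sample coefficients that scale ∇_θ log π_θ in this gradient, for a single environment sample and assuming a mean-reward baseline (the helper name and the baseline choice are our assumptions; the paper's exact scaling may differ):

```python
import math

def urex_gradient_coefficients(rewards, log_probs, tau=0.1):
    """Per-sample scaling of grad log pi in a UREX-style update: a REINFORCE
    term (mean-shifted reward, averaged over the K samples) plus tau times
    the self-normalized importance weight of the RAML term."""
    k = len(rewards)
    baseline = sum(rewards) / k          # mean-reward baseline b
    scores = [r / tau - lp for r, lp in zip(rewards, log_probs)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # normalized importance weights (softmax)
    return [(r - baseline) / k + tau * w for r, w in zip(rewards, weights)]
```

Note that the REINFORCE part of these coefficients sums to zero under the mean baseline, while the RAML part always contributes τ in total, since the weights are normalized.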
5 Related Work
Before presenting the experimental results, we briefly review some pieces of previous work that closely relate to the UREX approach.
Reward-Weighted Regression. Both the RAML and UREX objectives bear some similarity to a method in continuous control known as Reward-Weighted Regression (RWR) (Peters & Schaal, 2007; Wierstra et al., 2008). Using our notation, the RWR objective is expressed as,
O_RWR(θ) = E_{h∼p(h)} [ log Σ_{a∈A} π_θ(a | h) r(a | h) ]  (13)

O_RWR(θ) ≥ E_{h∼p(h)} [ Σ_{a∈A} q(a | h) ( log π_θ(a | h) + log r(a | h) − log q(a | h) ) ]  (14)
To optimize O_RWR, Peters & Schaal (2007) propose a technique inspired by the EM algorithm to maximize the variational lower bound in (14) based on a variational distribution q(a | h). The RWR objective can be interpreted as the log of the correlation between π_θ and r. By contrast, the RAML and UREX objectives are both based on a KL divergence between π*_τ and π_θ.
To optimize the RWR objective, one formulates the gradient as,
∇_θ O_RWR(θ) = E_{h∼p(h)} [ Σ_{a∈A} ( π_θ(a | h) r(a | h) / Z_θ(h) ) ∇_θ log π_θ(a | h) ]  (15)
where Z_θ(h) denotes the normalization factor, i.e., Z_θ(h) = Σ_{a∈A} π_θ(a | h) r(a | h). The expectation on the RHS can be approximated by self-normalized importance sampling,^1 where the proposal distribution is π_θ. Accordingly, one draws K Monte Carlo samples a^(k) i.i.d. from π_θ(a | h) and formulates the gradient as,

^1 Bornschein & Bengio (2014) apply the same trick to optimize the log-likelihood of latent variable models.
∇_θ O_RWR(θ) ≈ E_{h∼p(h)} [ Σ_{k=1}^{K} v_k ∇_θ log π_θ(a^(k) | h) ]  (16)

where v_k = r(a^(k) | h) / Σ_{m=1}^{K} r(a^(m) | h). There is some similarity between (16) and (9) in that they both use self-normalized importance sampling, but note the critical difference that (16) and (9) estimate the gradients of two different objectives, and hence the importance weights in (16), unlike those in (9), do not correct for the sampling distribution.
Beyond these important technical differences, the optimal policy of O_RWR is a one-hot distribution with all probability mass concentrated on an action sequence with maximal reward, whereas the optimal policies for RAML and UREX are everywhere non-zero, with the probability of different action sequences assigned in proportion to their exponentiated rewards (with UREX introducing an additional rescaling; see Appendix A). Further, the notion of underappreciated reward exploration evident in (9), which is key to UREX's performance, is missing in the RWR formulation.
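The contrast between the two weighting schemes can be made concrete with a small sketch (helper names are ours):

```python
import math

def rwr_weights(rewards):
    """RWR-style self-normalized weights, as in (16): v_k = r_k / sum_m r_m.
    Samples come from the current policy itself, so no 1/pi correction is
    applied (assumes non-negative rewards with a positive sum)."""
    total = sum(rewards)
    return [r / total for r in rewards]

def urex_style_weights(rewards, log_probs, tau=0.1):
    """UREX-style weights, as in (10), which do correct for the proposal:
    w_k proportional to exp(r_k / tau - log pi_k), via a stable softmax."""
    scores = [r / tau - lp for r, lp in zip(rewards, log_probs)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

With equal rewards, the RWR weights are uniform, whereas the UREX-style weights still favor the less probable, i.e., more under-appreciated, sample.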
Exploration. The RL literature contains many different attempts at incorporating exploration that may be compared with our method. The most common exploration strategy considered in value-based RL is ε-greedy Q-learning, where at each step the agent either takes the best action according to its current value approximation or, with probability ε, takes an action sampled uniformly at random. Like entropy regularization, such an approach applies undirected exploration, but it has achieved recent success in game playing environments (Mnih et al., 2013; Van Hasselt et al., 2016; Mnih et al., 2016).
Prominent approaches to improving exploration beyond ε-greedy in value-based or model-based RL have focused on reducing uncertainty by prioritizing exploration toward states and actions where the agent knows the least. This basic intuition underlies work on counter-based and recency-based methods (Thrun, 1992), exploration methods based on uncertainty estimates of values (Kaelbling, 1993; Tokic, 2010), methods that prioritize learning environment dynamics (Kearns & Singh, 2002; Stadie et al., 2015), and methods that provide an intrinsic motivation or curiosity bonus for exploring unknown states (Schmidhuber, 2006; Bellemare et al., 2016).
In contrast to value-based methods, exploration for policy-based RL methods is often a by-product of the optimization algorithm itself. Since algorithms like REINFORCE and Thompson sampling choose actions according to a stochastic policy, suboptimal actions are chosen with some non-zero probability. The Q-learning algorithm may also be modified to sample an action from the softmax of the Q-values rather than the argmax
(Sutton & Barto, 1998). Asynchronous training has also been reported to have an exploration effect on both value- and policy-based methods. Mnih et al. (2016) report that asynchronous training can stabilize learning by reducing the bias experienced by a single trainer. By using multiple separate trainers, an agent is less likely to become trapped at a policy found to be locally optimal only due to local conditions. In the same spirit, Osband et al. (2016) use multiple Q-value approximators and sample only one to act for each episode as a way to implicitly incorporate exploration.
By relating the concepts of value and policy in RL, the exploration strategy we propose tries to bridge the discrepancy between the two. In particular, UREX can be viewed as a hybrid combination of valuebased and policybased exploration strategies that attempts to capture the benefits of each.
Per-step Reward. Finally, while we restrict ourselves to episodic settings where a reward is associated with an entire episode of states and actions, much work has been done to take advantage of environments that provide per-step rewards. These include policy-based methods such as actor-critic (Mnih et al., 2016; Schulman et al., 2016) and value-based approaches based on Q-learning (Van Hasselt et al., 2016; Schaul et al., 2016). Some of these value-based methods have proposed a softening of Q-values which can be interpreted as adding a form of maximum-entropy regularizer (Asadi & Littman, 2016; Azar et al., 2012; Fox et al., 2016; Ziebart, 2010). The episodic total-reward setting that we consider is naturally harder, since the credit assignment to individual actions within an episode is unclear.
6 Six Algorithmic Tasks
We assess the effectiveness of the proposed approach on five algorithmic tasks from the OpenAI Gym (Brockman et al., 2016), as well as a new binary search problem. Each task is summarized below, with further details available on the Gym website (gym.openai.com) or in the corresponding open-source code (github.com/openai/gym). In each case, the environment has a hidden tape and a hidden sequence. The agent observes the sequence via a pointer to a single character, which can be moved by a set of pointer control actions. Thus, an action is a tuple specifying how to move the pointer, whether to write, and which output symbol to write.


Copy: The agent should emit a copy of the sequence. The pointer actions are move left and right.

DuplicatedInput: In the hidden tape, each character is repeated twice. The agent must deduplicate the sequence and emit every other character. The pointer actions are move left and right.

RepeatCopy: The agent should emit the hidden sequence once, then emit the sequence in the reverse order, then emit the original sequence again. The pointer actions are move left and right.

Reverse: The agent should emit the hidden sequence in the reverse order. As before, the pointer actions are move left and right.

ReversedAddition: The hidden tape is a grid of digits representing two numbers in a fixed base, stored in little-endian order. The agent must emit the sum of the two numbers, also in little-endian order. The allowed pointer actions are move left, right, up, or down.
The OpenAI Gym provides an additional harder task called ReversedAddition3, which involves adding three numbers. We omit this task, since none of the methods make much progress on it.
For these tasks, the input sequences encountered during training vary in length within a fixed range. A reward is given for each correct emission. On an incorrect emission, a small penalty is incurred and the episode is terminated. The agent is also terminated and penalized if the episode exceeds a certain number of steps. For the experiments using UREX and MENT, we associate an episodic sequence of actions with the total reward, defined as the sum of the per-step rewards. The experiments using Q-learning, on the other hand, use the per-step rewards. Each of the Gym tasks has a success threshold, which determines the required average reward over episodes for the agent to be considered successful.
We also conduct experiments on an additional algorithmic task described below:


BinarySearch: Given an integer n, the environment has a hidden array of n distinct numbers stored in ascending order. The environment also has a query number x, unknown to the agent, that is contained somewhere in the array. The goal of the agent is to find the query number in the array in a small number of actions. The environment has three integer registers, initialized to a fixed starting value. At each step, the agent can interact with the environment via the four following actions:


Increment: increment the value of one of the three registers.

Divide: divide the value of one of the registers by 2.

Average: replace the value of one of the registers with the average of the two other registers.

Compare: compare the array value at the index held in one of the registers with the query number x, and receive a signal indicating which value is greater. The agent succeeds when this comparison is invoked on an array cell holding the value x.
The agent is terminated when the number of steps exceeds a maximum threshold, receiving a negative reward. If the agent finds x at step t, it receives a positive reward that is larger the sooner the query is found.

We set the maximum number of steps large enough to allow the agent to perform a full linear search. A policy performing full linear search achieves only a modest average reward, because x is chosen uniformly at random from the elements of the array. A policy employing binary search can find the number in a logarithmic number of steps, and thus achieves a substantially larger average reward when x is selected uniformly at random. We set the success threshold for this task to an average reward between these two levels, so that it can only be surpassed by a policy faster than full linear search.
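To see why a reward threshold can separate the two strategies, the following sketch (plain array search, not the environment's register machine) counts the comparisons each strategy needs:

```python
def linear_search_steps(arr, x):
    """Number of comparisons a left-to-right scan makes before finding x."""
    for i, v in enumerate(arr):
        if v == x:
            return i + 1
    return len(arr)

def binary_search_steps(arr, x):
    """Number of comparisons classic binary search on a sorted array makes
    before finding x: at most floor(log2(n)) + 1 for an array of n elements."""
    lo, hi, steps = 0, len(arr) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        steps += 1
        if arr[mid] == x:
            return steps
        if arr[mid] < x:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps
```

On a sorted array of 512 distinct numbers, the linear scan needs up to 512 comparisons in the worst case, while binary search never needs more than 10, so the average rewards of the two strategies are well separated when the query position is uniform.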
7 Experiments
We compare our policy gradient method using underappreciated reward exploration (UREX) against two main RL baselines: (1) REINFORCE with entropy regularization, termed MENT (Williams & Peng, 1991), where the value of τ determines the degree of regularization. When τ = 0, standard REINFORCE is obtained. (2) One-step double Q-learning based on bootstrapping one-step future rewards.
7.1 Robustness to hyperparameters
Hyperparameter tuning is often tedious for RL algorithms. We found that the proposed UREX method significantly improves robustness to changes in hyperparameters when compared to MENT. For our experiments, we perform a careful grid search over a set of hyperparameters for both MENT and UREX. For each hyperparameter setting, we run the MENT and UREX methods several times with different random restarts. We explore the following main hyperparameters:


The learning rate, denoted η, chosen from a small set of candidate values.

The maximum L2 norm of the gradients, beyond which the gradients are clipped. This parameter, denoted c, matters for training RNNs. The value of c is selected from a small set of candidates.

The temperature parameter τ that controls the degree of exploration for both MENT and UREX. For MENT, we sweep over several values of τ, including τ = 0. For UREX, we only consider a single value of τ, which consistently performs well across the tasks.
In all of the experiments, both MENT and UREX are treated exactly the same. In fact, the change of implementation is just a few lines of code. Given a value of τ, for each task, we run training jobs comprising all combinations of the learning rates, clipping values, and random restarts. We run each algorithm for a maximum number of stochastic gradient steps determined by the difficulty of the task, with the harder tasks given larger budgets. We find that running a training job longer does not result in better performance. Our policy network comprises a single LSTM layer. We use the Adam optimizer (Kingma & Ba, 2015) for the experiments.
                    REINFORCE / MENT (four temperature settings)    UREX

Copy                    85.0    88.3    90.0     3.3                75.0
DuplicatedInput         68.3    73.3    73.3     0.0               100.0
RepeatCopy               0.0     0.0    11.6     0.0                18.3
Reverse                  0.0     0.0     3.3    10.0                16.6
ReversedAddition         0.0     0.0     1.6     0.0                30.0
BinarySearch             0.0     0.0     1.6     0.0                20.0
Table 1 shows the percentage of trials, over different hyperparameters (η, c) and random restarts, which successfully solve each of the algorithmic tasks. It is clear that UREX is more robust than MENT to changes in hyperparameters, even though we only report the results of UREX for a single temperature. See Appendix B for more detailed tables on hyperparameter robustness.
7.2 Results
Table 2 presents the number of successful attempts (out of 5 random restarts) and the expected reward values (averaged over the trials) for each RL algorithm given the best hyperparameters. One-step Q-learning results are also included in the table. We also present the training curves for MENT and UREX in Figure 2. It is clear that UREX outperforms the baselines on these tasks. On the more difficult tasks, such as Reverse and ReversedAddition, UREX is able to consistently find an appropriate algorithm, but MENT and Q-learning fall behind. Importantly, for the BinarySearch task, which exhibits many local maxima and necessitates smart exploration, UREX is the only method that can solve it consistently. The Q-learning baseline solves some of the simple tasks, but it makes little headway on the harder tasks. We believe that entropy regularization for policy gradient and ε-greedy for Q-learning are relatively weak exploration strategies in long episodic tasks with delayed rewards. On such tasks, one random exploratory step in the wrong direction can take the agent off the optimal policy, hampering its ability to learn. In contrast, UREX provides a form of adaptive and smart exploration. In fact, we observe that the variance of the importance weights decreases as the agent approaches the optimal policy, effectively reducing exploration when it is no longer necessary; see Appendix E.
                    Num. of successful attempts out of 5    Expected reward

                    Q-learning    MENT    UREX              Q-learning    MENT    UREX

Copy                     5          5       5                   31.2      31.2    31.2
DuplicatedInput          5          5       5                   15.4      15.4    15.4
RepeatCopy               1          3       4                   39.3      69.2    81.1
Reverse                  0          2       4                    4.4      21.9    27.2
ReversedAddition         0          1       5                    1.1       8.7    30.2
BinarySearch             0          1       4                    5.2       8.6     9.1
7.3 Generalization to longer sequences
To confirm whether our method is able to find the correct algorithm for multi-digit addition, we investigate its generalization to input sequences longer than those seen during training. For each length, we test the model on multiple randomly generated inputs, stopping when the accuracy falls below a fixed threshold. Among the models trained on addition with UREX, we find that several generalize to much longer numbers without any observed mistakes, and on the best UREX hyperparameters, a majority of the random restarts generalize successfully. For more detailed results on the generalization performance on different tasks, including Copy, DuplicatedInput, and ReversedAddition, see Appendix C. During these evaluations, we take the action with the largest probability from π_θ(a_t | s_t) at each time step rather than sampling randomly.
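This evaluation protocol can be sketched as follows, with a mock policy standing in for the trained RNN (all names are ours, not the paper's):

```python
import random

def longest_generalizing_length(solve, make_case, max_len, trials=100):
    """Largest input length (up to max_len) at which `solve` answers every
    randomly generated case correctly; evaluation stops at the first length
    where any case fails.

    solve(inp) -> output; make_case(length) -> (inp, target).
    """
    best = 0
    for length in range(1, max_len + 1):
        for _ in range(trials):
            inp, target = make_case(length)
            if solve(inp) != target:
                return best
        best = length
    return best

def make_reverse_case(n):
    """Random test case for a sequence-reversal task."""
    seq = [random.randint(0, 9) for _ in range(n)]
    return seq, list(reversed(seq))
```

A policy that has genuinely learned the algorithm, like the exact reversal function below, passes every length, while a memorizing policy fails shortly past the training distribution.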
We also looked into the generalization of the models trained on the BinarySearch task. We found that none of the agents perform proper binary search. Rather, those that solved the task perform a hybrid of binary and linear search: the initial actions follow a binary search pattern, but the agent switches to a linear search procedure once it has narrowed down the search space; see Appendix D for some execution traces for BinarySearch and ReversedAddition. Thus, on longer input sequences, the agent's running time complexity approaches linear rather than logarithmic. We hope that future work will make more progress on this task. The task is especially interesting because the reward signal should incorporate both the correctness and the efficiency of the algorithm.
7.4 Implementation details
In all of the experiments, we make use of curriculum learning. The environment begins by only providing small inputs and moves on to longer sequences once the agent achieves close to maximal reward over a number of steps. For policy gradient methods, including MENT and UREX, we only provide the agent with a reward at the end of the episode, and there is no notion of intermediate reward. For the value-based baseline, we implement one-step Q-learning as described in Mnih et al. (2016), employing double Q-learning with ε-greedy exploration. We use the same RNN as in our policy-based approaches to estimate the Q-values. A grid search over the exploration rate, the exploration rate decay, the learning rate, and the sync frequency (between the online and target networks) is conducted to find the best hyperparameters. Unlike our other methods, the Q-learning baseline uses intermediate rewards, as given by the OpenAI Gym on a per-step basis. Hence, the Q-learning baseline has a slight advantage over the policy gradient methods.
In all of the tasks except Copy, our stochastic optimizer uses minibatches of policy samples from the model, corresponding to N distinct random sequences drawn from the environment and K random policy trajectories per sequence, with N and K as defined in (3) and (12). For MENT, we use the K samples to subtract the mean of the coefficient of ∇_θ log π_θ, which includes the contribution of the reward and the entropy regularization. For UREX, we use the K trajectories to subtract the mean reward and to normalize the importance sampling weights. We do not subtract the mean of the normalized importance weights. For the Copy task, we use a different minibatch configuration. Experiments are conducted using TensorFlow (Abadi et al., 2016).
8 Conclusion
We present a variant of policy gradient, called UREX, which promotes the exploration of action sequences that yield rewards larger than what the model expects. This exploration strategy is the result of importance sampling from the optimal policy. Our experimental results demonstrate that UREX significantly outperforms other value- and policy-based methods, while being more robust to changes of hyperparameters. Using UREX, we can solve algorithmic tasks such as multi-digit addition from only episodic reward, which other methods cannot reliably solve even given the best hyperparameters. We introduce a new algorithmic task based on binary search to advocate more research in this area, especially when the computational complexity of the solution is also of interest. Solving these tasks is important not only for developing more human-like intelligence in learning algorithms, but also for generic reinforcement learning, where smart and efficient exploration is the key to successful methods.
9 Acknowledgments
We thank Sergey Levine, Irwan Bello, Corey Lynch, George Tucker, Kelvin Xu, Volodymyr Mnih, and the Google Brain team for insightful comments and discussions.
References
 Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. arXiv:1605.08695, 2016.
 Angluin (1987) Dana Angluin. Learning regular sets from queries and counterexamples. Information and Computation, 1987.
 Asadi & Littman (2016) Kavosh Asadi and Michael L. Littman. A new softmax operator for reinforcement learning. arXiv:1612.05628, 2016.
 Azar et al. (2012) Mohammad Gheshlaghi Azar, Vicenç Gómez, and Hilbert J Kappen. Dynamic policy programming. Journal of Machine Learning Research, 13(Nov):3207–3245, 2012.
 Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. ICLR, 2015.
 Bellemare et al. (2016) Marc G. Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Rémi Munos. Unifying count-based exploration and intrinsic motivation. NIPS, 2016.
 Bornschein & Bengio (2014) Jörg Bornschein and Yoshua Bengio. Reweighted wake-sleep. arXiv:1406.2751, 2014.
 Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv:1606.01540, 2016.
 Fox et al. (2016) Roy Fox, Ari Pakman, and Naftali Tishby. G-learning: Taming the noise in reinforcement learning via soft updates. Uncertainty in Artificial Intelligence, 2016. URL http://arxiv.org/abs/1512.08562.
 Golub (1987) Gene Golub. Some modified matrix eigenvalue problems. SIAM Review, 1987.
 Graves et al. (2016) Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwinska, Sergio G. Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, Adria P. Badia, Karl M. Hermann, Yori Zwols, Georg Ostrovski, Adam Cain, Helen King, Christopher Summerfield, Phil Blunsom, Koray Kavukcuoglu, and Demis Hassabis. Hybrid computing using a neural network with dynamic external memory. Nature, 2016.
 Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 1997.
 Kaelbling (1993) Leslie Pack Kaelbling. Learning in embedded systems. MIT press, 1993.
 Kaiser & Sutskever (2016) Lukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. ICLR, 2016.
 Kearns & Singh (2002) Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 2002.
 Kemp et al. (2007) Charles Kemp, Noah Goodman, and Joshua Tenenbaum. Learning and using relational theories. NIPS, 2007.
 Kingma & Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.
 Lavrac & Dzeroski (1994) N. Lavrac and S. Dzeroski. Inductive Logic Programming: Theory and Methods. Ellis Horwood, 1994.
 Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing Atari with deep reinforcement learning. arXiv:1312.5602, 2013.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al. Human-level control through deep reinforcement learning. Nature, 2015.
 Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. ICML, 2016.
 Murphy (2012) Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
 Neelakantan et al. (2016) Arvind Neelakantan, Quoc V. Le, and Ilya Sutskever. Neural programmer: Inducing latent programs with gradient descent. ICLR, 2016.
 Norouzi et al. (2016) Mohammad Norouzi, Samy Bengio, Zhifeng Chen, Navdeep Jaitly, Mike Schuster, Yonghui Wu, and Dale Schuurmans. Reward augmented maximum likelihood for neural structured prediction. NIPS, 2016.
 Osband et al. (2016) Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. NIPS, 2016.
 Owen (2013) Art B. Owen. Monte Carlo theory, methods and examples. 2013.
 Peters & Schaal (2007) Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th International Conference on Machine Learning, pp. 745–750. ACM, 2007.
 Reed & de Freitas (2016) Scott E. Reed and Nando de Freitas. Neural programmer-interpreters. ICLR, 2016.
 Russell et al. (2003) Stuart Jonathan Russell, Peter Norvig, John F Canny, Jitendra M Malik, and Douglas D Edwards. Artificial intelligence: a modern approach, volume 2. Prentice hall Upper Saddle River, 2003.
 Schaul et al. (2016) Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. ICLR, 2016.
 Schmidhuber (2006) Jürgen Schmidhuber. Optimal artificial curiosity, creativity, music, and the fine arts. Connection Science, 2006.
 Schulman et al. (2016) John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. ICLR, 2016.
 Silver et al. (2016) David Silver, Aja Huang, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.
 Stadie et al. (2015) Bradly C. Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv:1507.00814, 2015.
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. NIPS, 2014.
 Sutton & Barto (1998) Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, 1998.
 Thrun (1992) Sebastian B Thrun. Efficient exploration in reinforcement learning. Technical report, 1992.
 Tokic (2010) Michel Tokic. Adaptive greedy exploration in reinforcement learning based on value differences. AAAI, 2010.
 Van Hasselt et al. (2016) Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. AAAI, 2016.
 Vinyals et al. (2015) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. NIPS, 2015.
 Wierstra et al. (2008) Daan Wierstra, Tom Schaul, Jan Peters, and Juergen Schmidhuber. Episodic reinforcement learning by logistic reward-weighted regression. In International Conference on Artificial Neural Networks, pp. 407–416. Springer, 2008.
 Williams (1992) Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
 Williams & Peng (1991) Ronald J Williams and Jing Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 1991.
 Zaremba & Sutskever (2014) Wojciech Zaremba and Ilya Sutskever. Learning to execute. arXiv:1410.4615, 2014.
 Zaremba & Sutskever (2015) Wojciech Zaremba and Ilya Sutskever. Reinforcement learning neural turing machines. arXiv:1505.00521, 2015.
 Ziebart (2010) Brian D Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. 2010.
Appendix A Optimal Policy for the UREX Objective
To derive the form of the optimal policy for the UREX objective (11), note that for each $h$ one would like to maximize

$$\sum_{a} \pi(a \mid h)\, r(a \mid h) \;+\; \tau\, \mathbb{H}\big(\pi(\cdot \mid h)\big) \qquad (17)$$

subject to the constraint $\sum_{a} \pi(a \mid h) = 1$. To enforce the constraint, we introduce a Lagrange multiplier $\lambda$ and aim to maximize

$$\sum_{a} \pi(a \mid h)\, r(a \mid h) \;-\; \tau \sum_{a} \pi(a \mid h) \log \pi(a \mid h) \;+\; \lambda \Big( \sum_{a} \pi(a \mid h) - 1 \Big). \qquad (18)$$

Since the gradient of the Lagrangian (18) with respect to $\pi(a \mid h)$ is given by

$$r(a \mid h) \;-\; \tau \log \pi(a \mid h) \;-\; \tau \;+\; \lambda, \qquad (19)$$

the optimal choice for $\pi(a \mid h)$ is achieved by setting

$$\pi^{*}(a \mid h) \;=\; \exp\big\{ \big(r(a \mid h) - \tau + \lambda\big)/\tau \big\} \;\propto\; \exp\big\{ r(a \mid h)/\tau \big\}, \qquad (20)$$

forcing the gradient to be zero. The Lagrange multiplier $\lambda$ can then be chosen so that $\sum_{a} \pi^{*}(a \mid h) = 1$ while also satisfying $0 \le \pi^{*}(a \mid h) \le 1$; see e.g. Golub (1987).
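As a numerical sanity check on this derivation, the following sketch (all names illustrative) verifies that the softmax policy $\pi^{*}(a) \propto \exp\{r(a)/\tau\}$ attains the maximum of the entropy-regularized objective over randomly sampled alternative distributions:

```python
import math
import random

def objective(pi, r, tau):
    """Entropy-regularized expected reward:
    sum_a pi(a) r(a) + tau * H(pi), with H(pi) = -sum_a pi(a) log pi(a)."""
    return (sum(p * rr for p, rr in zip(pi, r))
            - tau * sum(p * math.log(p) for p in pi if p > 0))

def softmax(r, tau):
    """Closed-form maximizer: pi*(a) proportional to exp(r(a)/tau)."""
    z = [math.exp(rr / tau) for rr in r]
    total = sum(z)
    return [x / total for x in z]

random.seed(0)
r = [random.uniform(-1.0, 1.0) for _ in range(5)]  # arbitrary rewards
tau = 0.5
pi_star = softmax(r, tau)
best = objective(pi_star, r, tau)

# No randomly sampled distribution should beat the softmax solution.
for _ in range(1000):
    w = [random.random() for _ in range(5)]
    pi = [x / sum(w) for x in w]
    assert objective(pi, r, tau) <= best + 1e-9
```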
Appendix B Robustness to Hyperparameters
Tables 3–8 provide more details on different cells of Table 1. Each table presents the results of MENT using the best temperature vs. UREX, across a variety of learning rates and clipping values. Each cell gives the number of trials, out of the random restarts, that succeed at solving the task with a specific learning rate and clipping value.
MENT ()  UREX ()  

3  5  5  5  5  2 
5  4  5  5  5  3  
3  5  5  4  4  1  
4  5  5  4  5  2  

MENT ()  UREX ()  

3  5  3  5  5  5 
2  5  3  5  5  5  
4  5  3  5  5  5  
2  5  4  5  5  5  

MENT ()  UREX ()  

0  1  0  0  2  0 
0  0  2  0  4  0  
0  0  1  0  2  0  
0  0  3  0  3  0  

MENT ()  UREX ()  

1  1  0  0  0  0 
0  1  0  0  4  0  
0  2  0  0  2  1  
1  0  0  0  2  1  

MENT ()  UREX ()  

0  0  0  0  0  4 
0  0  0  0  3  2  
0  0  0  0  0  5  
0  0  1  0  1  3  

MENT ()  UREX ()  

0  0  0  0  4  0 
0  1  0  0  3  0  
0  0  0  0  3  0  
0  0  0  0  2  0  

Appendix C Generalization to Longer Sequences
Table 9 provides a more detailed look into the generalization performance of the trained models on Copy, DuplicatedInput, and ReversedAddition. The table shows how the number of models that solve the task correctly drops off as the length of the input increases.
Length  Copy  DuplicatedInput  ReversedAddition  
  MENT  UREX  MENT  UREX  MENT  UREX  
30  54  45  44  60  1  18 
100  51  45  36  56  0  6 
500  27  22  19  25  0  5 
1000  3  2  12  17  0  5 
2000  0  0  6  9  0  5 
Max  1126  1326  2000  2000  38  2000 
Appendix D Example Execution Traces
We provide the traces of two trained agents on the ReversedAddition task (Figure 3) and the BinarySearch task (Table 10).
Inferred range  
512  0  0  –  
512  0  256  –  
512  0  256  
256  0  256  –  
256  0  128  –  
256  0  128  
128  0  128  –  
128  0  64  –  
128  0  64  
128  96  64  –  
128  96  64  
128  96  112  –  
128  96  112  
128  120  112  –  
128  120  112  
128  60  112  –  
128  60  94  –  
128  60  94  
128  111  94  –  
128  111  94  
128  112  94  –  
128  112  95  –  
128  112  95  
128  112  96  –  
128  112  96  
128  112  97  –  
128  112  97  
128  112  98  –  
128  112  98  
128  112  99  –  
128  112  99  
128  112  100  –  
128  112  100  –  – 