Generalizing from data in a machine learning algorithm involves a training process, in which the algorithm learns the model structure and parameters that best fit the available data. Training, in turn, depends on a prior design process that defines the hyper-parameters that constrain the conditions of data-driven learning. Setting them properly is crucial to the learning process, and can make the difference between mediocre and state-of-the-art model prediction [hutter_beyond_2015]. In particular, optimizing the hyper-parameters of reinforcement learning (RL) algorithms [sutton_reinforcement_2018] is a hard task, because data is not provided a priori, but is incrementally generated through interactions with the environment. Hence, hyper-parameters determine which data is generated. In turn, such data determines the parameter values, which also influence the next set of data generated, and so on.
Hyper-parameters are usually optimized manually, which can be very inefficient [hutter_beyond_2015], or by methods such as random search [bergstra_random_2012] or Bayesian optimization (BO) [mockus_application_1978] [shahriari_taking_2016]. The latter performs a black-box optimization of a function $f$, resorting both to a prior distribution and to the available data points in order to compute the mean and variance for unseen inputs, typically predicted by Gaussian process (GP) regression [rasmussen_gaussian_2008]. These predictions are used to maximize an acquisition function that is cheap to optimize globally. The most common acquisition function is expected improvement, where the next point is decided by considering the probability that it becomes the new maximum, weighted by the predicted variance. The issue with random search is that the method uses very limited information about previous queries of $f$. On the other hand, while Bayesian optimization uses information regarding past queries, it shares two limitations with random search: 1) it makes no assumption about the influence of hyper-parameters on the information content, and 2) it is not efficient at optimizing categorical hyper-parameters such as the RL algorithm selected. In this work, an algorithm is proposed that, by assuming a hierarchical relationship between RL hyper-parameters, optimizes such structural hyper-parameters first, and then uses traditional Bayesian optimization to tune the real-valued hyper-parameters of the learning algorithm. The proposed algorithm is validated against random search and Bayesian optimization in the classical Cart-pole environment.
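As an illustration of the expected improvement criterion described above, the following minimal sketch evaluates its standard closed form for a Gaussian prediction when maximizing. The function names, the `xi` margin parameter and the candidate format are assumptions made for this example only; the mean and standard deviation would normally come from a GP regressor, which is not included here.

```python
import math


def expected_improvement(mu, sigma, best, xi=0.0):
    """Closed-form EI of a Gaussian prediction N(mu, sigma^2) over the best value so far.

    `xi` is an optional (hypothetical) margin that encourages exploration.
    """
    improvement = mu - best - xi
    if sigma <= 0.0:
        # No predictive uncertainty: the improvement is deterministic.
        return max(improvement, 0.0)
    z = improvement / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return improvement * cdf + sigma * pdf


def pick_next(candidates, best):
    """Return the candidate x with the highest EI; candidates are (x, mu, sigma) triples."""
    return max(candidates, key=lambda c: expected_improvement(c[1], c[2], best))[0]
```

Note how a candidate with a lower predicted mean can still be selected when its predicted variance is large, which is the weighting by variance mentioned in the text.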
2 Reinforcement learning
Reinforcement learning [sutton_reinforcement_2018] is a sub-area of machine learning involving an autonomous agent that must control an external environment while learning a control policy that maximizes the reward received from such environment. Formally, it can be stated as a Markov Decision Process, $\langle S, A, R, P, \gamma \rangle$, where $S$ is a set of environmental states, $A$ is a set of actions available to the agent, $R$ is an external function that assigns the agent a reward for the state transition caused by the action $a$ taken at any state $s$, $P$ is a function that determines the probability that the agent transitions from a state $s$ to a state $s'$ when the action $a$ is taken, and finally, $\gamma$ is a real number that discounts the values of future rewards.
The control policy is defined as a function $\pi(a \mid s)$, and represents the probability of taking the action $a$ when the environment is in state $s$. With $\pi$, the agent aims to maximize the value function $V^{\pi}(s)$ for every state, defined as the expected reward starting from a given state $s$ at time-step $t$ and following a given policy $\pi$ thereafter. Formally, such function must satisfy the Bellman equation [sutton_reinforcement_2018]
$$V^{\pi}(s) = \sum_{a \in A} \pi(a \mid s) \sum_{s' \in S} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^{\pi}(s') \right].$$
A crucial aspect in RL is the trade-off between exploration and exploitation, in which the agent has to choose between taking actions that are considered to be the best according to the current estimate of the optimal policy, or taking actions that are deemed sub-optimal but make room for the agent to discover better actions to exploit in the future.
Among basic RL algorithms, the most commonly used are Q-Learning [watkins_q-learning_1992] and SARSA [rummery_-line_1994]. Both algorithms compute the action-value function $Q(s, a)$ according to a temporal difference between the received reward plus the discounted $Q$-value of the next state $s'$ and action $a'$, and the $Q$-value for the current state and chosen action. The difference between Q-Learning and SARSA is how they choose the next action $a'$: the latter selects the action based on the policy $\pi$, and thus it is an on-policy algorithm, whereas the former selects the best estimated action for the resulting state $s'$, and is therefore considered off-policy. Algorithms may also update the values of past states and actions that were responsible for reaching the current state, using a mechanism known as eligibility traces [sutton_reinforcement_2018].
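The difference between the two update rules can be sketched in a few lines. This is a minimal tabular example under assumed names (`td_update`, a dictionary-based table `Q`, learning rate `alpha`, discount `gamma`); it omits eligibility traces and the handling of terminal states, and is not the implementation used in the experiments.

```python
from collections import defaultdict


def td_update(Q, s, a, r, s_next, a_next, actions, alpha, gamma, algorithm="q_learning"):
    """Apply one temporal-difference update to the table Q and return the TD error."""
    if algorithm == "q_learning":
        # Off-policy: bootstrap from the best estimated action in the next state.
        next_value = max(Q[(s_next, b)] for b in actions)
    elif algorithm == "sarsa":
        # On-policy: bootstrap from the action actually selected by the policy.
        next_value = Q[(s_next, a_next)]
    else:
        raise ValueError(f"unknown algorithm: {algorithm}")
    td_error = r + gamma * next_value - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


# Usage: the same transition yields different targets under each rule.
Q = defaultdict(float)
Q[(1, 0)], Q[(1, 1)] = 1.0, 2.0
td_update(Q, s=0, a=0, r=1.0, s_next=1, a_next=0, actions=(0, 1), alpha=0.5, gamma=0.9)
```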
To balance exploitation and exploration in this work, the $\epsilon$-greedy policy is used, where the best action is chosen with probability $1 - \epsilon$, and the alternative actions are chosen at random with a low probability $\epsilon$. Alternatively, the Softmax policy is used, where each action is selected with probability $P(a \mid s) = e^{Q(s, a)/\tau} / \sum_{b} e^{Q(s, b)/\tau}$, where $\tau$ is a hyper-parameter that defines the influence of the $Q$-values on the action selection probabilities.
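Both exploration policies can be sketched as follows. The function names are illustrative, and the Softmax variant assumes the standard Boltzmann form, with selection probabilities proportional to exp(Q/tau). Following the description above, the exploratory branch of the epsilon-greedy policy samples only among the non-greedy actions; many implementations instead sample uniformly over all actions.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick the greedy action with probability 1 - epsilon, another action otherwise."""
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    others = [a for a in range(len(q_values)) if a != best]
    if others and rng.random() < epsilon:
        return rng.choice(others)
    return best


def softmax_probabilities(q_values, tau):
    """Boltzmann probabilities; subtracting the maximum avoids overflow in exp."""
    m = max(q_values)
    exps = [math.exp((q - m) / tau) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_policy(q_values, tau, rng=random):
    """Sample an action index according to the Boltzmann probabilities."""
    probs = softmax_probabilities(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

A small temperature concentrates the probability mass on the best action, while a large one makes the selection close to uniform.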
Each of the RL algorithms and policies has its own set of hyper-parameters that must be defined before the agent starts learning. Common hyper-parameters include a learning rate $\alpha$ that determines the speed of convergence of the agent, an exploration rate $\epsilon$ if the policy is $\epsilon$-greedy, and a discount factor $\gamma$ for future rewards (see [sutton_reinforcement_2018] for a more detailed description). If the policy used is $\epsilon$-greedy, an additional hyper-parameter known as the $\epsilon$-decay rate can be used, which reduces the value of $\epsilon$ after an episode, in order to lower the exploration rate of the agent after a given number of episodes has been experienced.
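The text does not specify the form of the decay. One common choice, assumed here purely for illustration, is a multiplicative schedule applied once per episode, optionally bounded below by a minimum exploration rate (the `epsilon_min` parameter is hypothetical).

```python
def decayed_epsilon(epsilon0, decay_rate, episode, epsilon_min=0.0):
    """Exploration rate after `episode` episodes under an assumed multiplicative decay."""
    return max(epsilon_min, epsilon0 * decay_rate ** episode)


# Usage: exploration shrinks geometrically as episodes accumulate.
schedule = [decayed_epsilon(1.0, 0.5, k) for k in range(4)]
```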
3 Two-tier hierarchical Bayesian optimization of RL hyper-parameters
In this work, a method is proposed that employs Bayesian optimization to perform a two-tier optimization of both structural and solution-level hyper-parameters of an RL agent. The objective function proposed in this work, $f$, maps a set $\theta$ of both structural and solution-level hyper-parameters of the algorithm to a real number that measures the overall performance of the learning agent, assuming an episodic task. A novel aspect of this approach is that it combines Bayesian optimization for both categorical and real-valued hyper-parameters. For the categorical hyper-parameters, Bayesian optimization of combinatorial structures (BOCS) [baptista_bayesian_2018] was used, where the categorical hyper-parameters are taken as binary variables and the maximum of the acquisition function is found through simulated annealing, instead of relying on Gaussian process regression [baptista_bayesian_2018]. On the other hand, the RLOpt [barsce_towards_2018] approach was used for the real-valued hyper-parameters, in which the objective function is modeled with a Gaussian process, specified by a mean function and a covariance function. In order to calculate the value of $f$ for a given $\theta$, an RL agent is instantiated in a certain environment with hyper-parameters $\theta$, and it is set to run for a certain number of episodes in order to learn a policy that maximizes its received reward. Whenever the agent is assigned a new vector $\theta$, it resets all its prior knowledge about the policy in order to make a fresh start, unbiased by the prior hyper-parameter settings. The instance where an RL agent runs a certain number of episodes under the same hyper-parameter setting is called a meta-episode.
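The notion of a meta-episode can be made concrete with a small sketch. Everything below is an assumption made for illustration: the `Corridor` toy task, the tabular Q-learning agent, the dictionary keys of `theta`, and the use of the average return per episode as the performance measure. The point it shows is that each evaluation of the objective starts from a fresh agent, so nothing learned under a previous setting carries over.

```python
import random
from collections import defaultdict


class Corridor:
    """Hypothetical toy episodic task: walk right from cell 0 to reach cell 4."""

    n_actions = 2  # 0 = left, 1 = right

    def reset(self):
        self.pos, self.t = 0, 0
        return self.pos

    def step(self, action):
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        self.t += 1
        done = self.pos == 4 or self.t >= 20
        reward = 1.0 if self.pos == 4 else -0.1
        return self.pos, reward, done


def run_meta_episode(theta, n_episodes=50, seed=0):
    """Evaluate one hyper-parameter setting: average return over n_episodes."""
    rng = random.Random(seed)
    env = Corridor()
    Q = defaultdict(float)  # fresh table: no knowledge from earlier settings
    actions = range(env.n_actions)
    total = 0.0
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            if rng.random() < theta["epsilon"]:
                a = rng.randrange(env.n_actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s_next, r, done = env.step(a)
            # Q-learning target; no bootstrapping once the episode has ended.
            target = r if done else r + theta["gamma"] * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += theta["alpha"] * (target - Q[(s, a)])
            total += r
            s = s_next
    return total / n_episodes


# Usage: each call is one meta-episode for the given setting.
score = run_meta_episode({"alpha": 0.5, "gamma": 0.9, "epsilon": 0.1})
```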
This method involves the assumption that RL hyper-parameters are related in a two-level hierarchy that takes into account their levels of abstraction. In such a relationship, structural hyper-parameters such as the exploration policy (e.g. Softmax or $\epsilon$-greedy [sutton_reinforcement_2018]) are at a higher level of abstraction than the solution-level hyper-parameters (e.g. the temperature $\tau$ or the exploration rate $\epsilon$), because the former establish the possible values for the latter. Following such an assumption, in the proposed method the structural hyper-parameters are optimized first, while the solution-level hyper-parameters are held at a set of prior values, and each evaluated point is stored together with its corresponding output in an initial data set. Once a certain number of meta-episodes have elapsed, the best structural hyper-parameters are kept frozen and the optimization of the hyper-parameters that depend on them is started, storing its results in a second data set. The method for such optimization is stated in Algorithm 1.
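The two-tier procedure described above can be outlined as follows. This is only a structural sketch: plain random sampling stands in for both BOCS (first tier) and GP-based Bayesian optimization (second tier), and the search space, the prior values and the toy objective are hypothetical. It is not a reproduction of Algorithm 1.

```python
import random

# Hypothetical search space and prior values, for illustration only.
STRUCTURAL_SPACE = {
    "algorithm": ["q_learning", "sarsa"],
    "policy": ["epsilon_greedy", "softmax"],
}
PRIOR_REAL = {"alpha": 0.1, "gamma": 0.9, "exploration": 0.1}


def two_tier_search(f, n_structural, n_real, seed=0):
    """Optimize structural hyper-parameters first, then the real-valued ones."""
    rng = random.Random(seed)
    d_structural, d_real = [], []

    # Tier 1: vary the structure while the real-valued part stays at its prior.
    for _ in range(n_structural):
        structure = {k: rng.choice(v) for k, v in STRUCTURAL_SPACE.items()}
        d_structural.append((structure, f(structure, PRIOR_REAL)))
    best_structure = max(d_structural, key=lambda pair: pair[1])[0]

    # Tier 2: freeze the best structure and tune the real-valued part.
    for _ in range(n_real):
        real = {
            "alpha": rng.uniform(0.01, 1.0),
            "gamma": rng.uniform(0.5, 1.0),
            # Read as epsilon or tau depending on the frozen policy.
            "exploration": rng.uniform(0.01, 1.0),
        }
        d_real.append((real, f(best_structure, real)))
    best_real = max(d_real, key=lambda pair: pair[1])[0]
    return best_structure, best_real, d_structural, d_real


def toy_objective(structure, real):
    """Stand-in for a meta-episode evaluation; higher is better."""
    bonus = 1.0 if structure["algorithm"] == "sarsa" else 0.0
    bonus += 0.5 if structure["policy"] == "softmax" else 0.0
    return bonus - (real["alpha"] - 0.3) ** 2 - (real["gamma"] - 0.95) ** 2


# Usage: a budget of 30 evaluations split between the two tiers.
result = two_tier_search(toy_objective, n_structural=10, n_real=20)
```

In an actual setting, each call to `f` would correspond to one meta-episode, and each tier would use the observations it has stored so far to propose the next point.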
4 Computational experiments
The proposed approach is validated in a discretized version of the classic Cart-pole control environment, which consists of a cart that moves either left or right while holding a pole that can swing in both directions. The objective for the cart is to keep the pole balanced (i.e. to prevent it from reaching a position from which it will fall to the floor), while keeping itself within certain limits. Each episode terminates when the pole deviates more than 12 degrees from the vertical position in either direction, when the cart moves beyond a distance of 2.4 units from the center, or when 200 time-steps have elapsed. A reward of +1 is given after every time-step in which the pole is still upright, and a reward of -200 is assigned to the agent whenever the pole has fallen. The environment implementation used was the one provided by OpenAI Gym [brockman_openai_2016].
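A possible way to discretize the continuous Cart-pole observation is sketched below. The position and angle limits follow the termination conditions stated above; the velocity limits, the equal-width binning and the function names are assumptions made for this example, since the text does not describe the discretization scheme in detail.

```python
import math

# (low, high) per observation component: cart position, cart velocity,
# pole angle (radians), pole angular velocity. Velocity limits are assumed.
BOUNDS = [
    (-2.4, 2.4),
    (-3.0, 3.0),
    (-math.radians(12), math.radians(12)),
    (-3.5, 3.5),
]


def discretize(value, low, high, n_bins):
    """Map a continuous value to one of n_bins equal-width bins, clipping at the limits."""
    if value <= low:
        return 0
    if value >= high:
        return n_bins - 1
    index = int((value - low) / (high - low) * n_bins)
    return min(index, n_bins - 1)


def discretize_state(obs, n_cart_bins, n_pole_bins):
    """Cart position and speed share one bin count; pole angle and its speed share another."""
    bins = (n_cart_bins, n_cart_bins, n_pole_bins, n_pole_bins)
    return tuple(discretize(x, lo, hi, n) for x, (lo, hi), n in zip(obs, BOUNDS, bins))
```

The resulting tuple can be used directly as a key in a tabular action-value function, and the two bin counts become hyper-parameters that control the resolution of the state space.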
The proposed Algorithm 1 is compared against two of the most common methods for hyper-parameter tuning: random search and Bayesian optimization. To optimize RL hyper-parameters with the latter, the RLOpt framework [barsce_towards_2018] is used. A total of 30 meta-episodes was used for each of the three approaches, where the average reward was used to compute $f$ on each meta-episode. In the proposed algorithm, 10 meta-episodes were used to optimize the discrete hyper-parameters and 20 meta-episodes were used to optimize the real-valued hyper-parameters once the structural hyper-parameters were fixed. The structural hyper-parameters optimized were , , and (it only applies when the $\epsilon$-greedy policy is selected). On the other hand, the algorithm hyper-parameters optimized were , , , the number of bins that divide the cart position and speed, and the number of bins used to discretize the pole angle and its speed.
In the resulting comparison plot, the thick lines and the curves around them correspond to the average and the 95% confidence interval over ten simulations with different random seeds. As can be seen, the proposed method is consistently better, on average, at finding a set of hyper-parameters that reaches the maximum than the other two methods, which do not optimize the structural hyper-parameters. In Fig. 2, it can be seen that the average cumulative reward of the proposed method starts to decrease after a first initial convergence, in which the maximum was found in all ten executions. The average execution time was 7, 8 and 12 minutes for random search, RLOpt, and the proposed optimizer, respectively.
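The text does not state how the confidence intervals were computed. As a hedged illustration, the sketch below uses the usual interval around the sample mean with a normal critical value of 1.96; for only ten runs, a Student-t critical value (about 2.262 for nine degrees of freedom) is a common alternative and can be passed through the assumed `critical_value` parameter.

```python
import math
import statistics


def mean_and_ci95(samples, critical_value=1.96):
    """Return (mean, lower, upper) of an approximate 95% interval for the mean."""
    n = len(samples)
    mean = statistics.fmean(samples)
    half_width = critical_value * statistics.stdev(samples) / math.sqrt(n)
    return mean, mean - half_width, mean + half_width


# Usage: one value per random seed at a given meta-episode.
mean, lower, upper = mean_and_ci95([1.0, 2.0, 3.0, 4.0, 5.0])
```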
5 Concluding remarks
In this work, a novel approach to the optimization of both categorical and real-valued RL hyper-parameters, assuming a hierarchical relationship between them, was presented. The validation in the Cart-pole environment highlights that the proposed approach performs consistently better than the monolithic optimization of the real-valued hyper-parameters alone. Our current research efforts are focused on extending the concept of a hierarchical relationship to many hyper-parameters, on the optimization of complex computational structures such as deep neural networks, and on the use of methods such as power analysis to determine whether the sample size of meta-episodes must be increased, among others.