Towards White-box Benchmarks for Algorithm Control

by André Biedenkapp, et al.
University of Freiburg

The performance of many algorithms in the fields of hard combinatorial problem solving, machine learning or AI in general depends on tuned hyperparameter configurations. Automated methods have been proposed to relieve users of the tedious and error-prone task of manually searching for performance-optimized configurations across a set of problem instances. However, there is still a lot of untapped potential in adjusting an algorithm's hyperparameters online, since different hyperparameter values are potentially optimal at different stages of the algorithm. We formulate the problem of adjusting an algorithm's hyperparameters for a given instance on the fly as a contextual MDP, making reinforcement learning (RL) the prime candidate to solve the resulting algorithm control problem in a data-driven way. Furthermore, inspired by applications of algorithm configuration, we introduce new white-box benchmarks suitable to study algorithm control. We show that on short sequences, algorithm configuration is a valid choice, but that with increasing sequence length a black-box view on the problem quickly becomes infeasible and RL performs better.



1 Introduction

To achieve peak performance of an algorithm, it is often crucial to tune its hyperparameters. Manually searching for performance-optimizing hyperparameter configurations is a complex and error-prone task. General algorithm configuration tools [Ansótegui et al.2009, Hutter et al.2011, López-Ibáñez et al.2016] have been proposed to free users from the manual search for well-performing hyperparameters. Such tools have been successfully applied to state-of-the-art solvers of various problem domains, such as mixed integer programming [Hutter et al.2010], AI planning [Fawcett et al.2011], machine learning [Snoek et al.2012], or propositional satisfiability solving [Hutter et al.2017].

One drawback of algorithm configuration, however, is that it only yields a fixed configuration that is used during the entire solution process of the optimized algorithm. It does not take into account that most algorithms used in machine learning, satisfiability solving (SAT), AI planning, reinforcement learning or AI in general are iterative in nature. These tools thereby ignore that the optimal hyperparameter configuration may be non-stationary, i.e., may change over the course of a run.

We propose a general framework to learn to control algorithms, which we dub algorithm control. We formulate the problem of learning dynamic algorithm control policies with respect to an algorithm's hyperparameters as a contextual Markov decision process (MDP) and apply reinforcement learning to it. Prior work that considered online tuning of algorithms did not explicitly take problem instances into account [Battiti and Campigotto2011] and did not pose this problem as a reinforcement learning problem [Adriaensen and Nowé2016]. To address these missing but important components, we introduce three new white-box benchmarks suitable for algorithm control. On these benchmarks we show that, using reinforcement learning, we are able to successfully learn dynamic configurations across instance sets directly from data, yielding better performance than static configurations.

Specifically, our contributions are as follows:

  1. We describe controlling algorithm hyperparameters as a contextual MDP, allowing for the notion of instances;

  2. We show that black-box algorithm configuration is a well-performing option for learning short policies;

  3. We demonstrate that, with increasing policy length, even in the homogeneous setting, traditional algorithm configuration becomes infeasible;

  4. We propose three new white-box benchmarks that allow studying algorithm control across instances;

  5. We demonstrate that we can learn dynamic policies across a set of instances showing the robustness of applying RL for algorithm control.

2 Related Work

Since algorithm configuration by itself struggles with heterogeneous instance sets (in which different configurations work best for different instances), it was combined with algorithm selection [Rice1976] to search for multiple well-performing configurations and select which of these to apply to new instances [Xu et al.2010, Kadioglu et al.2010]. For each problem instance, this more general form of per-instance algorithm configuration still uses fixed configurations. However, for different AI applications, dynamic configurations can be more powerful than static ones. A prominent example of hyperparameters that need to be controlled over time is the learning rate in deep learning: a static learning rate can lead to sub-optimal training results and training times [Moulines and Bach2011]. To facilitate fast training and convergence, various learning rate schedules or adaptation schemes have been proposed [Schaul et al.2013, Kingma and Welling2014, Singh et al.2015, Daniel et al.2016, Loshchilov and Hutter2017]. Most of these methods, however, are not data-driven.

In the context of evolutionary algorithms, various online hyperparameter adaptation methods have been proposed [Karafotias et al.2015, Doerr and Doerr2018]. These methods, however, are often tailored to one individual problem or rely on heuristics. These adaptation methods are only rarely learned in a data-driven fashion [Sakurai et al.2010].

Reactive search [Battiti et al.2008] uses handcrafted heuristics to adapt an algorithm's parameters online. To adapt such heuristics to the task at hand, hyper-reactive search [Ansótegui et al.2017] parameterizes these heuristics and applies per-instance algorithm configuration.

The work we present here can be seen as orthogonal to work presented under the heading of learning to learn [Andrychowicz et al.2016, Li and Malik2017, Chen et al.2017]. Both lines of work intend to learn optimal instantiations of algorithms during the execution of said algorithm. The goal of learning to learn, however, is to learn an update rule in the problem space directly, whereas the goal of algorithm control is to indirectly influence the update by adjusting the hyperparameters used for that update. By exploiting existing manually-derived algorithms and only controlling their hyperparameters well, algorithm control may be far more sample efficient and generalize much better than directly learning algorithms entirely from data.

3 Algorithm Control

In this section we show how algorithm control (i.e., algorithm configuration per time-step) can be formulated as a sequential decision making process. Using this process, we can learn a policy to configure an algorithm's hyperparameters on the fly, using reinforcement learning (RL).

3.1 Learning to Control Algorithms

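We begin by formulating algorithm control as a Markov Decision Process (MDP) $\mathcal{M} := (\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R})$. An MDP is a 4-tuple, consisting of a state space $\mathcal{S}$, an action space $\mathcal{A}$, a transition function $\mathcal{T}$ and a reward function $\mathcal{R}$.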

State Space

At each time-step $t$, in order to make informed choices about the hyperparameter values to choose, the controller needs to be informed about the internal state $s_t$ of the algorithm being controlled. Many algorithms collect various statistics that are available at each time-step. For example, a SAT solver might track how many clauses are satisfied at the current time-step. Such statistics are suitable to inform the controller about the current behaviour of the algorithm.

Action Space

Given a state $s_t$, the controller has to decide how to change the value of a hyperparameter or directly assign a value to that hyperparameter, out of a range of valid choices. This gives rise to the overall action space $\mathcal{A}$ over the hyperparameters of the algorithm at hand.

Transition Function

The transition function describes the dynamics of the system at hand. For example, the probability of reaching state $s_{t+1}$ after applying action $a_t$ in state $s_t$ can be expressed as $\mathcal{T}(s_t, a_t, s_{t+1})$. For simple algorithms and a small instance space, it might be possible to derive the transition function directly from the source code of the algorithm. However, we assume that the transition function cannot be explicitly modelled for interesting algorithms. Even if the dynamics are not modelled, RL can be used to learn an optimizing policy directly from observed transitions and rewards.

Reward Function

In order for the controller to learn which actions are better suited for a given state, the controller receives a reward signal $r_t$. In many RL domains the reward is sparse, i.e., only very few state-action pairs result in an immediate reward signal. If an algorithm already estimates the distance to some goal state well, such statistics might be suitable candidates for the reward signal, with the added benefit that such a reward signal is dense.

Learning policies

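Given the MDP $\mathcal{M}$, the goal of the controller is to search for a policy $\pi^*$ such that

$$\pi^*(s) \in \arg\max_{a \in \mathcal{A}} \; \mathcal{R}(s, a) + Q_{\pi^*}(s, a),$$

where $Q_{\pi}$ is the action-value function, giving the expected discounted future reward, starting from state $s$, applying action $a$ and following policy $\pi$ with discounting-factor $\gamma$.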

3.2 Learning to Control across Instances

Algorithms are most often tasked with solving varied problem instances from the same or similar domains. Searching for well-performing hyperparameter settings on only one instance might lead to strong performance on that instance but might not generalize to new instances. In order to facilitate generalization of algorithm control, we explicitly take problem instances into account. The formulation of algorithm control given above does not take instances into account, treating the problem of finding well-performing hyperparameters as independent of the problem instance.

To allow for algorithm control across instances, we formulate the problem as a contextual Markov Decision Process $\mathcal{M}_i := (\mathcal{S}, \mathcal{A}, \mathcal{T}_i, \mathcal{R}_i)$, for a given instance $i \in \mathcal{I}$. This notion of context induces multiple MDPs with shared action and state spaces, but with different transition and reward functions. In the following, we describe how the context influences the parts of the MDP.


The controller's goal is to learn a policy that can be applied to various problem instances $i$ out of a set of instances $\mathcal{I}$. We treat the instance at hand as context to the MDP. Figure 1 outlines the interaction between controller and algorithm in that setting. Given an instance $i$, at time-step $t$, the controller applies action $a_t$ to the algorithm, i.e., setting hyperparameter $h$ to value $v$. Given this input, the algorithm advances to state $s_{t+1}$, producing a reward signal $r_{t+1}$, based on which the controller will make its next decision. The instance stays fixed during the algorithm run.


Figure 1: Control of hyperparameter $h$ of an algorithm on a given contextual instance $i$, at time-step $t$. Until the instance is solved or a maximum budget is reached, the controller decides which value $v$ to apply to hyperparameter $h$ based on the internal state $s_t$ of the algorithm on the given instance $i$.
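
The interaction described above can be sketched as a plain control loop. The following is a minimal illustration, not the authors' implementation: `ToyAlgorithm`, its reward rule and the `run_episode` helper are hypothetical names introduced only to show how the instance acts as fixed context while the controller sets a hyperparameter value at every time-step.

```python
from typing import Callable, Tuple

State = Tuple[int, int]  # (time-step, instance id); a hypothetical state encoding


class ToyAlgorithm:
    """Hypothetical target algorithm whose reward depends on the instance."""

    def __init__(self, instance: int, budget: int = 5) -> None:
        self.instance = instance  # context: stays fixed during a run
        self.budget = budget      # maximum number of control steps
        self.t = 0

    def reset(self) -> State:
        self.t = 0
        return (self.t, self.instance)

    def step(self, action: int) -> Tuple[State, float, bool]:
        # Assumed toy reward: 1 if the chosen hyperparameter value matches
        # an instance-dependent target for this time-step, 0 otherwise.
        target = (self.t + self.instance) % 2
        reward = 1.0 if action == target else 0.0
        self.t += 1
        done = self.t >= self.budget
        return (self.t, self.instance), reward, done


def run_episode(policy: Callable[[State], int], instance: int) -> float:
    """Run one controlled algorithm run and return the accumulated reward."""
    algorithm = ToyAlgorithm(instance)
    state = algorithm.reset()
    total, done = 0.0, False
    while not done:
        action = policy(state)                        # controller picks a value
        state, reward, done = algorithm.step(action)  # algorithm advances
        total += reward
    return total
```

In this toy setting, a policy that reads both the time-step and the instance from the state collects the full reward on every instance, whereas a constant policy cannot.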

State and Action spaces

The space of possible states does not change when switching between instances from the same set, and is shared between all MDPs induced by the context. Thus we consider the same state features. To enrich the state space, we could also add instance-specific information, so-called instance features such as problem size, which could be useful in particular for heterogeneous instance sets [Leyton-Brown et al.2009, Schneider and Hoos2012].

Similar to the state space, the action space stays fixed for all MDPs induced by the context. The action space solely depends on the algorithm at hand and is thus shared across all MDPs of the same context.

Transition Function

Contrary to the state and action spaces, the transition function is influenced by the choice of the instance. For example, a search algorithm might be faced with completely different search spaces, where applying the same action could lead to different kinds of states.

Reward Function

As the transition function depends on the instance at hand, so does the reward function. Depending on the instance, transitions beneficial for the controller on one instance might become unfavorable or might punish the agent on another instance.

It is possible to choose a proxy reward function that is completely independent of the context, e.g., a negative reward for every step taken. This would incentivize the controller to learn a policy that quickly solves an instance, which would be interesting if the real objective is to minimize runtime. However, a controller using such a reward would potentially take very long to learn a meaningful policy, as the reward would not help it to easily distinguish between two observed states.

Learning policies across instances

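Given the contextual MDP and a set of instances $\mathcal{I}$, the goal of the controller is to find a policy $\pi^*$ such that

$$\pi^*(s) \in \arg\max_{a \in \mathcal{A}} \; \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \left( \mathcal{R}_i(s, a) + Q_{\pi^*}(s, a, i) \right),$$

where $Q_{\pi}$ is the action-value function, giving the expected discounted future reward, starting from state $s$, applying action $a$ and following policy $\pi$ on instance $i$ with discounting-rate $\gamma$.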

Relation to Algorithm Configuration and Selection

This formulation of algorithm control allows recovering algorithm configuration (AC) as a special case: in AC, the optimal policy would simply always return the same action, for each state and instance. Further, this formulation also allows recovering per-instance algorithm configuration (PIAC) as a special case: in PIAC, the policy would always return the same action for all states, but potentially different actions across different instances. Finally, algorithm selection (AS) is a special case of PIAC with a 1-dimensional categorical action space that merely chooses out of a finite set of algorithms.
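
The special cases above can be made concrete by viewing a policy as a function of state and instance. The snippet below is a small sketch under that assumption; the type alias `Policy` and the helper names `ac_policy` and `piac_policy` are hypothetical and only serve the illustration.

```python
from typing import Callable, Dict, Hashable

# A control policy maps (state, instance) to an action (a hyperparameter value).
Policy = Callable[[Hashable, Hashable], int]


def ac_policy(action: int) -> Policy:
    """Algorithm configuration: one action for every state and instance."""
    return lambda state, instance: action


def piac_policy(action_per_instance: Dict[Hashable, int]) -> Policy:
    """Per-instance configuration: the action depends on the instance only."""
    return lambda state, instance: action_per_instance[instance]
```

A general algorithm control policy is any function of this type that may also use the state; algorithm selection corresponds to `piac_policy` where the actions index a finite portfolio of algorithms.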

4 Benchmarks

To study the algorithm control setting we use two benchmarks already proposed by Adriaensen and Nowé [2016] and introduce three new benchmarks. Our proposed benchmarks increase the complexity of the optimal policy by either increasing the action space and policy length or including instances.


The first benchmark, Counting, introduced by Adriaensen and Nowé [2016] requires an agent to learn a monotonically increasing sequence. The agent only receives a reward if the chosen action has been selected at the corresponding time-step. This requires the agent to learn to count, where the size of the action space is equal to the sequence length. In the original setting of Adriaensen and Nowé [2016], agents need to learn to count to five, with the optimal policy resulting in a reward of five. The state is simply given by the history of the actions chosen so far.
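
As a rough sketch of the reward structure just described, assuming a reward of 1 per correctly placed action and 0-indexed actions and time-steps (both are assumptions made for this illustration; `counting_return` is a hypothetical helper):

```python
from typing import Sequence


def counting_return(actions: Sequence[int]) -> int:
    """Total reward of an action sequence on a Counting-style task.

    Assumption: the agent earns 1 whenever the action played at time-step t
    equals t (0-indexed), and 0 otherwise.
    """
    return sum(1 for t, action in enumerate(actions) if action == t)
```

Under these assumptions, counting to five (`[0, 1, 2, 3, 4]`) yields a return of five, while any static sequence that repeats one action is rewarded at exactly one time-step.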


The second benchmark, Fuzzy, introduced by Adriaensen and Nowé [2016] only features two actions. Action returns a fuzzy reward signal drawn from , whereas playing action terminates the sequence prematurely. The maximum sequence length used in Adriaensen and Nowé [2016] is with an expected reward of the optimal policy also being . Similar to the previous benchmark, Fuzzy does not include any state representation other than a history over the actions.


Similar to the already presented benchmarks, the newly proposed Luby (see Benchmark Outline 1) does not model instances explicitly. However, it increases the complexity of learning a sequence compared to the benchmarks by Adriaensen and Nowé [2016]. An agent is required to learn the values in a Luby sequence [Luby et al.1993], which is, for example, used for restarting SAT solvers. The sequence is $1, 1, 2, 1, 1, 2, 4, 1, 1, 2, 1, 1, 2, 4, 8, \ldots$; formally, the $t$-th value in the sequence can be computed as:

$$l_t = \begin{cases} 2^{k-1} & \text{if } t = 2^k - 1, \\ l_{t - 2^{k-1} + 1} & \text{if } 2^{k-1} \le t < 2^k - 1. \end{cases}$$

This gives rise to an action space for sequences of length with for all time-steps , with the action values giving the exponents used in the Luby sequence. For such a sequence, an agent can benefit from state information about the sequence, such as the length of the sequence. For example, imagine an agent has to learn the Luby sequence for length . Before time-step the action value would never have to be played. For a real algorithm to be controlled, such a temporal feature could be encoded by the iteration number directly or some other measure of progress. The state an agent can observe therefore consists of such a time feature and a small history over the five last selected actions.

Actions: for all ;
States: ;
for  do
       if  then
       end if
end for
Benchmark Outline 1 Luby
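
To make the target sequence of the Luby benchmark concrete, the sketch below computes Luby values and the corresponding exponents that the text describes as action values. The per-step reward of 1 for a matching exponent is an assumption made for illustration, and the helper names `luby_exponent` and `luby_return` are hypothetical.

```python
from typing import Sequence


def luby(t: int) -> int:
    """t-th value (1-indexed) of the Luby sequence 1, 1, 2, 1, 1, 2, 4, ..."""
    if t < 1:
        raise ValueError("t must be >= 1")
    k = 1
    while (1 << k) - 1 < t:          # smallest k with t <= 2^k - 1
        k += 1
    if t == (1 << k) - 1:            # t = 2^k - 1  ->  value 2^(k-1)
        return 1 << (k - 1)
    return luby(t - (1 << (k - 1)) + 1)  # otherwise recurse into the prefix


def luby_exponent(t: int) -> int:
    """Exponent e such that luby(t) == 2**e; used here as the target action."""
    return luby(t).bit_length() - 1


def luby_return(actions: Sequence[int]) -> int:
    """Assumed reward: 1 for each time-step whose action equals the exponent."""
    return sum(1 for t, a in enumerate(actions, start=1) if a == luby_exponent(t))
```

For the first seven time-steps the target exponents are `0, 0, 1, 0, 0, 1, 2`, so larger action values only become relevant late in the sequence, which is why a time feature in the state is helpful.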


Benchmark Sigmoid (see Benchmark Outline 2) allows studying algorithm control across instances. Policies depend on the sampled instance $i$, which is described by a sigmoid that can be characterized through its inflection point and scaling factor. The state is constructed using a time feature, as well as this instance information (inflection point and scaling factor).

At each time-step an agent has to decide between two actions. The received reward when playing action is given by the function value of the sigmoid at time-step and otherwise. The scaling of the sigmoid function is sampled uniformly at random in the interval . The sign of the scaling factor determines if an optimal policy on the instance should begin by selecting action or . The inflection point is distributed according to and determines how often an action has to be repeated before switching to the other action. Figure 2 depicts rewards for two example instances. The sigmoid in Figure 2(a) is unshifted and unscaled, leading to an optimal policy of playing action for the first half of the sequence and for the rest of the sequence. In Figure 2(b) the sigmoid is shifted to the left such that the inflection point is at and scaled by factor . The optimal policy in this case is to play action for the first three steps and for the rest of the sequence.


Figure 2: Example rewards for Benchmark 2 with on both instances, on instance (a) and on instance (b). The solid line shows the received reward when playing action and the dashed line gives the reward for action . On the x-axis are the time steps and on the y-axis is the reward. On instance (a) it is preferable to select action for the first half of the sequence whereas on instance (b) it is better to start with action .
Actions: for all ;
States: ;
for  do
       if  then
       end if
end for
Benchmark Outline 2 Sigmoid
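
The following sketch illustrates how a sigmoid-shaped reward makes the best action depend on both the time-step and the instance. It rests on explicit assumptions that are not given in the text: action 1 is taken to receive the sigmoid value and action 0 its complement, and the functions `sigmoid`, `sigmoid_reward` and `greedy_policy` are hypothetical helpers.

```python
import math
from typing import List


def sigmoid(t: float, scaling: float, inflection: float) -> float:
    """Numerically stable sigmoid with a given scaling and inflection point."""
    z = scaling * (t - inflection)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_reward(action: int, t: int, scaling: float, inflection: float) -> float:
    """Assumed reward: sigmoid value for action 1, its complement for action 0."""
    value = sigmoid(t, scaling, inflection)
    return value if action == 1 else 1.0 - value


def greedy_policy(horizon: int, scaling: float, inflection: float) -> List[int]:
    """Per-step best action under the assumed reward (ties go to action 1)."""
    return [
        1 if sigmoid_reward(1, t, scaling, inflection)
        >= sigmoid_reward(0, t, scaling, inflection) else 0
        for t in range(horizon)
    ]
```

With a positive scaling factor the best sequence starts with one action and switches at the inflection point; flipping the sign of the scaling factor reverses the order, which is the information a black-box optimizer without access to the state cannot exploit.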


Benchmark SigmoidMVA (see Benchmark Outline 3) further increases the complexity of learning across instances by translating the setting of Sigmoid into a multi-valued action setting. An agent not only has to learn a simple policy switching between two actions, but also has to learn to follow the shape of the sigmoid function used to compute the reward. The available actions an agent can choose from at each time-step are . Note that, depending on the granularity of the discretization (determined by ) the agent can follow the sigmoid more or less closely (thereby directly affecting its reward).

Actions: for all ;
States: ;
for  do
end for
Benchmark Outline 3 SigmoidMVA

5 Algorithms to be Considered

In this section we discuss the agents we want to evaluate for the task of algorithm control. We first discuss how to apply standard black-box optimization to the task of algorithm control. We then present agents that are capable of taking state information into account.

5.1 Black-Box Optimizer

In a standard black-box optimization setting, the optimizer interacts with an intended target by setting the configuration of the target at the beginning and waiting until the target returns the final reward signal. This is, e.g., the case in algorithm configuration. The same setup can be easily extended to search for sequences of configurations for online configuration of the target. Instead of setting a hyperparameter once, the optimizer would have to set a sequence of hyperparameter values, once per time-step at which the target should switch its configuration. For sequences with such change points and large , this drastically increases the configuration space, since the optimizer would need to treat the value at each change point as a separate hyperparameter. In addition, black-box optimizers cannot observe the state information, which is required to learn instance-specific policies.
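
The growth of the configuration space mentioned above follows from simple counting: if one of a fixed number of values has to be chosen independently at every change point, the number of candidate sequences is exponential in the number of change points. A small sketch, with hypothetical helper names:

```python
from itertools import product
from typing import Iterable, List, Tuple


def sequence_search_space(num_values: int, num_steps: int) -> int:
    """Number of distinct value sequences a black-box optimizer must consider."""
    return num_values ** num_steps


def enumerate_sequences(values: Iterable[int], num_steps: int) -> List[Tuple[int, ...]]:
    """Explicit enumeration; only feasible for very short sequences."""
    return list(product(values, repeat=num_steps))
```

For example, five values over five steps already give 3125 candidate sequences, whereas a static configuration of the same hyperparameter has only five candidates.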

5.2 Context-oblivious Agents

As a proof-of-concept, Adriaensen and Nowé [2016] introduce context-oblivious agents that can take state information into account when selecting which action to play next. In their experiments the only state information they took into account was the history of the actions.

To move their proposed agents from a black-box setting towards a white-box setting, during training the agents keep track of the number of times an action led from one state to another, as well as the average reward this transition produced. This tabular approach limits the agents to small state and action spaces. The proposed agents include:

  • URS: Selects an action uniformly at random.

  • PURS: Selects a previously not selected action uniformly at random. Otherwise, actions are selected in proportion to the expected number of remaining steps.

  • GR: Selects an action greedily based on the expected future reward.

During the evaluation phase, all agents greedily select the best action given the observations recorded during training.

URS and GR are equivalent to the two extremes of $\epsilon$-greedy Q-learning [Watkins and Dayan1992], with $\epsilon = 1$ and $\epsilon = 0$ respectively. PURS leverages information about the expected trajectory length, but it does not include the observed reward signal in the decision making process. For tasks where every execution path has the same length (e.g. Counting, Luby, Sigmoid and SigmoidMVA), PURS would fail to produce a policy other than a uniform random one. Further, when using PURS, we need to have some prior knowledge of whether shorter or longer trajectories should be preferred. For example, on benchmarks like Fuzzy, PURS is only able to find a meaningful policy if we know that longer sequences produce better rewards.

5.3 Reinforcement Learning

Reinforcement learning (RL) is a promising candidate to learn algorithm control policies in a data-driven fashion because we can formulate algorithm control as an MDP and we can sample a large number of episodes given enough compute resources. An RL agent repeatedly interacts with the target algorithm by choosing some configuration at a given time-step and observing the state transition as well as the reward. Then, the RL agent updates its belief state about how the target algorithm will behave when using the chosen configuration at that time-step. Through these interactions, over time, the agent can typically find a policy that yields higher rewards. For small action and state spaces, RL agents can be easily implemented using table lookups, whereas for larger spaces, function approximation methods can make learning feasible. We evaluated $\epsilon$-greedy Q-learning [Watkins and Dayan1992] in the tabular setting as well as DQN using function approximations [Mnih et al.2016].
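
A tabular $\epsilon$-greedy Q-learning agent of the kind mentioned above can be sketched in a few lines. The example below is illustrative only and makes several assumptions that are not taken from the experiments: it uses a Counting-style task with a reward of 1 when the action equals the 0-indexed time-step, it uses the time-step alone as state (instead of the action history), and the values of `epsilon`, `alpha` and `gamma` are placeholders.

```python
import random
from collections import defaultdict
from typing import DefaultDict, List, Tuple


def q_learning_counting(length: int = 5, episodes: int = 5000,
                        epsilon: float = 0.1, alpha: float = 1.0,
                        gamma: float = 0.99, seed: int = 0) -> List[int]:
    """Tabular epsilon-greedy Q-learning on a Counting-style toy task.

    Returns the greedy action for each time-step after training.
    """
    rng = random.Random(seed)
    q: DefaultDict[Tuple[int, int], float] = defaultdict(float)  # (t, a) -> Q
    actions = list(range(length))

    for _ in range(episodes):
        for t in range(length):
            # Epsilon-greedy action selection.
            if rng.random() < epsilon:
                action = rng.choice(actions)
            else:
                action = max(actions, key=lambda b: q[(t, b)])
            # Assumed reward: 1 if the action matches the time-step.
            reward = 1.0 if action == t else 0.0
            # Bootstrapped target; the episode ends after the last step.
            future = 0.0 if t == length - 1 else max(q[(t + 1, b)] for b in actions)
            q[(t, action)] += alpha * (reward + gamma * future - q[(t, action)])

    return [max(actions, key=lambda b: q[(t, b)]) for t in actions]
```

Setting `epsilon` to 1 gives a purely exploratory agent and setting it to 0 a purely greedy one, which mirrors the two extremes discussed in Section 5.2.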

6 Experimental Study

To compare black-box optimizers, context-oblivious agents [Adriaensen and Nowé2016] and reinforcement learning agents for algorithm control, we evaluated various agents on the benchmarks discussed above.

Figure 3: Results of SMAC, URS and tabular $\epsilon$-greedy Q-learning on a set of discrete benchmarks: (a) Counting, (b) Fuzzy, (c) Luby. The x-axis depicts the number of episodes seen during training and the y-axis the gained reward. The lines depict the gained reward for each agent when evaluating it after the given number of training episodes, with the solid line representing the mean reward over 25 repetitions and the shaded area the standard error. The presented lines are smoothed over a window of size 10.

6.1 Setup

We used SMAC [Hutter et al.2011] in version 3 (SMACv3 [Lindauer et al.2017]) as a state-of-the-art algorithm configurator and black-box optimizer. We implemented URS using simple tabular $\epsilon$-greedy Q-learning. We decided against using PURS as it would only be applicable to Fuzzy, see Section 5.2.

Q-learning based approaches were evaluated using a discounting factor of . On benchmarks with stochastic reward we set the learning rate to and to otherwise. The $\epsilon$-greedy agent was trained using a constant .

As Sigmoid has continuous state features we include Q-learning using function approximation in the form of a DQN [Mnih et al.2013] implemented in RLlib [Liang et al.2018]. We used the default configuration of the DQN in RLlib (0.6.6), i.e., a double dueling DQN where the target network is updated every 5 episodes and the exploration fraction of the DQN is linearly decreased from to . We only changed the number of hidden units to 50 and the training batch size and the timesteps per training iteration to the episode length such that in each training iteration only one episode is observed.

In each training iteration each agent observed a full episode. Training runs for all methods were repeated 25 times using different random seeds and each agent was evaluated after updating its policy. When evaluating on the deterministic benchmarks (Counting and Luby) only one evaluation run was performed. On the other benchmarks we performed evaluation runs of which we report the mean reward. When using a fixed instance set of size on Sigmoid and SigmoidMVA, we evaluated the agents once on each instance.

All experiments were run on a compute cluster with nodes equipped with two Intel Xeon E5-2630v4 and 128GB memory running CentOS 7. The results on the benchmarks that do not model problem instances (Counting, Fuzzy and Luby) are plotted in Figure 3. The results for benchmarks with instances (Sigmoid and SigmoidMVA) are shown in Figures 4 and 5.

6.2 Results


Figure 3(a) shows the evaluation results of SMAC, URS, as well as an $\epsilon$-greedy agent on Counting. The agents are tasked with learning a policy of length with for all . On this simple benchmark, SMAC outperforms both other methods and learns the optimal policy after observing approximately episodes. This is in contrast to Adriaensen and Nowé [2016], where on this benchmark they evaluated black-box optimization for a static policy producing constant reward . The $\epsilon$-greedy agent quickly learns policies in which out of choices are set correctly but needs to observe approximately episodes until it learns the optimal policy. URS's purely exploratory behaviour prohibits quick learning of simple policies, requiring close to episodes until it recovers the optimal policy.


The results of the agents' behaviours on Fuzzy with are presented in Figure 3(b) and extend the findings of Adriaensen and Nowé [2016]. In such a noisy setting, $\epsilon$-greedy Q-learning is faster than SMAC in learning the optimal policy, approaching it after roughly episodes. However, SMAC is still able to learn the optimal policy after more than episodes. URS has still not learned the optimal policy after episodes, only learning a policy that chooses action approximately times before choosing action .


Learning the optimal policy for Luby requires the agent to learn a policy of length with for all . The $\epsilon$-greedy agent already learns the optimal policy after observing about episodes. In roughly the same number of episodes SMAC found a policy in which half of the choices are set correctly, and after observing episodes it is able to find a policy that selects roughly of the actions correctly. URS was roughly times slower in learning a policy that achieves a reward of , selecting half of the actions correctly after roughly episodes (see Figure 3(c)); it reached its final performance 100 times slower than SMAC.

Figure 4: Comparison of SMAC, URS, tabular $\epsilon$-greedy Q-learning and a DQN on Sigmoid: (a) Sigmoid, (b) Sigmoid Training, (c) Sigmoid Test. The x-axis depicts the number of episodes seen during training and the y-axis the gained reward. The lines depict the gained reward for each agent when evaluating it after the given number of training episodes, with the solid line representing the mean reward over 25 training repetitions and the shaded area the standard error. To estimate the performance over the distribution of instances in (a), we sample 10 random sigmoid functions when evaluating the agents. In the case of (b) we evaluate the agents on all training instances. The presented lines are smoothed over a window of size 10. (a) depicts results over a distribution of instances, (b) depicts results over a fixed set of training instances and (c) the results on prior unseen test instances, evaluated every training episodes. For the tabular approaches the state values have been rounded to the closest integer.


Results considering instances are shown in Figure 4. In this setting the agents have to learn to adapt their policies to the presented Instance, with each policy of length and . For each episode an instance can be either directly sampled or taken out of a set of instances stemming from the same distribution. Therefore an agent that learns policies dependent on the task can achieve a maximal reward of and black-box optimizers can only achieve at most . To allow the tabular Q-learning approaches to work on this continuous state-space we round the scaling factor and inflection point to the closest integer values.

Figure (a)a shows the gained reward of the agents when randomly drawing new instances in each training iteration. We can observe that all agents received a reward of roughly for randomly selecting which actions to play. The DQN quickly began to learn faster than either of the tabular approaches, receiving a reward of before -greedy and URS begin to learn an improving policy. After roughly training episodes the DQN learns policies that adapt to the instance at hand, whereas the -greedy agent gets stuck in a local optimum. Due to being completely exploratory, URS does not exhibit the same behaviour and can continue to improve its policy before learning the optimal policy after roughly training episodes. SMAC is unable to find a policy that is able to adapt to the instances at hand. This is due to the optimizer not being able to distinguish between a positive and negative slope of the sigmoid. Therefore, it cannot decide if it should start a policy with action or before switching to the other. Furthermore, the agent does not know the inflection point and can only guess when to switch from one action to the other.

It is most often not the case that we have an entire distribution of instances at our disposal, but only a finite set of instances sampled from an unknown distribution. To include this setting in our evaluation, we sampled training and test instances from the same distribution used before. The reported results here give the performance across this whole training or test set. On the training set, the results for the DQN as well as SMAC look very similar to the results for the distribution of instances. However, both tabular agents learn much faster since the possible state-space is much smaller. Similarly to the results on the distribution of instances, the -greedy agent gets stuck in a local optimum for some time before escaping it again. The purely exploratory approach of URS prevents it from getting stuck in local optima.

On the test instances, the tabular agents are incapable of generalization (see Figure 4(c)), whereas the DQN, using function approximation, is able to generalize. At first, the DQN overfits on a few training instances, before it learns a robust policy for many training instances that generalizes to the test instances.

Figure 5: Comparison of the agents on SigmoidMVA. The x- and y-axes show the number of episodes and the gained reward, respectively. The lines depict the reward for each agent when evaluating it after the given number of training episodes, where the line represents the mean reward over 25 training repetitions and the shaded area the standard error, smoothed over a window of size 10.
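The aggregation described in the caption can be sketched as below. This is only an illustration under stated assumptions: the helper names are hypothetical, the smoothing is implemented as a trailing moving average applied after averaging across repetitions, and the toy data uses 3 repetitions instead of 25; the source does not specify these details.

```python
import math
import statistics


def mean_and_standard_error(runs):
    """Per-episode mean and standard error across repetitions.

    `runs` is a list of equal-length reward curves, one per repetition.
    """
    n = len(runs)
    means, errors = [], []
    for rewards_at_episode in zip(*runs):
        means.append(statistics.fmean(rewards_at_episode))
        # Sample standard deviation divided by sqrt(n) gives the standard error.
        errors.append(statistics.stdev(rewards_at_episode) / math.sqrt(n))
    return means, errors


def moving_average(values, window=10):
    """Trailing moving average; early points average the values seen so far."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Toy data: 3 repetitions with 4 evaluation points each.
runs = [[0.0, 1.0, 2.0, 3.0], [2.0, 3.0, 4.0, 5.0], [4.0, 5.0, 6.0, 7.0]]
means, errors = mean_and_standard_error(runs)
smoothed_means = moving_average(means, window=2)
```

The mean curve would be drawn as the line and `means ± errors` as the shaded band.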


The results on SigmoidMVA are shown in Figure 5. As on Sigmoid, agents need to adapt their policy to a sampled instance, but now on an extended action space of size . Again, URS benefits from its random sampling behavior, whereas the ε-greedy agent needs to observe roughly episodes before improving over a random policy. Without any state information, SMAC struggles to find a meaningful policy, whereas the DQN is capable of adjusting its policy to the instance at hand even on this higher-dimensional space.

7 Conclusion

To the best of our knowledge, we are the first to formalize the algorithm control problem as a contextual MDP, explicitly taking problem instances into account. To study different types of agents for the problem of algorithm control with instances, we presented new white-box benchmarks. Using these benchmarks, we showed that black-box optimization is a feasible candidate for learning policies on simple action spaces. With increasing complexity of the optimal policy, however, black-box optimizers struggle to learn such a policy. In contrast, reinforcement learning is a suitable candidate for learning more complex sequences. If heterogeneous instances are considered, black-box optimizers might struggle to learn any policy that is better than a random policy. RL agents that make use of state information, on the other hand, are able to adapt their policies to the problem instance, which demonstrates the potential of applying RL to algorithm control.

The presented white-box benchmarks are a first step towards scenarios resembling real algorithm control for hard combinatorial problem solvers on a set of instances. In future work, we plan to extend our benchmarks to mixed spaces of categorical and continuous hyperparameters and to conditional dependencies. Furthermore, we plan to train cheap-to-evaluate surrogate benchmarks based on data gathered from real algorithm runs [Eggensperger et al.2018].


The authors acknowledge funding by the Robert Bosch GmbH, support by the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grant no. INST 39/963-1 FUGG.


  • [Adriaensen and Nowé2016] S. Adriaensen and A. Nowé. Towards a white box approach to automated algorithm design. In Proc. of IJCAI’16, pages 554–560, 2016.
  • [Andrychowicz et al.2016] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas. Learning to learn by gradient descent by gradient descent. In Proc. of NIPS’16, pages 3981–3989, 2016.
  • [Ansótegui et al.2009] C. Ansótegui, M. Sellmann, and K. Tierney. A gender-based genetic algorithm for the automatic configuration of algorithms. In Proc. of CP’09, pages 142–157, 2009.
  • [Ansótegui et al.2017] C. Ansótegui, J. Pon, M. Sellmann, and K. Tierney. Reactive dialectic search portfolios for MaxSAT. In Proc. of AAAI’17, 2017.
  • [Battiti and Campigotto2011] R. Battiti and P. Campigotto. An investigation of reinforcement learning for reactive search optimization. In Autonomous Search, pages 131–160. Springer, 2011.
  • [Battiti et al.2008] R. Battiti, M. Brunato, and F. Mascia. Reactive search and intelligent optimization, volume 45. Springer Science & Business Media, 2008.
  • [Chen et al.2017] Y. Chen, M. W. Hoffman, S. G. Colmenarejo, M. Denil, T. P. Lillicrap, M. Botvinick, and N. De Freitas. Learning to learn without gradient descent by gradient descent. In Proc. of ICML’17, pages 748–756, 2017.
  • [Daniel et al.2016] C. Daniel, J. Taylor, and S. Nowozin. Learning step size controllers for robust neural network training. In Proc. of AAAI’16, 2016.
  • [Doerr and Doerr2018] B. Doerr and C. Doerr. Theory of parameter control for discrete black-box optimization: Provable performance gains through dynamic parameter choices. arXiv:1804.05650, 2018.
  • [Eggensperger et al.2018] K. Eggensperger, M. Lindauer, H. H. Hoos, F. Hutter, and K. Leyton-Brown. Efficient benchmarking of algorithm configurators via model-based surrogates. Machine Learning, 107(1):15–41, 2018.
  • [Fawcett et al.2011] C. Fawcett, M. Helmert, H. Hoos, E. Karpas, G. Roger, and J. Seipp. Fd-autotune: Domain-specific configuration using fast-downward. In Proc. of ICAPS’11, 2011.
  • [Hutter et al.2010] F. Hutter, H. Hoos, and K. Leyton-Brown. Automated configuration of mixed integer programming solvers. In Proc. of CPAIOR’10, pages 186–202, 2010.
  • [Hutter et al.2011] F. Hutter, H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proc. of LION’11, pages 507–523, 2011.
  • [Hutter et al.2017] F. Hutter, M. Lindauer, A. Balint, S. Bayless, H. Hoos, and K. Leyton-Brown. The configurable SAT solver challenge (CSSC). Artificial Intelligence, 243:1–25, 2017.
  • [Kadioglu et al.2010] S. Kadioglu, Y. Malitsky, M. Sellmann, and K. Tierney. ISAC - instance-specific algorithm configuration. In Proc. of ECAI’10, pages 751–756, 2010.
  • [Karafotias et al.2015] G. Karafotias, M. Hoogendoorn, and A. E. Eiben. Parameter control in evolutionary algorithms: Trends and challenges. IEEE Transactions on Evolutionary Computation, 19(2):167–187, 2015.
  • [Kingma and Welling2014] D. Kingma and M. Welling. Auto-encoding variational Bayes. In Proc. of ICLR’14, 2014.
  • [Leyton-Brown et al.2009] K. Leyton-Brown, E. Nudelman, and Y. Shoham. Empirical hardness models: Methodology and a case study on combinatorial auctions. Journal of the ACM, 56(4):1–52, 2009.
  • [Li and Malik2017] K. Li and J. Malik. Learning to optimize. In Proc. of ICLR’17, 2017.
  • [Liang et al.2018] E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, K. Goldberg, J. Gonzalez, M. Jordan, and I. Stoica. RLlib: Abstractions for distributed reinforcement learning. In Proc. of ICML’18, pages 3059–3068, 2018.
  • [Lindauer et al.2017] M. Lindauer, K. Eggensperger, M. Feurer, S. Falkner, A. Biedenkapp, and F. Hutter. SMAC v3: Algorithm configuration in Python, 2017.
  • [López-Ibáñez et al.2016] M. López-Ibáñez, J. Dubois-Lacoste, L. Perez Caceres, M. Birattari, and T. Stützle. The irace package: Iterated racing for automatic algorithm configuration. Operations Research Perspectives, 3:43–58, 2016.
  • [Loshchilov and Hutter2017] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. In Proc. of ICLR’17, 2017.
  • [Luby et al.1993] M. Luby, A. Sinclair, and D. Zuckerman. Optimal speedup of Las Vegas algorithms. Information Processing Letters, 47(4):173–180, 1993.
  • [Mnih et al.2013] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv:1312.5602, 2013.
  • [Mnih et al.2016] V. Mnih, A. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proc. of ICML’16, pages 1928–1937, 2016.
  • [Moulines and Bach2011] E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Proc. of NIPS’11, pages 451–459, 2011.
  • [Rice1976] J. Rice. The algorithm selection problem. Advances in Computers, 15:65–118, 1976.
  • [Sakurai et al.2010] Y. Sakurai, K. Takada, T. Kawabe, and S. Tsuruta. A method to control parameters of evolutionary algorithms by using reinforcement learning. In Proc. of SITIS, pages 74–79, 2010.
  • [Schaul et al.2013] T. Schaul, S. Zhang, and Y. LeCun. No More Pesky Learning Rates. In Proc. of ICML’13, 2013.
  • [Schneider and Hoos2012] M. Schneider and H. Hoos. Quantifying homogeneity of instance sets for algorithm configuration. In Proc. of LION’12, pages 190–204, 2012.
  • [Singh et al.2015] B. Singh, S. De, Y. Zhang, T. Goldstein, and G. Taylor. Layer-specific adaptive learning rates for deep networks. In Proc. of ICMLA’15, pages 364–368, 2015.
  • [Snoek et al.2012] J. Snoek, H. Larochelle, and R. Adams. Practical Bayesian optimization of machine learning algorithms. In Proc. of NIPS’12, pages 2960–2968, 2012.
  • [Watkins and Dayan1992] C. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
  • [Xu et al.2010] L. Xu, H. Hoos, and K. Leyton-Brown. Hydra: Automatically configuring algorithms for portfolio-based selection. In Proc. of AAAI’10, pages 210–216, 2010.