Many high-performance algorithms for solving computationally hard problems, ranging from exact methods such as mixed integer programming solvers to heuristic methods such as local search, involve a large number of free parameters that need to be carefully tuned to achieve their best performance. In many cases, performance-optimizing parameter settings are found manually in an ad hoc way. However, manual tuning is often intensive in terms of human effort [López-Ibáñez et al.2016], and thus there have been many attempts to automate this process (see [Hutter et al.2009] for a comprehensive review), which is usually referred to as automatic algorithm configuration (AAC) [Hoos2012]. Many AAC methods such as ParamILS [Hutter et al.2009], GGA/GGA+ [Ansótegui, Sellmann, and Tierney2009, Ansótegui et al.2015], irace [López-Ibáñez et al.2016] and SMAC [Hutter, Hoos, and Leyton-Brown2011] have been proposed in recent years. They have been used to boost algorithm performance in a wide range of domains such as the Boolean satisfiability problem (SAT) [Hutter et al.2009], the traveling salesman problem (TSP) [López-Ibáñez et al.2016, Liu, Tang, and Yao2019], answer set programming (ASP) [Hutter et al.2014] and machine learning [Feurer et al.2015, Kotthoff et al.2017].
Despite the notable success achieved in applications, the theoretical aspects of AAC have rarely been investigated. To the best of our knowledge, the first theoretical analysis of AAC was given by Birattari (2004), who analyzed the expectations and variances of different performance estimators that estimate the true performance of a given parameter configuration on the basis of $N$ runs of the configuration. It is concluded in [Birattari2004] that performing one single run on each of $N$ different problem instances guarantees that the variance of the estimate is minimized, which has served as guidance in the design of the performance estimation mechanisms in later AAC methods including irace, ParamILS and SMAC. Note that the analysis in [Birattari2004] assumes that infinitely many problem instances could be sampled for configuration evaluation. However, in practice we are often given only a finite set of training instances [Hoos2012].
Recently, Kleinberg, Leyton-Brown, and Lucier (2017) introduced a new algorithm configuration framework named Structured Procrastination (SP), which is guaranteed to find an approximately optimal parameter configuration within a logarithmic factor of the optimal runtime in a worst-case sense. Furthermore, the authors showed that the gap between the worst-case runtimes of existing methods (ParamILS, GGA, irace, SMAC) and SP could be arbitrarily large. These results were later extended in [Weisz, György, and Szepesvári2018, Weisz, György, and Szepesvári2019], in which the authors proposed new methods, called LEAPSANDBOUNDS (LB) and CapsAndRuns (CR), with better runtime guarantees. However, there is a discrepancy between the algorithm configuration problem addressed by these methods (SP, LB and CR) and the problem that is most frequently encountered in practice. More specifically, these methods are designed to find parameter configurations with approximately optimal performance on the input (training) instances, while in practice it is more desirable to find parameter configurations that will perform well on new unseen instances rather than just the training instances [Hoos2012]. Indeed, one of the most critical issues that needs to be addressed in AAC is the over-tuning phenomenon [Birattari2004], in which the found parameter configuration has excellent training performance but performs badly on new instances. (To appropriately evaluate AAC methods, the common scheme in the literature, including widely used benchmarks (e.g., AClib [Hutter et al.2014]) and major contests (e.g., the Configurable SAT Solver Challenge (CSSC) [Hutter et al.2017]), is to use an independent test set that has never been used during the configuration procedure to test the found configurations.)
Based on the above observation, this paper extends the results of [Birattari2004] in several aspects. First, this paper introduces a new formulation of the algorithm configuration problem (Definition 1), which concerns the optimization of the expected performance of the configured algorithm on an instance distribution $\mathcal{D}$. Compared to the formulation considered by Birattari (2004), in which $\mathcal{D}$ is directly given (and thus could be sampled infinitely often), in the problem considered here $\mathcal{D}$ is unknown and inaccessible, and the assumption is that the input training instances (and the test instances) are sampled from $\mathcal{D}$. Therefore, when solving this configuration problem, we can only use the given finite set of training instances. One key difficulty is that the true performance of a parameter configuration cannot be computed exactly. Consequently, we can only run a configuration on the training instances to obtain an estimate of its true performance. Thus a natural and important question is how to allocate a finite computational budget, e.g., $N$ runs of the configuration, over the training instances to obtain the most reliable estimate. Moreover, given that we can obtain an estimate of the true performance, is it possible to quantify the difference between the estimate and the true performance?
The second and most important contribution of this paper is that it answers the above questions theoretically. More specifically, this paper first introduces a universal best performance estimator (Theorem 1) that always distributes the runs of a configuration over all training instances as evenly as possible, such that the performance estimate is most reliable. This paper then investigates the estimation error, i.e., the difference between the training performance (the estimate) and the true performance, and establishes a bound on the estimation error that holds for all configurations in the configuration space, provided that the cardinality of the configuration space is finite (Theorem 2). It is shown that the bound deteriorates as the number of considered configurations increases. Since in practice the cardinality of the considered configuration space could be very large or even infinite, we make two mild assumptions on the considered configuration scenarios to remove the dependence on the cardinality of the configuration space, and finally establish a new bound on the estimation error (Theorem 3).
The effectiveness of these results has been verified in extensive experiments conducted on four configuration scenarios involving problem domains including SAT, ASP and TSP. Some potential directions for improving current AAC methods based on these results have also been identified.
Algorithm Configuration Problem
In a nutshell, the algorithm configuration problem concerns setting the free parameters of a given parameterized algorithm (called the target algorithm) such that its performance is optimized.
Let $\mathcal{A}$ denote the target algorithm and let $p_1, \dots, p_h$ be the parameters of $\mathcal{A}$. Denote the set of possible values for each parameter $p_i$ as $\Theta_i$. A parameter configuration (or simply configuration) $\theta$ of $\mathcal{A}$ refers to a complete setting of $p_1, \dots, p_h$, such that the behavior of $\mathcal{A}$ on a given problem instance is completely specified (up to randomization of $\mathcal{A}$ itself). The configuration space $\Theta = \Theta_1 \times \cdots \times \Theta_h$ contains all possible configurations of $\mathcal{A}$. For brevity, henceforth we will not distinguish between $\theta$ and the instantiation of $\mathcal{A}$ with $\theta$. In real applications $\mathcal{A}$ is often randomized and its output is determined by the used configuration $\theta$, the input instance $z$ and the random seed $v$. Let $\mathcal{D}$ denote a probability distribution over a space $\Omega$ of problem instances from which $z$ is sampled. Let $\mathcal{G}$ be a probability distribution over a space $\Gamma$ of random seeds from which $v$ is sampled. In practice $\mathcal{G}$ is often given implicitly through a random number generator.
Given an instance $z$ and a seed $v$, the quality of $\theta$ at $(z, v)$ is measured by a utility function $f: \Theta \times \Omega \times \Gamma \to [f_{\min}, f_{\max}]$, where $f_{\min}, f_{\max}$ are finite real numbers. In practice, this means running $\theta$ with $v$ on $z$, and $f$ maps the result of this run to a scalar score. How the mapping is done depends on the considered performance metric. For example, if we are interested in optimizing the quality of the solutions found by $\mathcal{A}$, then we might take the (normalized) cost of the solution output by $\mathcal{A}$ as the utility; if we are interested in minimizing the computational resources consumed by $\mathcal{A}$ (such as runtime, memory or communication bandwidth), then we might take the quantity of the resource consumed by the run as the utility. No matter which performance metric is considered, in practice the value of $f$ is bounded, i.e., for all $\theta \in \Theta$ and all $(z, v) \in \Omega \times \Gamma$, $f_{\min} \leq f(\theta, z, v) \leq f_{\max}$.
To measure the performance of $\theta$, the expected value of the utility scores of $\theta$ across different $(z, v)$, which is the most widely adopted criterion in AAC applications [Hoos2012], is considered here. More specifically, as presented in Definition 1, the performance of $\theta$, denoted as $u(\theta)$, is its expected utility score over the instance distribution $\mathcal{D}$ and the random seed distribution $\mathcal{G}$. Without loss of generality, we always assume that a smaller value is better for $u(\theta)$. The goal of the algorithm configuration problem is to find a configuration from the configuration space $\Theta$ with the best performance.
Definition 1 (Algorithm Configuration Problem).
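Given a target algorithm $\mathcal{A}$ with configuration space $\Theta$, an instance distribution $\mathcal{D}$ defined over space $\Omega$, a random seed distribution $\mathcal{G}$ defined over space $\Gamma$ and a utility function $f: \Theta \times \Omega \times \Gamma \to [f_{\min}, f_{\max}]$ that measures the quality of $\theta$ at $(z, v)$, the algorithm configuration problem is to find a configuration $\theta^{*}$ from $\Theta$ with the best performance:
$$\theta^{*} \in \arg\min_{\theta \in \Theta} u(\theta), \quad \text{where } u(\theta) = \mathbb{E}_{z \sim \mathcal{D},\, v \sim \mathcal{G}}\left[f(\theta, z, v)\right].$$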
In practice, $\mathcal{D}$ is usually unknown and $u(\theta)$ cannot be computed analytically. Instead, we usually have a set of problem instances $\{z_1, \dots, z_K\}$, called training instances, which are assumed to be sampled from $\mathcal{D}$. To estimate $u(\theta)$, a series of experiments of $\theta$ on $\{z_1, \dots, z_K\}$ could be run. As presented in Definition 2, an experimental setting $S_N$ to estimate $u(\theta)$ is to run $\theta$ on $\{z_1, \dots, z_K\}$ for $N$ times, each time with a random seed sampled from $\mathcal{G}$.
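Definition 2 (Experimental Setting $S_N$).
Given a configuration $\theta$, a set of training instances $\{z_1, \dots, z_K\}$ and the total number of runs $N$ of $\theta$, an experimental setting $S_N$ to estimate $u(\theta)$ is a list of $N$ tuples, in which each tuple $(z, v)$ consists of an instance $z$ and a random seed $v$, meaning a single run of $\theta$ with $v$ on $z$. Let $n_i$ denote the number of runs performed on $z_i$ (note that $n_i$ could be 0, meaning $\theta$ will not be run on $z_i$). It holds that $\sum_{i=1}^{K} n_i = N$ and $S_N$ could be written as:
$$S_N = \left[(z_1, v_{1,1}), \dots, (z_1, v_{1,n_1}), \dots, (z_K, v_{K,1}), \dots, (z_K, v_{K,n_K})\right].$$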
After performing the runs of $\theta$ as specified in $S_N$, the utility scores of these runs are aggregated to estimate $u(\theta)$. The following estimator $\hat{u}_{S_N}(\theta)$, which calculates the mean utility across all runs and is widely adopted in AAC methods [Hutter et al.2009, López-Ibáñez et al.2016, Hutter, Hoos, and Leyton-Brown2011], is presented in Definition 3.
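Definition 3 (Estimator $\hat{u}_{S_N}(\theta)$).
Given a configuration $\theta$ and an experimental setting $S_N$, the training performance of $\theta$, which is an estimate of $u(\theta)$, is given by:
$$\hat{u}_{S_N}(\theta) = \frac{1}{N} \sum_{i=1}^{K} \sum_{j=1}^{n_i} f(\theta, z_i, v_{i,j}).$$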
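Definitions 2 and 3 can be illustrated with a minimal Python sketch. It builds an experimental setting as a list of (instance, seed) tuples and averages the utility over all runs. The names `build_setting`, `training_performance` and `toy_utility`, as well as the toy instances, are hypothetical and chosen only for illustration; in a real scenario the utility would come from actually running the target algorithm.

```python
import random
from typing import Callable, Dict, Hashable, List, Tuple

Run = Tuple[Hashable, int]  # one run: (instance, random seed)


def build_setting(runs_per_instance: Dict[Hashable, int], rng: random.Random) -> List[Run]:
    """Build an experimental setting: n_i (instance, seed) tuples per instance z_i."""
    setting: List[Run] = []
    for instance, n_i in runs_per_instance.items():
        for _ in range(n_i):
            setting.append((instance, rng.randrange(2**31)))  # fresh seed per run
    return setting


def training_performance(config, setting: List[Run], utility: Callable) -> float:
    """Mean utility over all runs in the setting (the estimator of Definition 3)."""
    if not setting:
        raise ValueError("the experimental setting must contain at least one run")
    return sum(utility(config, z, v) for z, v in setting) / len(setting)


def toy_utility(config: float, instance: float, seed: int) -> float:
    """Hypothetical utility: distance to the instance's optimum plus seeded noise."""
    noise = random.Random(seed).uniform(-0.1, 0.1)
    return abs(config - instance) + noise


# Three toy instances with n_1 = 2, n_2 = 2, n_3 = 1 runs, so N = 5.
setting = build_setting({0.2: 2, 0.5: 2, 0.9: 1}, random.Random(0))
estimate = training_performance(0.5, setting, toy_utility)
```

Passing the utility as a callable keeps the estimator independent of how a single run is executed and scored.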
Different experimental settings represent different performance estimators, which behave differently. It is thus necessary to investigate which one is the best.
Universal Best Performance Estimator
To determine the values of the per-instance run counts $n_i$ in $S_N$, Birattari (2004) analyzed the expectations and variances of $\hat{u}_{S_N}(\theta)$, and concluded that $\hat{u}_{S_N}(\theta)$ with $K = N$ and $n_i = 1$ for all $i$ has the minimal variance. Note that the analysis in [Birattari2004] assumes that infinitely many problem instances could be sampled from $\mathcal{D}$; thus for performing $N$ runs of $\theta$, as specified in $S_N$, it is always best to sample $N$ instances from $\mathcal{D}$ and perform one single run of $\theta$ on each instance. In other words, this conclusion is established on the basis that the number of training instances $K$ could always be set equal to $N$. However, in practice we usually have only a finite number of training instances. In the case that $K \neq N$, which $\hat{u}_{S_N}(\theta)$ is the best? Theorem 1 answers this question for an arbitrary relationship between $K$ and $N$. Before presenting Theorem 1, some necessary definitions are introduced.
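Given a configuration $\theta$ and an instance $z$, the expected utility of $\theta$ within $z$, denoted as $u_z(\theta)$, is $\mathbb{E}_{v \sim \mathcal{G}}[f(\theta, z, v)]$. The variance of the utility of $\theta$ within $z$, denoted as $\sigma^2_z(\theta)$, is $\mathbb{E}_{v \sim \mathcal{G}}[(f(\theta, z, v) - u_z(\theta))^2]$. Based on $u_z(\theta)$ and $\sigma^2_z(\theta)$, the expected within-instance variance of $\theta$ and the across-instance variance of $\theta$ are defined in Definition 4 and Definition 5, respectively.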
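Definition 4 (Expected within-instance Variance of $\theta$).
$\bar{\sigma}^2_{WI}(\theta)$ is the expected value of $\sigma^2_z(\theta)$ over the instance distribution $\mathcal{D}$:
$$\bar{\sigma}^2_{WI}(\theta) = \mathbb{E}_{z \sim \mathcal{D}}\left[\sigma^2_z(\theta)\right].$$
Definition 5 (Across-instance Variance of $\theta$).
$\sigma^2_{AI}(\theta)$ is the variance of $u_z(\theta)$ over the instance distribution $\mathcal{D}$:
$$\sigma^2_{AI}(\theta) = \mathbb{E}_{z \sim \mathcal{D}}\left[(u_z(\theta) - u(\theta))^2\right].$$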
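Lemma 1.
The expectation of $\hat{u}_{S_N}(\theta)$ is $u(\theta)$, that is, $\hat{u}_{S_N}(\theta)$ is an unbiased estimator of $u(\theta)$ no matter how the $n_i$ in $S_N$ are set:
$$\mathbb{E}\left[\hat{u}_{S_N}(\theta)\right] = u(\theta).$$
Lemma 2.
The variance of $\hat{u}_{S_N}(\theta)$ is given by:
$$\mathrm{Var}\left[\hat{u}_{S_N}(\theta)\right] = \frac{1}{N}\,\bar{\sigma}^2_{WI}(\theta) + \frac{\sum_{i=1}^{K} n_i^2}{N^2}\,\sigma^2_{AI}(\theta).$$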
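The variance decomposition can be checked numerically. The sketch below assumes the variance takes the form $\frac{1}{N}\bar{\sigma}^2_{WI} + \frac{\sum_i n_i^2}{N^2}\sigma^2_{AI}$, which follows from the independence of instances and seeds, and compares it with a Monte Carlo estimate under a toy model. The toy model (instance means drawn from U(0, 1), run noise drawn from U(-0.5, 0.5), both with variance 1/12) and all function names are assumptions made purely for illustration.

```python
import random
import statistics
from typing import List


def simulate_estimator_variance(runs_per_instance: List[int], trials: int, seed: int = 0) -> float:
    """Empirical variance of the mean-utility estimator under the toy model."""
    rng = random.Random(seed)
    total_runs = sum(runs_per_instance)
    estimates = []
    for _ in range(trials):
        total = 0.0
        for n_i in runs_per_instance:
            instance_mean = rng.random()  # a fresh instance drawn from D
            for _ in range(n_i):
                # a fresh seed drawn from G adds within-instance noise
                total += instance_mean + rng.uniform(-0.5, 0.5)
        estimates.append(total / total_runs)
    return statistics.pvariance(estimates)


def predicted_variance(runs_per_instance: List[int], within_var: float, across_var: float) -> float:
    """Variance predicted by the decomposition into within- and across-instance parts."""
    n = sum(runs_per_instance)
    return within_var / n + sum(k * k for k in runs_per_instance) * across_var / n**2


even = simulate_estimator_variance([2, 2, 2], trials=20000, seed=1)
skewed = simulate_estimator_variance([4, 1, 1], trials=20000, seed=1)
print(even, predicted_variance([2, 2, 2], 1 / 12, 1 / 12))
print(skewed, predicted_variance([4, 1, 1], 1 / 12, 1 / 12))
```

Both allocations use the same budget of six runs; only the term involving $\sum_i n_i^2$ changes, which is why the skewed allocation has the larger variance.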
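Theorem 1.
Given a configuration $\theta$, a training set of $K$ instances and the total number $N$ of runs of $\theta$, the universal best estimator $\hat{u}_{S_N^{*}}(\theta)$ for $u(\theta)$ is obtained by setting $n_i \in \{\lfloor N/K \rfloor, \lceil N/K \rceil\}$ for all $i \in \{1, \dots, K\}$, s.t. $\sum_{i=1}^{K} n_i = N$. $\hat{u}_{S_N^{*}}(\theta)$ is an unbiased estimator of $u(\theta)$ and has the minimal variance among all possible estimators $\hat{u}_{S_N}(\theta)$.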
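Proof. By Lemma 1, $\hat{u}_{S_N^{*}}(\theta)$ is an unbiased estimator of $u(\theta)$. We now prove that $\hat{u}_{S_N^{*}}(\theta)$ has the minimal variance. By Lemma 2, the variance of $\hat{u}_{S_N}(\theta)$ is $\frac{1}{N}\bar{\sigma}^2_{WI}(\theta) + \frac{\sum_{i=1}^{K} n_i^2}{N^2}\sigma^2_{AI}(\theta)$. Since $K$ and $N$ are fixed, and $\bar{\sigma}^2_{WI}(\theta)$ and $\sigma^2_{AI}(\theta)$ are constants for a given $\theta$, we need to minimize $\sum_{i=1}^{K} n_i^2$, s.t. $\sum_{i=1}^{K} n_i = N$. Given $\sum_{i=1}^{K} n_i = N$, the condition $n_i \in \{\lfloor N/K \rfloor, \lceil N/K \rceil\}$ for all $i$ is equivalent to $|n_i - n_j| \leq 1$ for all $i, j$. It then suffices to prove that $\sum_{i=1}^{K} n_i^2$ is minimized under this condition. Assume that $\sum_{i=1}^{K} n_i^2$ is minimized while the condition is not satisfied; then there must exist $n_i$ and $n_j$ such that $n_i - n_j \geq 2$. Replacing them with $n_i - 1$ and $n_j + 1$ keeps the total number of runs unchanged, and we have $(n_i - 1)^2 + (n_j + 1)^2 = n_i^2 + n_j^2 - 2(n_i - n_j - 1) < n_i^2 + n_j^2$. This contradicts the assumption that $\sum_{i=1}^{K} n_i^2$ is minimized. The proof is complete. ∎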
Theorem 1 states that it is always best to distribute the runs of $\theta$ over all training instances as evenly as possible, in which case $n_i \in \{\lfloor N/K \rfloor, \lceil N/K \rceil\}$, no matter whether $K \geq N$ or $K < N$. When $K \geq N$, the resulting estimator is equivalent to the estimator of [Birattari2004] that performs one single run of $\theta$ on each of $N$ instances. When $K < N$, it performs $\lceil N/K \rceil$ runs of $\theta$ on each of $N \bmod K$ instances and $\lfloor N/K \rfloor$ runs on each of the remaining instances. It is worth mentioning that practical AAC methods including ParamILS, SMAC and irace actually adopt the same or quite similar estimators. Theorem 1 provides a theoretical guarantee for these estimators, which will be further evaluated in the experiments.
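The even allocation described above can be sketched in a few lines of Python. The function names are hypothetical; `brute_force_min` is included only as a sanity check on tiny inputs and enumerates every allocation with the same total number of runs.

```python
from itertools import product
from typing import List, Sequence


def allocate_runs(total_runs: int, num_instances: int) -> List[int]:
    """Spread total_runs over num_instances as evenly as possible."""
    if total_runs < 0 or num_instances <= 0:
        raise ValueError("need total_runs >= 0 and num_instances > 0")
    base, extra = divmod(total_runs, num_instances)
    # 'extra' instances get one additional run; the rest get 'base' runs.
    return [base + 1] * extra + [base] * (num_instances - extra)


def sum_of_squares(allocation: Sequence[int]) -> int:
    """The quantity that the variance of the estimator depends on."""
    return sum(n * n for n in allocation)


def brute_force_min(total_runs: int, num_instances: int) -> int:
    """Smallest sum of squares over all allocations (exponential; tiny inputs only)."""
    return min(
        sum_of_squares(a)
        for a in product(range(total_runs + 1), repeat=num_instances)
        if sum(a) == total_runs
    )


print(allocate_runs(10, 4))  # [3, 3, 2, 2]
print(allocate_runs(3, 5))   # [1, 1, 1, 0, 0]
```

With fewer runs than instances, some instances receive no run at all, which matches the case in which a single run is performed on each of $N$ instances.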
Bounds on Estimation Error
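Although Theorem 1 presents the estimator with the universal minimal variance, it cannot provide any information about how large the estimation error, i.e., $u(\theta) - \hat{u}_{S_N}(\theta)$, could be. Bounds on the estimation error are useful in both theory and practice because we could use them to establish bounds on the true performance $u(\theta)$, given that in the algorithm configuration process the training performance $\hat{u}_{S_N}(\theta)$ is actually known. In general, given a configuration $\theta$, its training performance $\hat{u}_{S_N}(\theta)$ is a random variable because the training instances and the random seeds specified in $S_N$ are drawn from the distributions $\mathcal{D}$ and $\mathcal{G}$, respectively. Thus we focus on establishing probabilistic inequalities for $u(\theta) - \hat{u}_{S_N}(\theta)$, i.e., for any $0 < \delta < 1$, with probability at least $1 - \delta$, there holds $u(\theta) - \hat{u}_{S_N}(\theta) \leq \epsilon$, where the bound $\epsilon$ depends on $\delta$. In particular, probabilistic bounds on the uniform estimation error, i.e., $\sup_{\theta \in \Theta}\left[u(\theta) - \hat{u}_{S_N}(\theta)\right]$, that hold for all $\theta \in \Theta$ are established. Recalling that Lemma 1 states $\mathbb{E}[\hat{u}_{S_N}(\theta)] = u(\theta)$, the key technique for deriving bounds on $u(\theta) - \hat{u}_{S_N}(\theta)$ is the concentration inequality presented in Lemma 4, which bounds how far $\hat{u}_{S_N}(\theta)$ deviates from its expected value $u(\theta)$.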
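Lemma 3 (Bernstein’s Inequality [Bernstein1927]).
Let $X_1, \dots, X_n$ be independent centered bounded random variables, i.e., $\mathbb{E}[X_i] = 0$ and $|X_i| \leq a$. Let $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} \sigma_i^2$, where $\sigma_i^2$ is the variance of $X_i$. Then for any $\epsilon > 0$ we have
$$P\left(\frac{1}{n}\sum_{i=1}^{n} X_i \geq \epsilon\right) \leq \exp\left(-\frac{n\epsilon^2}{2\sigma^2 + 2a\epsilon/3}\right).$$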
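To give a feel for how such a concentration inequality turns into a probabilistic error bound, the following sketch assumes one standard form of Bernstein's inequality for the sample mean, $P\left(\frac{1}{n}\sum_i X_i \geq \epsilon\right) \leq \exp\left(-\frac{n\epsilon^2}{2\sigma^2 + 2a\epsilon/3}\right)$, with $|X_i| \leq a$ and $\sigma^2$ the average variance. The constants in the form used by this paper may differ, and the function names are hypothetical. Setting the right-hand side equal to $\delta$ and solving the resulting quadratic in $\epsilon$ gives the deviation that is exceeded with probability at most $\delta$.

```python
import math


def bernstein_tail(epsilon: float, n: int, avg_variance: float, bound: float) -> float:
    """Upper bound on P(mean of n centered variables >= epsilon)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return math.exp(-n * epsilon**2 / (2 * avg_variance + 2 * bound * epsilon / 3))


def bernstein_epsilon(delta: float, n: int, avg_variance: float, bound: float) -> float:
    """Deviation epsilon at which the tail bound equals delta.

    Solves n * eps^2 = log(1/delta) * (2 * avg_variance + 2 * bound * eps / 3)
    for the positive root.
    """
    if not 0 < delta < 1:
        raise ValueError("delta must lie in (0, 1)")
    log_term = math.log(1 / delta)
    linear = bound * log_term / (3 * n)
    return linear + math.sqrt(linear**2 + 2 * avg_variance * log_term / n)


eps_100 = bernstein_epsilon(delta=0.05, n=100, avg_variance=0.04, bound=1.0)
eps_400 = bernstein_epsilon(delta=0.05, n=400, avg_variance=0.04, bound=1.0)
print(eps_100, eps_400)
```

The bound shrinks as the number of runs grows, and it is tighter than a purely range-based bound when the variance is small relative to the range of the utility.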
Given a configuration , an experimental setting and a performance estimator . Let . Let , where are the lower bound and the upper bound of respectively (see Definition 1), and let . Then for any , we have
Define random variables , and define random variables . First we prove that satisfy the conditions in Lemma 3. . By Definition 1, , it holds that (since ). Thus we have and . For any , and are independent. Thus are independent random variables.
Let . By Lemma 3, it holds that, for any , , where . Notice that ; thus it holds that for any ,
The rest of the proof focuses on . Since , . Substitute with and we have . We analyze and in turn. (by setting in Eq. (1)). . Given an instance , and are independent because and are sampled from . Thus it holds that:
By the fact , . The last step is by Definition 5.
Summing up the above results, we have . Thus . Substitute in Eq. (2) with this result and the proof is complete. ∎
On Configuration Space with Finite Cardinality
Theorem 2 presents the bound for uniform estimation error when is of finite cardinality.
Given a performance estimator . Let , where , and let , and . Let and . Given that is of finite cardinality, i.e., , then for any , with probability at least , there holds:
Note that for different , the bounds on the right side of Eq. (2) are different. The proof of Theorem 1 shows that , s.t. , is minimized on the condition for all . Moreover, it is easy to verify that is also minimized on the same condition, in which case . Thus we can immediately obtain Corollary 1.
On Configuration Space with Infinite Cardinality
In practice the cardinality of the configuration space could be considerably large, in which case the bound provided by Theorem 2 could be very loose. Moreover, when the cardinality of the configuration space is infinite, Theorem 2 no longer applies. To address these issues, we establish a new uniform error bound that does not depend on the cardinality of the configuration space, based on the two mild assumptions given below.
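Assumption 1.
(a) We assume there exists $R > 0$ such that $\Theta \subseteq B_R$, where $B_R = \{\theta \in \mathbb{R}^h : \|\theta\|_2 \leq R\}$ is a ball of radius $R$ and $\|\theta\|_2 = \left(\sum_{i=1}^{h} \theta_i^2\right)^{1/2}$ for $\theta = (\theta_1, \dots, \theta_h) \in \mathbb{R}^h$.
(b) We assume for any $(z, v) \in \Omega \times \Gamma$, the utility function $f$ is L-Lipschitz continuous, i.e., $|f(\theta, z, v) - f(\theta', z, v)| \leq L\,\|\theta - \theta'\|_2$ for all $\theta, \theta' \in \Theta$.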
Part (a) of Assumption 1 means that the value ranges of all considered parameters are bounded, which holds in nearly all practical algorithm configuration scenarios [Hutter et al.2014]. Part (b) of Assumption 1 limits how fast the utility function can change across the configuration space. This assumption is also mild in the sense that configurations with similar parameter values are expected to result in similar behaviors of the target algorithm, and thus in similar performances. The key technique for deriving the new bound is covering numbers, as defined in Definition 6.
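Definition 6 (Covering Number).
Let $\mathcal{F}$ be a set and $d$ be a metric on $\mathcal{F}$. For any $\epsilon > 0$, a set $\mathcal{F}^{\triangle} \subseteq \mathcal{F}$ is called an $\epsilon$-cover of $\mathcal{F}$ if for every $x \in \mathcal{F}$ there exists an element $y \in \mathcal{F}^{\triangle}$ satisfying $d(x, y) \leq \epsilon$. The covering number $\mathcal{N}(\epsilon, \mathcal{F}, d)$ is the cardinality of the minimal $\epsilon$-cover of $\mathcal{F}$:
$$\mathcal{N}(\epsilon, \mathcal{F}, d) = \min\left\{|\mathcal{F}^{\triangle}| : \mathcal{F}^{\triangle} \text{ is an } \epsilon\text{-cover of } \mathcal{F}\right\}.$$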
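As a small one-dimensional illustration of an $\epsilon$-cover, the sketch below covers the interval $[-R, R]$ under the metric $d(x, y) = |x - y|$ with evenly spaced centers. The function names are hypothetical, and the example covers only a single bounded parameter; it is not the construction used in the proofs.

```python
import math
from typing import List, Sequence


def interval_cover(radius: float, epsilon: float) -> List[float]:
    """Centers of an epsilon-cover of [-radius, radius] under d(x, y) = |x - y|."""
    if radius <= 0 or epsilon <= 0:
        raise ValueError("radius and epsilon must be positive")
    count = max(1, math.ceil(radius / epsilon))  # each center covers width 2 * epsilon
    width = 2 * radius / count                   # width per center, at most 2 * epsilon
    return [-radius + width * (i + 0.5) for i in range(count)]


def is_cover(centers: Sequence[float], points: Sequence[float], epsilon: float) -> bool:
    """Check that every point lies within epsilon of some center."""
    return all(min(abs(p - c) for c in centers) <= epsilon + 1e-12 for p in points)


centers = interval_cover(radius=1.0, epsilon=0.25)
grid = [-1.0 + 2.0 * i / 1000 for i in range(1001)]
print(centers)                        # [-0.75, -0.25, 0.25, 0.75]
print(is_cover(centers, grid, 0.25))  # True
```

For a ball in $h$ dimensions, the size of such a cover typically grows exponentially in $h$, which is why covering numbers are a natural way to measure the effective size of a continuous configuration space.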
Lemma 5 presents a covering number bound on .
Lemma 5 ([Gilles1999]).
Since , it is easy to verify that . Based on the L-Lipschitz continuity assumption, Lemma 6 establishes a bound for , where .
Let and . If Assumption 1 holds, then .
For any , by the Lipschitz continuity we know . Then, any -cover of w.r.t. would imply an -cover of w.r.t. . This together with Lemma 5 implies the stated result. The proof is complete. ∎
| SATenstein-QCP | SATenstein [KhudaBukhsh et al.2016], h = 54 | SAT | Randomly selected from QCP [Gomes and Selman1997] | 500 | 500 | 5s |
| clasp-weighted-sequence | clasp [Gebser et al.2007], h = 98 | ASP | "small" type weighted-sequence [Lierler et al.2012] | 500 | 120 | 25s |
| LKH-uniform-400 | LKH [Helsgaun2000], h = 23 | TSP | Generated by [Johnson and McGeoch2007], #city = 400 | 500 | 250 | 10s |
| LKH-uniform-1000 | LKH [Helsgaun2000], h = 23 | TSP | Generated by [Johnson and McGeoch2007], #city = 1000 | 500 | 250 | 10s |
With the covering number bound established in Lemma 6, the new bound on the uniform estimation error is established in Theorem 3.
For any positive constants , the inequality has a solution
Without loss of generality we can assume . Let be a -cover of with , where . By Definition 6, for any there exists , such that ; it follows that and Then,