NP-complete problems are ubiquitous in AI. Luckily, while these problems may be hard to solve on worst-case inputs, it is often feasible to solve even large problem instances that arise in practice.
Less luckily, state-of-the-art algorithms often exhibit extreme
runtime variation across instances from realistic distributions, even when problem size is held constant, and conversely the same instance can take
different amounts of time to solve depending on the algorithm used
et al. (2000). There is little theoretical understanding of what causes this variation.
Over the past decade, a considerable body of work has shown how to use supervised machine learning methods to build regression models that provide approximate answers to this question based on given algorithm performance data; we survey this work in Section 2.
In this article, we refer to such models as empirical performance models (EPMs).111In work aiming to gain insights into instance hardness beyond the worst case, we have used the term empirical hardness model Leyton-Brown
et al. (2002, 2009, 2013) .
Similar regression models can also be used to predict objectives other than runtime;
examples include an algorithm’s success probability
. Similar regression models can also be used to predict objectives other than runtime; examples include an algorithm’s success probabilityHowe et al. (2000); Roberts & Howe (2007), the solution quality an optimization algorithm achieves in a fixed time Ridge & Kudenko (2007); Chiarandini & Goegebeur (2010); Hutter et al. (2012), approximation ratio of greedy local search Mersmann et al. (2013), or the SAT competition scoring function Xu et al. (2008). We reflect this broadened scope by using the term EPMs, which we understand as an umbrella that includes EHMs. These models are useful in a variety of practical contexts:
Algorithm selection. This classic problem of selecting the best from a given set of algorithms on a per-instance basis Rice (1976); Smith-Miles (2009) has been successfully addressed by using EPMs to predict the performance of all candidate algorithms and selecting the one predicted to perform best Brewer (1995); Lobjois & Lemaître (1998); Fink (1998); Howe et al. (2000); Roberts & Howe (2007); Xu et al. (2008); Kotthoff et al. (2012).
Parameter tuning and algorithm configuration. EPMs are useful for these problems in at least two ways. First, they can model the performance of a parameterized algorithm dependent on the settings of its parameters; in a sequential model-based optimization process, one alternates between learning an EPM and using it to identify promising settings to evaluate next Jones et al. (1998); Bartz-Beielstein et al. (2005); Hutter et al. (2010, 2011b, 2012). Second, EPMs can model algorithm performance dependent on both problem instance features and algorithm parameter settings; such models can then be used to select parameter settings with good predicted performance on a per-instance basis Hutter et al. (2006).
Gaining insights into instance hardness and algorithm performance. EPMs can be used to assess which instance features and algorithm parameter values most impact empirical performance. Some models support such assessments directly Ridge & Kudenko (2007); Mersmann et al. (2013)
. For other models, generic feature selection methods, such as forward selection, can be used to identify a small number of key model inputs (often fewer than five) that explain algorithm performance almost as well as the whole set of inputsLeyton-Brown et al. (2009); Hutter et al. (2013b).
While these applications motivate our work, in the following, we will not discuss them in detail; instead, we focus on the models themselves. The idea of modelling algorithm runtime is no longer new; however, we have made substantial recent progress in making runtime prediction methods more general, scalable and accurate. After a review of past work (Section 2) and of the runtime prediction methods used by this work (Section 3), we describe four new contributions.
We introduce new instance features for propositional satisfiability (SAT), travelling salesperson (TSP) and mixed integer programming (MIP) problems—in particular, novel probing features and timing features—yielding comprehensive sets of 138, 121, and 64 features for SAT, MIP, and TSP, respectively (Section 5).
To assess the impact of these advances and to determine the current state of the art, we performed what we believe is the most comprehensive evaluation of runtime prediction methods to date. Specifically, we evaluated all methods of which we are aware on performance data for 11 algorithms and 35 instance distributions spanning SAT, TSP and MIP and considering three different problems: predicting runtime on novel instances (Section 6), novel parameter configurations (Section 7), and both novel instances and configurations (Section 8).
Techniques from the statistical literature on survival analysis offer ways to better handle data from runs that were terminated prematurely. While these techniques were not used in most previous work—leading us to omit them from the comparison above—we show how to leverage them to achieve further improvements to our best-performing model, random forests (Section 9).222We used early versions of the new modeling techniques described in Section 4, as well as the extensions to censored data described in Section 9 in recent conference and workshop publications on algorithm configuration Hutter et al. (2010, 2011b, 2012, 2011a). This article is the first to comprehensively evaluate the quality of these models.
2 An Overview of Related Work
Because the problems have been considered by substantially different communities, we separately consider related work on predicting the runtime of parameterless and parameterized algorithms, and applications of these predictions to gain insights into instance hardness and algorithm parameters.
2.1 Related Work on Predicting Runtime of Parameterless Algorithms
The use of statistical regression methods for runtime prediction has its roots in a range of different communities and dates back at least to the mid-1990s. In the parallel computing literature, Brewer used linear regression models to predict the runtime of different implementations of portable, high-level libraries for multiprocessors, aiming to automatically select the best implementation on a novel architectureBrewer (1994, 1995). In the AI planning literature, Fink (1998) used linear regression to predict how the performance of three planning algorithms depends on problem size and used these predictions for deciding which algorithm to run for how long. In the same community, Howe and co-authors Howe et al. (2000); Roberts & Howe (2007) used linear regression to predict how both a planner’s runtime and its probability of success depend on various features of the planning problem; they also applied these predictions to decide, on a per-instance basis, which of a finite set of algorithms should be run in order to optimize a performance objective such as expected runtime. Specifically, they constructed a portfolio of planners that ordered algorithms by their expected success probability divided by their expected runtime. In the constraint programming literature, Leyton-Brown et al. Leyton-Brown et al. (2002, 2009)
studied the winner determination problem in combinatorial auctions and showed that accurate runtime predictions could be made for several different solvers and a wide variety of instance distributions. That work considered a variety of different regression methods (including lasso regression, multivariate adaptive regression splines, and support vector machine regression) but in the end settled on a relatively simpler method: ridge regression with preprocessing to select an appropriate feature subset, a quadratic basis function expansion, and a log-transformation of the response variable. (We formally define this and other regression methods in Section3.) The problem-independent runtime modelling techniques from that work were subsequently applied to the SAT problem Nudelman et al. (2004), leading to the successful portfolio-based algorithm selection method SATzilla Nudelman et al. (2003, 2004); Xu et al. (2007, 2008). Most recently, in the machine learning community, Huang et al. (2010) applied linear regression techniques to the modeling of algorithms with low-order polynomial runtimes.
Due to the extreme runtime variation often exhibited by algorithms for solving combinatorial problems, it is common practice to terminate unsuccessful runs after they exceed a so-called captime. Capped runs only yield a lower bound on algorithm runtime, but are typically treated as having succeeded at the captime. Fink (1998) was the first to handle such right-censored data points more soundly for runtime predictions of AI planning methods and used the resulting predictions to compute captimes that maximize a given utility function. Gagliolo et al. Gagliolo & Schmidhuber (2006); Gagliolo & Legrand (2010) made the connection to the statistical literature on survival analysis to handle right-censored data in their work on dynamic algorithm portfolios. Subsequently, similar techniques were used for SATzilla’s runtime predictions Xu et al. (2007) and in model-based algorithm configuration Hutter et al. (2011a).
Recently, Smith-Miles et al. published a series of papers on learning-based approaches for characterizing instance hardness for a wide variety of hard combinatorial problems Smith-Miles (2009); Smith-Miles et al. (2010); Smith-Miles & Tan (2012); Smith-Miles & Lopes (2012)
. Their work considered a range of tasks, including not only performance prediction, but also clustering, classification into easy and hard instances, as well as visualization. In the context of performance prediction, on which we focus in this article, theirs is the only work known to us to use neural network models. Also recently,Kotthoff et al. (2012) compared regression, classification, and ranking algorithms for algorithm selection and showed that this choice matters: poor regression and classification methods yielded worse performance than the single best solver, while good methods yielded better performance.
Several other veins of performance prediction research deserve mention. Haim & Walsh (2008)
extended linear methods to the problem of making online estimates of SAT solver runtimes. Several researchers have applied supervised classification to select the fastest algorithm for a problem instanceGuerri & Milano (2004); Gebruers et al. (2004); Guo & Hsu (2004); Gebruers et al. (2005); Xu et al. (2012a) or to judge whether a particular run of a randomized algorithm would be good or bad Horvitz et al. (2001) (in contrast to our topic of predicting performance directly using a regression model). In the machine learning community, meta-learning aims to predict the accuracy of learning algorithms Vilalta & Drissi (2002). Meta-level control for anytime algorithms computes estimates of an algorithm’s performance in order to decide when to stop it and act on the solution found Hansen & Zilberstein (1996). Algorithm scheduling in parallel and distributed systems has long relied on low-level performance predictions, for example based on source code analysis Nudd et al. (2000). In principle, the methods discussed in this article could also be applied to meta-level control and algorithm scheduling.
Other research has aimed to identify single quantities that correlate with an algorithm’s runtime. A famous early example is the clauses-to-variables ratio for uniform-random 3-SAT Cheeseman et al. (1991); Mitchell et al. (1992). Earlier still, Knuth showed how to use random probes of a search tree to estimate its size Knuth (1975); subsequent work refined this approach Lobjois & Lemaître (1998); Kilby et al. (2006). We incorporated such predictors as features in our own work and therefore do not evaluate them separately. (We note, however, that we have found Knuth’s tree-size estimate to be very useful for predicting runtime in some cases, e.g., for complete SAT solvers on unsatisfiable 3-SAT instances Nudelman et al. (2004).) The literature on search space analysis has proposed a variety of quantities correlated with the runtimes of (mostly) local search algorithms. Prominent examples include fitness distance correlation Jones & Forrest (1995) and autocorrelation length (ACL) Weinberger (1990). With one exception (ACL for TSP) we have not included such measures in our feature sets, as computing them can be quite expensive.
2.2 Related Work on Predicting Runtime of Parameterized Algorithms
In principle, it is not particularly harder to predict the runtimes of parameterized algorithms than the runtimes of their parameterless cousins: parameters can be treated as additional inputs to the model (notwithstanding the fact that they describe the algorithm rather than the problem instance, and hence are directly controllable by the experimenter), and a model can be learned in the standard way. In past work, we pursued precisely this approach, using both linear regression models and exact Gaussian processes to model the dependency of runtime on both instance features and algorithm parameter values Hutter et al. (2006). However, this direct application of methods designed for parameterless algorithms is effective only for small numbers of continuous-valued parameters (e.g., the experiments in Hutter et al. (2006) considered only two parameters). Different methods are more appropriate when an algorithm’s parameter space becomes very large. In particular, a careful sampling strategy must be used, making it necessary to consider issues raised in the statistics literature on experimental design. Separately, models must be adjusted to deal with categorical parameters: parameters with finite, unordered domains (e.g.
, selecting which of various possible heuristics to use, or activating an optional preprocessing routine).
The experimental design literature uses the term response surface model (RSM) to refer to a predictor for the output of a process with controllable input parameters that can generalize from observed data to new, unobserved parameter settings (see, e.g., Box & Wilson, 1951; Box & Draper, 2007). Such RSMs are at the core of sequential model-based optimization methods for blackbox functions Jones et al. (1998), which have recently been adapted to applications in automated parameter tuning and algorithm configuration (see, e.g., Bartz-Beielstein et al., 2005; Bartz-Beielstein, 2006; Hutter et al., 2009, 2010, 2011b).
Most of the literature on RSMs of algorithm performance has limited its consideration to algorithms running on single problem instances and algorithms only with continuous input parameters. We are aware of a few papers beyond our own that relax these assumptions. Bartz-Beielstein & Markon (2004) support categorical algorithm parameters (using regression tree models), and two existing methods consider predictions across both different instances and parameter settings. First, Ridge & Kudenko (2007)
applied an analysis of variance (ANOVA) approach to detect important parameters, using linear and quadratic models. Second,Chiarandini & Goegebeur (2010) noted that in constrast to algorithm parameters, instance characteristics cannot be controlled and should be treated as so-called random effects. Their resulting mixed-effects models are linear and, like Ridge & Kudenko’s ANOVA model, assume Gaussian performance distributions. We note that this normality assumption is much more realistic in the context of predicting solution quality of local search algorithms (the problem addressed in Chiarandini & Goegebeur (2010)) than in the context of the algorithm runtime prediction problem we tackle here.
2.3 Related Work on Applications of Runtime Prediction to Gain Insights into Instance Hardness and Algorithm Parameters
Leyton-Brown and co-authors Leyton-Brown et al. (2002); Nudelman et al. (2004); Leyton-Brown et al. (2009) employed forward selection with linear regression models to determine small sets of instance features that suffice to yield high-quality predictions, finding that often as little as five to ten features yielded predictions as good as the full feature set. Hutter et al. (2013b) extended that work to predictions in the joint space of instance features and algorithm parameters, using arbitrary models. Two model-specific approaches for this joint identification of instance features and algorithm parameters are the ANOVA approach of Ridge & Kudenko (2007) and the mixed-effects model of Chiarandini & Goegebeur (2010) mentioned previously. Other approaches for quantifying parameter importance include an entropy-based measure Nannen & Eiben (2007), and visualization methods for interactive parameter exploration Bartz-Beielstein (2006).
3 Methods Used in Related Work
We now define the different machine learning methods that have been used to predict algorithm runtimes: ridge regression (used by Brewer (1994, 1995); Leyton-Brown et al. (2002, 2009); Nudelman et al. (2003, 2004); Hutter et al. (2006); Xu et al. (2007, 2008); Huang et al. (2010)), neural networks (see Smith-Miles & van Hemert (2011)), Gaussian process regression (see Hutter et al. (2006)), and regression trees (see Bartz-Beielstein & Markon (2004)). This section provides the basis for the experimental evaluation of different methods in Sections 6, 7, and 8; thus, we also discuss implementation details.
We describe a problem instance by a list of features , drawn from a given feature space . These features must be computable by a piece of problem-specific code (usually provided by a domain expert) that efficiently extracts characteristics for any given problem instance (typically, in low-order polynomial time w.r.t. to the size of the given problem instance). We define the configuration space of a parameterized algorithm with parameters with respective domains as a subset of the cross-product of parameter domains: . The elements of are complete instantiations of the algorithm’s parameters, and we refer to them as configurations. Taken together, the configuration and the feature spaces define the input space: .
denote the space of probability distributions over the real numbers; we will use these real numbers to represent an algorithm performance measure, such as runtime in seconds on some reference machine. (In principle, EPMs can predict any type of performance measure that can be evaluated in single algorithm runs, such as runtime, solution quality, memory usage, energy consumption, or communication overhead.) Given an algorithmwith configuration space and a distribution of instances with feature space , an EPM is a stochastic process that defines a probability distribution over performance measures for each combination of a parameter configuration of and a problem instance with features . The prediction of an entire distribution allows us to assess the model’s confidence at a particular input, which is essential, e.g., in model-based algorithm configuration Bartz-Beielstein et al. (2005); Bartz-Beielstein (2006); Hutter et al. (2009, 2011b). Nevertheless, since many of the methods we review yield only point-valued runtime predictions, our experimental analysis focuses on the accuracy of mean predicted runtimes. For the models that define a predictive distribution (Gaussian processes and our variant of random forests), we study the accuracy of confidence values separately in the online appendix, with qualitatively similar results as for mean predictions.
To construct an EPM for an algorithm with configuration space on an instance set , we run on various combinations of configurations and instances , and record the resulting performance values . We record the -dimensional parameter configuration and the -dimensional feature vector of the instance used in the th run, and combine them to form a -dimensional vector of predictor variables . The training data for our regression models is then simply . We use to denote the matrix containing (the so-called design matrix) and for the vector of performance values .
Various transformations can make this data easier to model. In this article, we focus on runtime as a performance measure and use a log-transformation, thus effectively predicting log runtime.333Due to the resolution of our CPU timer, runtimes below seconds are measured as seconds. To make well defined in these cases, we count them as (which, in log space, has the same distance from 0.01 as the next bigger value measurable with our CPU timer, 0.02).
In our experience, we have found this transformation to be very important due to the large variation in runtimes for hard combinatorial problems. We also transformed the predictor variables, discarding those input dimensions constant across all training data points and normalizing the remaining ones to have mean 0 and standard deviation 1 (i.e., for each input dimension we subtracted the mean and then divided by the standard deviation).
For some instances, certain feature values can be missing because of timeouts, crashes, or because they are undefined (when preprocessing has already solved an instance). These missing values occur relatively rarely, so we use a simple mechanism for handling them. We disregard missing values for the purposes of normalization, and then set them to zero for training our models. This means that missing feature values are effectively assumed to be equal to the mean for the respective distribution and thus to be minimally informative. In some models (ridge regression and neural networks), this mechanism leads us to ignore missing features, since their weight is multiplied by zero.
Most modeling methods discussed in this paper have free hyperparameters that can be set by minimizing some loss function, such as cross-validation error. We point out these hyper-parameters, as well as their default setting, when discussing each of the methods. While, to the best of our knowledge, all previous work on runtime prediction has used fixed default hyperparameters, we also experimented with optimizing them for every method in our experiments. For this purpose, we used the gradient-free optimizer DIRECTJones et al. (1993) to minimize 2-fold cross-validated root mean squared error (RMSE) on the training set with a budget of 30 function evaluations. This simple approach is a better alternative than the frequently-used grid search and random search Bergstra & Bengio (2012).
3.2 Ridge Regression
Ridge regression (see, e.g., Bishop, 2006) is a simple regression method that fits a linear function of its inputs . Due to its simplicity (both conceptual and computational) and its interpretability, combined with competitive predictive performance in most scenarios we studied, this is the method that has been used most frequently in the past for building EPMs Fink (1998); Howe et al. (2000); Leyton-Brown et al. (2002, 2009); Nudelman et al. (2004); Hutter et al. (2006); Xu et al. (2007).
Ridge regression works as follows. Let and be as defined above, let be the identity matrix, and let be a small constant. Then, compute the weight vector
Given a new feature vector, , ridge regression predicts . Observe that with , we recover standard linear regression. The effect of is to regularize the model by penalizing large coefficients ; it is equivalent to a Gaussian prior favouring small coefficients under a Bayesian model (see, e.g., Bishop (2006)). A beneficial side effect of this regularization is that numerical stability improves in the common case where is rank deficient, or nearly so. The computational bottleneck in ridge regression with input dimensions is the inversion of the matrix , which requires time cubic in .
Algorithm runtime can often be better approximated by a polynomial function than by a linear one, and the same holds for log runtimes. For that reason, it can make sense to perform a basis function expansion to create new features that are products of two or more original features. In light of the resulting increase in the number of features, a quadratic expansion is particularly appealing. Formally, we augment each model input with pairwise product inputs for and .
Even with ridge regularization, the generalization performance of linear regression (and, indeed, many other learning algorithms) can deteriorate when some inputs are uninformative or highly correlated with others; in our experience, it is difficult to construct sets of instance features that do not suffer from these problems. Instead, we reduce the set of input features by performing feature selection. Many different methods exist for feature expansion and selection; we review two different ridge regression variants from the recent literature that only differ in these design decisions.444We also considered a third ridge regression variant that was originally proposed by Leyton-Brown et al. (2009) (“ridge regression with elimination of redundant features”, or RR-el for short). Unfortunately, running this method was computationally infeasible, considering the large number of features we consider in this paper, (a) forcing us to approximate the method, and (b) nevertheless preventing us from performing 10-fold cross-validation. Because these hurdles made it impossible to fairly compare RR-el to other methods, we do not discuss RR-el here. However, for completeness, our online appendix includes both a definition of our approximation to RR-el and experimental results showing it to perform worse than ridge regression variant RR in 34/35 cases.
For more than half a decade, we used a simple and scalable feature selection method based on forward selection (see e.g., Guyon et al., 2006) to build the regression models used by SATzilla Xu et al. (2007, 2008). This iterative method starts with an empty input set, greedily adds one linear input at a time to minimize cross-validation error at each step, and stops when linear inputs have been selected. It then performs a full quadratic expansion of these linear features (using the original, unnormalized features, and then normalizing the resulting quadratic features again to have mean zero and standard deviation one). Finally, it carries out another forward selection with the expanded feature set, once more starting with an empty input set and stopping when features have been selected. The reason for the two-phase approach is scalability: this method prevents us from ever having to perform a full quadratic expansion of our features. (For example, we have often employed over features and a million runtime measurements; in this case, a full quadratic expansion would involve over billion feature values.)
Our implementation reduces the computational complexity of forward selection by exploiting the fact that the inverse matrix resulting from including one additional feature can be computed incrementally by two rank-one updates of the previous inverse matrix , requiring quadratic time rather than cubic time Sherman & Morrison (1949).
In our experiments, we fixed the number of linear inputs to in order to keep the result of a full quadratic basis function expansion manageable in size (with 1 million data points, the resulting matrix has , or about million elements). The maximum number of quadratic terms and the ridge penalizer are free parameters of this method; by default, we used and .
3.2.2 Ridge Regression Variant SPORE-FoBa: Forward-backward selection Huang et al. (2010)
Recently, Huang et al. (2010) described a method for predicting algorithm runtime that they called Sparse POlynomial REgression (SPORE), which is based on ridge regression with forward-backward (FoBa) feature selection.555Although this is not obvious from their publication Huang et al. (2010), the authors confirmed to us that FoBa uses ridge rather than LASSO regression, and also gave us their original code. Huang et al. concluded that SPORE-FoBa outperforms lasso regression, which is consistent with the comparison to lasso by Leyton-Brown et al. (2009). In contrast to the RR variants above, SPORE-FoBa employs a cubic feature expansion (based on its own normalizations of the original predictor variables). Essentially, it performs a single pass of forward selection, at each step adding a small set of terms determined by a forward-backward phase on a feature’s candidate set. Specifically, having already selected a set of terms based on raw features , SPORE-FoBa loops over all raw features , constructing a candidate set that consists of all polynomial expansions of that include with non-zero degree and whose total degree is bounded by 3. For each such candidate set , the forward-backward phase iteratively adds the best term , if its reduction of root mean squared error (RMSE) exceeds a threshold (forward step), and then removes the worst term , if its reduction of RMSE is below (backward step). This phase terminates when no single term can be added to reduce RMSE by more than . Finally, SPORE-FoBa’s outer forward selection loop chooses the set of terms resulting from the best of its forward-backward phases, and iterates until the number of terms in reach a prespecified maximum of terms. In our experiments, we used the original SPORE-FoBa code; its free parameters are the ridge penalizer , , and , with defaults , , and .
3.3 Neural Networks
Neural networks are a well-known regression method inspired by information processing in the human brain. The multilayer perceptron (MLP) is a particularly popular type of neural network that organizes single computational units (“neurons”) in layers (input, hidden, and output layers), using the outputs of all units in a layer as the inputs of all units in the next layer. Each neuronin the hidden and output layers with inputs has an associated weight term vector and a bias term , and computes a function
. For neurons in the hidden layer, the result of this function is further propagated through a nonlinear activation function(which is often chosen to be ). Given an input , a network with a single hidden layer of neurons and a single output neuron then computes output
The weight terms and bias terms can be combined into a single weight vector , which can be set to minimize the network’s prediction error using any continuous optimization algorithm (e.g.
, the classic “backpropagation” algorithm performs gradient descent to minimize squared prediction error).
Smith-Miles & van Hemert (2011) used an MLP with one hidden layer of 28 neurons to predict the runtime of local search algorithms for solving timetabling instances. They used the proprietary neural network software Neuroshell, but advised us to compare to an off-the-shelf Matlab implementation instead. We thus employed the popular Matlab neural network package NETLAB Nabney (2002). NETLAB uses activation function and supports a regularizing prior to keep weights small, minimizing the error metric , where is a parameter determining the strength of the prior. In our experiments, we used NETLAB’s default optimizer (scaled conjugate gradients, SCG) to minimize this error metric, stopping the optimization after the default of 100 SCG steps. Free parameters are the regularization factor and the number of hidden neurons ; we used NETLAB’s default and, like Smith-Miles & van Hemert (2011), .
3.4 Gaussian Process Regression
Stochastic Gaussian processes (GPs) Rasmussen & Williams (2006) are a popular class of regression models with roots in geostatistics, where they are also called Kriging models Krige (1951). GPs are the dominant modern approach for building response surface models Sacks et al. (1989); Jones et al. (1998); Santner et al. (2003); Bartz-Beielstein (2006). They were first applied to runtime prediction by Hutter et al. (2006), who found them to yield better results than ridge regression, albeit at greater computational expense.
To construct a GP regression model, we first need to select a kernel function , characterizing the degree of similarity between pairs of elements of the input space . A variety of kernel functions are possible, but the most common choice for continuous inputs is the squared exponential kernel
where are kernel parameters. It is based on the idea that correlations decrease with weighted Euclidean distance in the input space (weighing each dimension by a kernel parameter ). In general, such a kernel defines a prior distribution over the type of functions we expect. This distribution takes the form of a Gaussian stochastic process
: a collection of random variables such that any finite subset of them has a joint Gaussian distribution. What remains to be specified is the tradeoff between the strength of this prior and fitting observed data, which is set by specifying theobservation noise
. Standard GPs assume normally distributed observation noise with mean zero and variance, where , like the kernel parameters , can be optimized to improve the fit. Combining the prior specified above with the training data yields the posterior distribution at a new input point (see the book by Rasmussen & Williams (2006) for a derivation):
with mean and variance
The GP equations above assume fixed kernel parameters and fixed observation noise variance . These constitute the GP’s hyperparameters. In contrast to hyperparameters in other models, the number of GP hyperparameters grows with the input dimensionality, and their optimization is an integral part of fitting a GP: they are typically set by maximizing the marginal likelihood of the data with a gradient-based optimizer (again, see Rasmussen & Williams (2006) for details). The choice of optimizer can make a big difference in practice; we used the minFunc Schmidt (2012) implementation of a limited-memory version of BFGS Nocedal & Wright (2006).
Learning a GP model from data can be computationally expensive. Inverting the matrix takes time and has to be done in every of the hyperparameter optimization steps, yielding a total complexity of . Subsequent predictions at a new input require only time and for the mean and the variance, respectively.
3.5 Regression Trees
Regression trees Breiman et al. (1984) are simple tree-based regression models. They are known to handle discrete inputs well; their first application to the prediction of algorithm performance was by Bartz-Beielstein & Markon (2004). The leaf nodes of regression trees partition the input space into disjoint regions , and use a simple model for prediction in each region ; the most common choice is to predict a constant . This leads to the following prediction for an input point :
where the indicator function takes value if the proposition is true and otherwise. Note that since the regions partition the input space, this sum will always involve exactly one non-zero term. We denote the subset of training data points in region as . Under the standard squared error loss function , the error-minimizing choice of constant in region is then the sample mean of the data points in :
To construct a regression tree, we use the following standard recursive procedure, which starts at the root of the tree with all available training data points . We consider binary partitionings of a given node’s data along split variables j and split points s. For a real-valued split variable , is a scalar and data point is assigned to region if and to region otherwise. For a categorical split variable , is a set, and data point is assigned to region if and to region otherwise. At each node, we select split variable and split point to minimize the sum of squared differences to the regions’ means,
where and are chosen according to Equation (3) as the sample means in regions and , respectively. We continue this procedure recursively, finding the best split variable and split point, partitioning the data into two child nodes, and recursing into the child nodes. The process terminates when all training data points in a node share the same values, meaning that no more splits are possible. This procedure tends to overfit data, which can be mitigated by recursively pruning away branches that contribute little to the model’s predictive accuracy. We use cost-complexity pruning with 10-fold cross-validation to identify the best tradeoff between complexity and predictive quality; see the book by Hastie et al. (2009) for details.
In order to predict the response value at a new input, , we propagate down the tree, that is, at each node with split variable and split point , we continue to the left child node if (for real-valued variable ) or
(for categorical variable), and to the right child node otherwise. The predictive mean for is the constant in the leaf that this process selects; there is no variance predictor.
3.5.1 Complexity of Constructing Regression Trees
If implemented efficiently, the computational cost of fitting a regression tree is small. At a single node with data points of dimensionality , it takes time to identify the best combination of split variable and point, because for each continuous split variable , we can sort the values and only consider up to possible split points between different values. The procedure for categorical split variables has the same complexity: we consider each of the variable’s categorical values , compute score across the node’s data points, sort by these scores, and only consider the binary partitions with consecutive scores in each set. For the squared error loss function we use, the computation of (see Equation (4)) can be performed in amortized time for each of ’s split points , such that the total time required for determining the best split point of a single variable is . The complexity of building a regression tree depends on how balanced it is. In the worst case, one data point is split off at a time, leading to a tree of depth and a complexity of , which is . In the best case—a balanced tree—we have the recurrence relation , leading to a complexity of . In our experience, trees are not perfectly balanced, but are much closer to the best case than to the worst case. For example, data points typically led to tree depths between 25 and 30 (whereas ).
Prediction with regression trees is cheap; we merely need to propagate new query points down the tree. At each node with continuous split variable and split point , we only need to compare to , an operation. For categorical split variables, we can store a bit mask of the values in to enable member queries. In the worst case (where the tree has depth ), prediction thus takes time, and in the best (balanced) case it takes time.
4 New Modeling Techniques for EPMs
In this section we extend existing modeling techniques for EPMs, with the primary goal of improving runtime predictions for highly parameterized algorithms. The methods described here draw on advanced machine learning techniques, but, to the best of our knowledge, our work is the first to have applied them for algorithm performance prediction. More specifically, we show how to extend all models to handle categorical inputs (required for predictions in partially categorical configuration spaces) and describe two new model families well-suited to modeling the performance of highly parameterized algorithms based on potentially large amounts of data: the projected process approximation to Gaussian processes and random forests of regression trees.
4.1 Handling Categorical Inputs
Empirical performance models have historically been limited to continuous-valued inputs; the only approach that has so far been used for performance predictions based on discrete-valued inputs is regression trees Bartz-Beielstein & Markon (2004). In this section, we first present a standard method for encoding categorical parameters as real-valued parameters, and then present a kernel for handling categorical inputs more directly in Gaussian processes.
4.1.1 Extension of Existing Methods Using 1-in- Encoding
A standard solution for extending arbitrary modeling techniques to handle categorical inputs is the so-called -in- encoding scheme (see, e.g., Bishop, 2006), which encodes categorical inputs with finite domain size as binary inputs. Specifically, if the th column of the design matrix is categorical with domain , we replace it with binary indicator columns, where the new column corresponding to each contains values ; for each data point, exactly one of the new columns is 1, and the rest are all 0. After this transformation, the new columns are treated exactly like the original real-valued columns, and arbitrary modeling techniques for numerical inputs become applicable.
4.1.2 A Weighted Hamming Distance Kernel for Categorical Inputs in GPs
A problem with the 1-in- encoding is that using it increases the size of the input space considerably, causing some regression methods to perform poorly. We now define a kernel for handling categorical inputs in GPs more directly. Our kernel is similar to the standard squared exponential kernel of Equation (1), but instead of measuring the (weighted) squared distance, it computes a (weighted) Hamming distance:
For a combination of continuous and categorical input dimensions and , we combine the two kernels:
Although is a straightforward adaptation of the standard kernel in Equation (1), we are not aware of any prior use of it. To use this kernel in GP regression, we have to show that it is positive definite.
Definition 1 (Positive definite kernel).
A function is a positive definite kernel iff it is (1) symmetric: for any pair of inputs , satisfies ; and (2) positive definite: for any inputs and any constants , satisfies .
Proposition 2 ( is positive definite).
For any combination of continuous and categorical input dimensions and , is a positive definite kernel function.
Appendix B in the online appendix provides the proof, which shows that can be constructed from simpler positive definite functions, and uses the facts that the space of positive definite kernel functions is closed under addition and multiplication.
Our new kernel can be understood as implicitly performing a 1-in- encoding. Note that Kernel has one hyperparameter for each input dimension. By using a 1-in- encoding and kernel instead, we end up with one hyperparameter for each encoded dimension; if we then reparameterize to share a single hyperparameter across the encoded dimensions resulting from a single original input dimension , we recover .
Since is rather expressive, one may worry about overfitting. Thus, we also experimented with two variations: (1) sharing the same hyperparameter across all input dimensions; and (2) sharing across algorithm parameters and across instance features. We found that neither variation outperformed .
4.2 Scaling to Large Amounts of Data with Approximate Gaussian Processes
The time complexity of fitting Gaussian processes is cubic in the number of data points, which limits the amount of data that can be used in practice to fit these models. To deal with this obstacle, the machine learning literature has proposed various approximations to Gaussian processes (see, e.g., Quinonero-Candela et al., 2007). To the best of our knowledge, these approximate GPs have previously been applied to runtime prediction only in our work on parameter optimization Hutter et al. (2010) (considering parameterized algorithms, but only single problem instances). We experimented with the Bayesian committee machine Tresp (2000), the informative vector machine Lawrence et al. (2003), and the projected process (PP) approximation Rasmussen & Williams (2006). All of these methods performed similarly, with the PP approximation having a slight edge. Below, we give the equations for the PP’s predictive mean and variance; for a derivation, see the Rasmussen & Williams (2006).
The PP approximation to GPs uses a subset of of the training data points, the so-called active set. Let be a vector consisting of the indices of these data points. We extend the notation for exact GPs (see Section 3.4) as follows: let denote the by matrix with and let denote the by matrix with . The predictive distribution of the PP approximation is then a normal distribution with mean and variance
We perform steps of hyperparameter optimization based on a standard GP, trained using a set of data points sampled uniformly at random without replacement from the input data points. We then use the resulting hyperparameters and another independently sampled set of data points (sampled in the same way) for the subsequent PP approximation. In both cases, if , we only use data points.
The complexity of the PP approximation is superlinear only in ; therefore, the approach is much faster when we choose . The hyperparameter optimization based on data points takes time . In addition, there is a one-time cost of for evaluating the PP equations. Thus, the time complexity for fitting the approximate GP model is , as compared to for the exact GP model. The time complexity for predictions with this PP approximation is for the mean and for the variance of the predictive distribution Rasmussen & Williams (2006), as compared to and , respectively, for the exact version. In our experiments, we set and to achieve a good compromise between speed and predictive accuracy.
4.3 Random Forest Models
Regression trees, as discussed in Section 3.5, are a flexible modeling technique that is particularly effective for discrete input data. However, they are also well known to be sensitive to small changes in the data and are thus prone to overfitting. Random forests Breiman (2001) overcome this problem by combining multiple regression trees into an ensemble. Known for their strong predictions for high-dimensional and discrete input data, random forests are an obvious choice for runtime predictions of highly parameterized algorithms. Nevertheless, to the best of our knowledge, they have not been used for algorithm runtime prediction except in our own recent work on algorithm configuration Hutter et al. (2010, 2011b, 2011a, 2012), which used a prototype implementation of the models we describe here.666Note that random forests have also been found to be effective in predicting the approximation ratio of 2-opt on Euclidean TSP instances Mersmann et al. (2013). In the following, we describe the standard RF framework and some nonstandard implementation choices we made.
4.3.1 The Standard Random Forest Framework
A random forest (RF) consists of a set of regression trees. If grown to sufficient depths, regression trees are extraordinarily flexible predictors, able to capture very complex interactions and thus having low bias. However, this means they can also have high variance: small changes in the data can lead to a dramatically different tree. Random forests Breiman (2001) reduce this variance by aggregating predictions across multiple different trees. (This is an alternative to the pruning procedure described previously; thus, the trees in random forests are not pruned, but are rather grown until each node contains no more than data points.) These trees are made to be different by training them on different subsamples of the training data, and/or by permitting only a random subset of the variables as split variables at each node. We chose the latter option, using the full training set for each tree. (We did experiment with a combination of the two approaches, but found that it yielded slightly worse performance.)
Mean predictions for a new input are trivial: predict the response for with each tree and average the predictions. The predictive quality improves as the number of trees, , grows, but computational cost also grows linearly in . We used throughout our experiments to keep computational costs low. Random forests have two additional hyperparameters: the percentage of variables to consider at each split point, perc, and the minimal number of data points required in a node to make it eligible to be split further, . We set and by default.
4.3.2 Modifications to Standard Random Forests
We introduce a simple, yet effective, method for quantifying predictive uncertainty in random forests. (Our method is similar in spirit to that of Meinshausen (2006)
, who recently introduced quantile regression trees, which allow for predictions of quantiles of the predictive distribution; in contrast, we predict a mean and a variance.) In each leaf of each regression tree, in addition to the empirical mean of the training data associated with that leaf, we store the empirical variance of that data. To avoid making deterministic predictions for leaves with few data points, we round the stored variance up to at least the constant; we set throughout. For any input, each regression tree thus yields a predictive mean and a predictive variance . To combine these estimates into a single estimate, we treat the forest as a mixture model of different models. We denote the random variable for the prediction of tree as and the overall prediction as , and then have where is a multinomial variable with for . The mean and variance for can then be expressed as:
Thus, our predicted mean is simply the mean across the means predicted by the individual trees in the random forest. To compute the variance prediction, we used the law of total variance (see, e.g., Weiss, 2005), which allows us to write the total variance as the variance across the means predicted by the individual trees (predictions are uncertain if the trees disagree), plus the average variance of each tree (predictions are uncertain if the predictions made by individual trees tend to be uncertain).
A second non-standard ingredient in our models concerns the choice of split points. Consider splits on a real-valued variable . Note that when the loss in Equation (4) is minimized by choosing split point between the values of and , we are still free to choose the exact location of anywhere in the interval . Traditionally, is chosen as the midpoint between and . Instead, here we draw it uniformly at random from
. In the limit of an infinite number of trees, this leads to a linear interpolation of the training data instead of a partition into regions of constant prediction. Furthermore, it causes variance estimates to vary smoothly and to grow with the distance from observed data points.
4.3.3 Complexity of Fitting Random Forests
The computational cost for fitting a random forest is relatively low. We need to fit regression trees, each of which is somewhat easier to fit than a normal regression tree, since at each node we only consider out of the possible split variables. Building trees simply takes times as long as building a single tree. Thus—by the same argument as for regression trees—the complexity of learning a random forest is in the worst case (splitting off one data point at a time) and in the best case (perfectly balanced trees). Our random forest implementation is based on a port of Matlab’s regression tree code to C, which yielded speedups of between one and two orders of magnitude.
Prediction with a random forest model entails predicting with regression trees (plus an computation to compute the mean and variance across those predictions). The time complexity of a single prediction is thus in the worst case and for perfectly balanced trees.
5 Problem-Specific Instance Features
While the methods we have discussed so far could be used to model the performance of any algorithm for solving any problem, in our experiments, we investigated specific NP-complete problems. In particular, we considered the propositional satisfiability problem (SAT), mixed integer programming (MIP) problems, and the travelling salesperson problem (TSP). Our reasons for choosing these three problems are as follows. SAT is the prototypical NP-hard decision problem and is thus interesting from a theory perspective; modern SAT solvers are also one of the most prominent approaches in hardware and software verification Prasad et al. (2005). MIP is a canonical representation for constrained optimization problems with integer-valued and continuous variables, which serves as a unifying framework for NP-complete problems and combines the expressive power of integrality constraints with the efficiency of continuous optimization. As a consequence, it is very widely used both in academia and industry International Business Machines Corp. (2011). Finally, TSP is one of the most widely studied NP-hard optimization problems, and also of considerable interest for industry Cook (2012a).
We tailor EPMs to a particular problem through the choice of instance features.777If features are unavailable for an NP-complete problem of interest, one alternative is to reduce the problem to SAT, MIP, or TSP—a polynomial-time operation—and then compute some of the features we describe here. We do not expect this approach to be computationally efficient, but do observe that it extends the reach of existing EPM construction techniques to any NP-complete problem. Here we describe comprehensive sets of features for SAT, MIP, and TSP. For each of these problems, we summarize sets of features found in the literature and introduce many novel features. While all these features are polynomial-time computable, we note that some of them can be computationally expensive for very large instances (e.g., taking cubic time). For some applications such expensive features will be reasonable—in particular, we note that for applications that take features as a one-time input, but build models repeatedly, it can even make sense to use features whose cost exceeds that of solving the instance; examples of such applications include model-based algorithm configuration Hutter et al. (2011b) and complex empirical analyses based on performance predictions Hutter et al. (2010b, 2013b). In runtime-sensitive applications, on the other hand, it may make sense to use only a subset of the features described here. To facilitate this, we categorize all features into one of four “cost classes”: trivial, cheap, moderate, and expensive. In our experimental evaluation, we report the empirical cost of these feature classes and the predictive performance that can be achieved using them (see Table 3 on page 3). We also identify features introduced in this work and quantify their contributions to model performance.
Probing features are a generic family of features that deserves special mention. They are computed by briefly running an existing algorithm for the given problem on the given instance and extracting characteristics from that algorithm’s trajectory—an idea closely related to that of landmarking in meta-learning Pfahringer et al. (2000). Probing features can be defined with little effort for a wide variety of problems; indeed, in earlier work, we introduced the first probing features for SAT Nudelman et al. (2004) and showed that probing features based on one type of algorithm (e.g., local search) are often useful for predicting the performance of another type of algorithm (e.g., tree search). Here we introduce the first probing features for MIP and TSP. Another new, generic family of features are timing features, which measure the time other groups of features take to compute. Code and binaries for computing all our features, along with documentation providing additional details, are available online at http://www.cs.ubc.ca/labs/beta/Projects/EPMs/.
5.1 Features for Propositional Satisfiability (SAT)
Figure 1 summarizes 138 features for SAT. Since various preprocessing techniques are routinely used before applying a general-purpose SAT solver and typically lead to substantial reductions in instance size and difficulty (especially for industrial-like instances), we apply the preprocessing procedure SATElite Eén & Biere (2005) on all instances first, and then compute instance features on the preprocessed instances. The first 90 features, with the exception of features 22–26 and 32–36, were introduced in our previously published work on SATzilla Nudelman et al. (2004); Xu et al. (2008). They can be categorized as problem size features (1–7), graph-based features (8–36), balance features (37–49), proximity to Horn formula features (50–55), DPLL probing features (56–62), LP-based features (63–68), and local search probing features (69–90).
Our new features (devised over the last five years in our ongoing work on SATzilla and so far only mentioned in short solver descriptions Xu et al. (2009, 2012b)) fall into four categories. First, we added two additional subgroups of graph-based features. Our new diameter features 22–26 are based on the variable graph Herwig (2006). For each node in that graph, we compute the longest shortest path between and any other node. As with most of the features that follow, we then compute various statistics over this vector (e.g., mean, max); we do not state the exact statistics for each vector below but list them in Figure 1. Our new clustering coefficient features 32–36 measure the local cliqueness of the clause graph. For each node in the clause graph, let denote the number of edges present between the node and its neighbours, and let denote the maximum possible number of such edges; we compute for each node.
Second, our new clause learning features (91–108) are based on statistics gathered in 2-second runs of Zchaff_rand Mahajan et al. (2005). We measure the number of learned clauses (features 91–99) and the length of the learned clauses (features 100–108) after every 1000 search steps. Third, our new survey propagation features (109–126) are based on estimates of variable bias in a SAT formula obtained using probabilistic inference Hsu et al. (2008). We used VARSAT’s implementation to estimate the probabilities that each variable is true in every satisfying assignment, false in every satisfying assignment, or unconstrained. Features 109–117 measure the confidence of survey propagation (that is, for each variable ) and features 118–126 are based on the vector.
Finally, our new timing features (127–138) measure the time taken by 12 different blocks of feature computation code: instance preprocessing by SATElite, problem size (1–6), variable-clause graph (clause node) and balance features (7, 13–17, 37–41, 47–49); variable-clause graph (variable node), variable graph and proximity to Horn formula features (8–12, 18–21, 42–46, 50–55); diameter-based features (22–26); clause graph features (27–36); unit propagation features (56–60); search space size estimation (61–62); LP-based features (63–68); local search probing features (69–90) with SAPS and GSAT; clause learning features (91–108); and survey propagation features (109–126).
5.2 Features for Mixed Integer Programs
Figure 2 summarizes 121 features for mixed integer programs (i.e., MIP instances). These include 101 features based on existing work Leyton-Brown et al. (2009); Hutter (2009); Kadioglu et al. (2010), 15 new probing features, and 5 new timing features. Features 1–101 are primarily based on features for the combinatorial winner determination problem from our past work Leyton-Brown et al. (2009), generalized to MIP and previously only described in a Ph.D. thesis Hutter (2009). These features can be categorized as problem type & size features (1–25), variable-constraint graph features (26–49), linear constraint matrix features (50–73), objective function features (74–91), and LP-based features (92–95). We also integrated ideas from the feature set used by Kadioglu et al. (2010) (right-hand side features (96–101) and the computation of separate statistics for continuous variables, non-continuous variables, and their union). We extended existing features by adding richer statistics where applicable: medians, variation coefficients (vc), and percentile ratios (q90/q10) of vector-based features.
We introduce two new sets of features. Firstly, our new MIP probing features 102–116 are based on 5-second runs of CPLEX with default settings. They are obtained via the CPLEX API and include 6 presolving features based on the output of CPLEX’s presolving phase (102–107); 5 probing cut usage features describing the different cuts CPLEX used during probing (108–112); and 4 probing result features summarizing probing runs (113–116). Secondly, our new timing features 117–121 capture the CPU time required for computing five different groups of features: variable-constraint graph, linear constraint matrix, and objective features for three subsets of variables (“continuous”, “non-continuous”, and “all”, 26–91); LP-based features (92–95); and CPLEX probing features (102–116). The cost of computing the remaining features (1–25, 96–101) is small (linear in the number of variables or constraints).
5.3 Features for the Travelling Salesperson Problem (TSP)
Figure 3 summarizes 64 features for the travelling salesperson problem (TSP). Features 1–50 are new, while Features 51–64 were introduced by Smith-Miles et al. (2010). Features 51–64 capture the spatial distribution of nodes (features 51–61) and clustering of nodes (features 62–64); we used the authors’ code (available at http://www.vanhemert.co.uk/files/TSP-feature-extract-20120212.tar.gz) to compute these features.
Our 50 new TSP features are as follows.888In independent work, Mersmann et al. (2013) have introduced feature sets similar to some of those described here. The problem size feature (1) is the number of nodes in the given TSP. The cost matrix features (2–4) are statistics of the cost between two nodes. Our minimum spanning tree features (5–11) are based on constructing a minimum spanning tree over all nodes in the TSP: features 5–8 are the statistics of the edge costs in the tree and features 9–11 are based on its node degrees. Our cluster distance features (12–14) are based on the cluster distance between every pair of nodes, which is the minimum bottleneck cost of any path between them; here, the bottleneck cost of a path is defined as the largest cost along the path. Our local search probing features (15–32) are based on 20 short runs (1000 steps each) of LK Lin & Kernighan (1973), using the implementation available from Cook (2012b). Specifically, features 15–17 are based on the tour length obtained by LK; features 18–20, 21–23, and 24–26 are based on the tour length of local minima, the tour quality improvement per search step, and the number of search steps to reach a local minimum, respectively; features 27–29 measure the Hamming distance between two local minima; and features 30–32 describe the probability of edges appearing in any local minimum encountered during probing. Our branch and cut probing features (33–43) are based on 2-second runs of Concorde. Specifically, features 33–35 measure the improvement of lower bound per cut; feature 36 is the ratio of upper and lower bound at the end of the probing run; and features 37–43 analyze the final LP solution. Feature 44 is the autocorrelation coefficient: a measure of the ruggedness of the search landscape, based on an uninformed random walk (see, e.g., Hoos & Stützle (2005)). Finally, our timing features 45–50 measure the CPU time required for computing feature groups 2–7 (the cost of computing the number of nodes can be ignored).
6 Performance Predictions for New Instances
We now study the performance of the models described in Sections 3 and 4, using (various subsets of) the features described in Section 5. In this section, we consider the (only) problem considered by most past work: predicting the performance achieved by the default configuration of a given algorithm on new instances. (We go on to consider making predictions for novel algorithm configurations in Sections 7 and 8.) For brevity, we only present representative empirical results. The full results of our experiments are available in an online appendix at http://www.cs.ubc.ca/labs/beta/Projects/EPMs. All of our data, features, and source code for replicating our experiments is available from the same site.
|RR||3.2||Ridge regression with 2-phase forward selection|
|SP||3.2||SPORE-FoBa (ridge regression with forward-backward selection)|
|NN||3.3||Feed-forward neural network with one hidden layer|
|PP||4.2||Projected process (approximate Gaussian process)|
|RT||3.5||Regression tree with cost-complexity pruning|
6.1 Instances and Solvers
For SAT, we used a wide range of instance distributions: INDU, HAND, and RAND are collections of industrial, handmade, and random instances from the international SAT competitions and races, and COMPETITION is their union; SWV and IBM are sets of software and hardware verification instances, and SWV-IBM is their union; RANDSAT is a subset of RAND containing only satisfiable instances. We give more details about these distributions in A.1. For all distributions except RANDSAT, we ran the popular tree search solver, Minisat 2.0 Eén & Sörensson (2004). For INDU, SWV and IBM, we also ran two additional solvers: CryptoMinisat Soos (2010) (which won SAT Race 2010 and received gold and silver medals in the 2011 SAT competition) and SPEAR Babić & Hutter (2007) (which has shown state-of-the-art performance on IBM and SWV with optimized parameter settings Hutter et al. (2007)). Finally, to evaluate predictions for local search algorithms, we used the RANDSAT instances, and considered two solvers: tnm Wei & Li (2009) (which won the random satisfiable category of the 2009 SAT Competition) and the dynamic local search algorithm SAPS Hutter et al. (2002) (a baseline).
For MIP, we used two instance distributions from computational sustainability (RCW and CORLAT), one from winner determination in combinatorial auctions (REG), two unions of these (CR := CORLAT RCW and CRR := CORLAT REG RCW), and a large and diverse set of publicly available MIP instances (BIGMIX). Details about these distributions are given in A.2. We used the two state-of-the-art commercial solvers CPLEX International Business Machines Corp. (2012) and Gurobi Gurobi Optimization Inc. (2012) (versions 12.1 and 2.0, respectively) and the two strongest non-commercial solvers, SCIP Berthold et al. (2012) and lp_solve Berkelaar et al. (2012) (versions 126.96.36.199 and 5.5, respectively).
For TSP, we used three instance distributions (detailed in A.3): random uniform Euclidean instances (RUE), random clustered Euclidean instances (RCE), and TSPLIB, a heterogeneous set of prominent TSP instances. On these instance sets, we ran the state-of-the-art systematic and local search algorithms, Concorde Applegate et al. (2006) and LK-H Helsgaun (2000). For the latter, we computed runtimes as the time required to find an optimal solution.
6.2 Experimental Setup
To collect algorithm runtime data, for each algorithm–distribution pair, we executed the algorithm using default parameters on all instances of the distribution, measured its runtimes, and collected the results in a database. All algorithm runs were executed on a cluster of 55 dual 3.2GHz Intel Xeon PCs with 2MB cache and 2GB RAM, running OpenSuSE Linux 11.1; runtimes were measured as CPU time on these reference machines. We terminated each algorithm run after one CPU hour; this gave rise to capped runtime observations, because for each run that was terminated in this fashion, we only observed a lower bound on the runtime. Like most past work on runtime modeling, we simply counted such capped runs as having taken one hour. (In Section 9 we investigate alternatives and conclude that a better treatment of capped runtime data improves predictive performance for our best-performing model.) Basic statistics of the resulting runtime distributions are given in Table 3; Table C.1 in the online appendix lists all the details.
We evaluated different model families by building models on a subset of the data and assessing their performance on data that had not been used to train the models. This can be done visually (as, e.g., in the scatterplots in Figure 4 on Page 4, which show cross-validated predictions for a random subset of up to 1 000 data points), or quantitatively. We considered three complementary quantitative metrics to evaluate mean predictions and predictive variances given true performance values . Root mean squared error (RMSE) is defined as ; Pearson’s correlation coefficient (CC) is defined as , where and denote sample mean and standard deviation of ; and log likelihood (LL) is defined as , where
denotes the probability density function (PDF) of a standard normal distribution. Intuitively, LL is the log probability of observing the true valuesunder the predicted distributions . For CC and LL, higher values are better, while for RMSE lower values are better. We used -fold cross-validation and report means of these measures across the folds. We assessed the statistical significance of our findings using a Wilcoxon signed-rank test (we use this paired test, since cross-validation folds are correlated).
6.3 Predictive Quality
|RMSE||Time to learn model (s)|
Ridge regression (RR)
Table 2 provides quantitative results for all benchmarks, and Figure 4 visualizes results. At the broadest level, we can conclude that most of the methods were able to capture enough about algorithm performance on training data to make meaningful predictions on test data, most of the time: easy instances tended to be predicted as being easy, and hard ones as being hard. Take, for example the case of predicting the runtime of Minisat 2.0 on a heterogeneous mix of SAT competition instances (see the leftmost column in Figure 2 and the top row of Table 2). Minisat 2.0 runtimes varied by almost six orders of magnitude, while predictions with the better models rarely were off by more than one order of magnitude (outliers may draw the eye in the scatterplot, but quantitatively, the RMSE for predicting runtime was low – e.g., 0.47 for random forests, which means an average misprediction of a factor of ). While the models were certainly not perfect, note that even the relatively poor predictions of ridge regression variant RR tended to be accurate within about an order of magnitude, which was enough to enable the portfolio-based algorithm selector SATzilla Xu et al. (2008) to win five medals in each of the 2007 and 2009 SAT competitions. (Switching to random forest models after 2009 further improved SATzilla’s performance Xu et al. (2012a).)
In our experiments, random forests were the overall winner among the different methods, yielding the best predictions in terms of all our quantitative measures.999For brevity, we only report RMSE values in the tables here; comparative results for correlation coefficients and log likelihoods, given in Table D.3 in the online appendix, are qualitatively similar. For SAT, random forests were always the best method, and for MIP they yielded the best performance for the most heterogeneous instance set, BIGMIX (see Column 2 of Figure 4). We attribute the strong performance of random forests on highly heterogeneous data sets to the fact that, as a tree-based approach, they can model very different parts of the data separately; in contrast, the other methods allow the fit in a given part of the space to be influenced more by data in distant parts of the space. Indeed, the ridge regression variants made extremely bad predictions for some outlying points on BIGMIX. For the more homogeneous MIP data sets, either random forests or projected processes performed best, often followed closely by ridge regression variant RR. The performance of CPLEX on set RCW was a special case in that it could be predicted extremely well by all models (see Column 3 of Figure 4). Finally, for TSP, projected processes and ridge regression had a slight edge for the homogeneous RUE and RCE benchmarks, whereas tree-based methods (once again) performed best on the most heterogeneous benchmark, TSPLIB. The last column of Figure 4 shows that, in the case where random forests performed worst, the qualitative differences in predictions were small. In terms of computational requirements, random forests were among the cheapest methods, taking between and seconds to learn a model.
6.4 Results based on Different Classes of Instance Feature
While the previous experiments focussed on the performance of the various models based on our entire feature set, we now study the performance of different subsets of features when using the overall best-performing model, random forests. Table 3 presents the results and also lists the cost of the various feature subsets (which in most cases is much smaller than the runtime of the algorithm being modeled). On the broadest level, we note that predictive performance improved as we used more computationally expensive features: e.g., while the trivial features were basically free, they yielded rather poor performance, whereas using the entire feature set almost always led to the best performance. Interestingly, however, for all SAT benchmarks, using at most moderately expensive features yielded results statistically insignificantly different from the best, with substantial reductions in feature computation time. The same was even true for several SAT benchmarks when considering at most cheap features. Our new features clearly showed value: for example, our cheap feature set yielded similar predictive performance as the set of previous features at a much lower cost; and our moderate feature set tended to yield better performance than the previous one at comparable cost. Our new features led to especially clear improvements for MIP, yielding significantly better predictive performance than the previous features in 11/12 cases. Similarly, for TSP, our new features improved performance significantly in 4/6 cases (TSPLIB was too small to achieve reliable results in the two remaining cases, with even the trivial features performing insignificantly worse than the best).
|Alg. runtime||RMSE||Avg. feature time [s]||Max. feature time [s]|