1 Introduction
Algorithms for solving hard combinatorial problems often rely on random choices and decisions to improve their performance. For example, randomization helps to escape local optima, enforces stronger exploration and diversifies the search strategy by not only relying on heuristic information. In particular, most local search algorithms are randomized
[Hoos and Stützle2004] and structured treebased search algorithms can also substantially benefit from randomization [Gomes et al.2000].The runtimes of randomized algorithms for hard combinatorial problems are wellknown to vary substantially, often by orders of magnitude, even when running the same algorithm multiple times on the same instance [Gomes et al.2000, Hoos and Stützle2004, Hurley and O’Sullivan2015]. Hence, the central object of interest in the analysis of a randomized algorithm on an instance is its runtime distribution (RTD), in contrast to a single scalar for deterministic algorithms. Knowing these RTDs is important in many practical applications, such as computing optimal restart strategies [Luby et al.1993], optimal algorithm portfolios [Gomes and Selman2001] and the speedups obtained by executing multiple independent runs of randomized algorithms [Hoos and Stützle2004].
It is trivial to measure an algorithm’s empirical RTD on an instance by running it many times to completion, but for new instances this is of course not practical. Instead, one would like to estimate the RTD for a new instance
without running the algorithm on it.There is a rich history in AI that shows that the runtime of algorithms for solving hard combinatorial problems can indeed be predicted to a certain degree [Brewer1995, Roberts and Howe2007, Fink1998, LeytonBrown et al.2009, Hutter et al.2014]. These runtime predictions have enabled a wide range of metaalgorithmic procedures, such as algorithm selection [Xu et al.2008], modelbased algorithm configuration [Hutter et al.2011], generating hard benchmarks [LeytonBrown et al.2009], gaining insights into instance hardness [SmithMiles and Lopes2012] and algorithm performance [Hutter et al.2013], and creating cheaptoevaluate surrogate benchmarks [Eggensperger et al.2018].
Given a method for predicting RTDs of randomized algorithms, all of these applications could be extended by an additional dimension. Indeed, predictions of RTDs have already enabled applications such as dynamic algorithm portfolios [Gagliolo and Schmidhuber2006b], adaptive restart strategies [Gagliolo and Schmidhuber2006a, Haim and Walsh2009], and predictions of the runtime of parallelized algorithms [Arbelaez et al.2016]. To advance the underlying foundation of these applications, in this paper we focus on better methods for predicting RTDs. Specifically, our contributions are as follows:

We compare different ways of predicting RTDs and demonstrate that neural networks (NNs) can jointly predict all parameters of various parametric RTDs, yielding RTD predictions that are superior to those of previous approaches (which predict the RTD’s parameters independently).

We propose DistNet, a practical NN for predicting RTDs, and discuss the bells and whistles that make it work.

We show that DistNet achieves substantially better performance than previous methods when trained only on a few observations per training instance.
2 Related Work
The rich history in predicting algorithm runtimes focuses on predicting mean runtimes, with only a few exceptions. huttercp06a (huttercp06a) predicted the single distribution parameter of an exponential RTD and arbelaezictai16a (arbelaezictai16a) predicted the two parameters of lognormal and shifted exponential RTDs with independent models. In contrast, we jointly predict multiple RTD parameters (and also show that the resulting predictions are better than those by independent models).
The work most closely related to ours is by gaglioloicann05a (gaglioloicann05a), who proposed to use NNs to learn a distribution of the time left until an algorithm solves a problem based on features describing the algorithm’s current state and the problem to be solved; they used these predictions to dynamically assign time slots to algorithms. In contrast, we use NNs to predict RTDs for unseen problem instances.
All existing methods for predicting runtime on unseen instances base their predictions on instance features that numerically characterize problem instances. In particular in the context of algorithm selection, these instance features have been proposed for many domains of hard combinatorial problems, such as propositional satisfiability [Nudelman et al.2004] and AI planning [Fawcett et al.2014]. To avoid this manual step of feature construction, loreggiaaaai16 (loreggiaaaai16) proposed to directly use the text format of an instance as the input to a NN to obtain a numerical representation of the instance. Since this approach performed a bit worse than manually constructed features, in this work we use traditional features, but in principle our framework works with any type of features.
3 Problem Setup
The problem we address in this work can be formally described as follows:
Problem Statement (Predicting RTDs).
Given

a randomized algorithm

a set of instances

for each instance :

instance features

runtime observations obtained by executing on with different seeds,

the goal is to learn a model that can predict ’s RTD well for unseen instances with given features .
Following the typical approach in the literature [Hutter et al.2006, Arbelaez et al.2016], we address this problem in two steps:

Determine a parametric family of RTDs with parameters that fits well across training instances;

Fit a machine learning model that, given a new instance and its features, predicts ’s parameters on that instance.
Figure 2 illustrates the pipeline we use for training these RTD predictors and using them on new instances.
3.1 Parametric Families of RTDs
Distribution  Param.  

Normal (N)  
Lognormal (LOG)  
Exponential (EXP)  
Inverse Normal (INV) 
We considered a set of
parametric probability distributions (shown in Table
1 with exemplary instantiations shown in Figure 2), most of which have been widely studied to describe the RTDs of combinatorial problem solvers [Frost et al.1997, Gagliolo and Schmidhuber2006a, Hutter et al.2006].First, we considered the Normal distribution (N) as a baseline, due to its widespread use throughout the sciences.
Since the runtimes of hard combinatorial solvers often vary on an exponential scale (likely due to the hardness of the problems studied), a much better fit of empirical RTDs is typically achieved by a lognormal distribution (LOG); this distribution is attained if the logarithm of the runtimes is normaldistributed and has been shown to fit empirical RTDs well in previous work [Frost et al.1997].
Another popular parametric family from the literature on RTDs is the exponential distribution (EXP), which tends to describe the RTDs of many wellbehaved stochastic local search algorithms well
[Hoos and Stützle2004]. It is the unique family with the property that the probability of finding a solution in the next time interval (conditional on not having found one yet) remains constant over time.By empirically studying a variety of alternative parametric families, we also found that an inverse Normal distribution (INV) tends to fit RTDs very well. By setting its parameter close to infinity, it can also be made to resemble a normal distribution. Like LOG and EXP, this flexible distribution can model the relatively long tails of typical RTDs of randomized combinatorial problem solvers quite well.
3.2 Quantifying the Quality of RTDs
To measure how well a parametric distribution with parameters fits our empirical runtime observations (the empirical RTD), we use the likelihood of parameters given all observations , which is equal to the probability of the observations under distribution with parameters :
(1) 
Consequently, when estimating the parameters of a given empirical RTD, we use a maximumlikelihood fit. For numerical reasons, as is common in machine learning, we use the negative loglikelihood (NLLH) as a loss function to be minimized:
(2) 
Since each instance results in an RTD, we measure the quality of a parametric family of RTDs for a given instance set by averaging over the NLLHs of all instances.
One problem of Eq. (2) is that it weights easy instances more heavily: if two RTDs have the same shape but differ in scale by a factor of 10 due to one instance being 10 times harder, the PDF for the easier instance is 10 times larger (to still integrate to 1). To account for that, for each instance, we multiply the likelihoods with the maximal observed runtime, and use the resulting metric to select a distribution family for a dataset at hand and to compare the performance of our models:
(3)  
(4) 
4 Joint Prediction of multiple RTD Parameters
Having selected a parametric family of distributions, the last part of our pipeline is to fit an RTD predictor for new instances as formally defined in Section 3. In the following, we briefly discuss how traditional regression models have been used for this problem, and why this optimizes the wrong loss function. We then show how to obtain better predictions with NNs and introduce DistNet for this task.
4.1 Generalizing from Training RTDs
A straightforward approach for predicting parametric RTDs based on standard regression models is to fit the RTD’s parameters for each training instance , and to then train a regression model on data points that directly maps from instance features to RTD parameters. There are two variants to extend these approaches to the problem of predicting multiple parameters of RTDs governed by parameters (e.g. and for LOG): (1) fitting independent regression models, or (2) fitting a multioutput model with outputs. These approaches have been used before based on Gaussian processes [Hutter et al.2006]
[Arbelaez et al.2016]and random forests
[Hutter et al.2014, Hurley and O’Sullivan2015]However, we note that these variants measure loss in the space of the distribution parameters as opposed to the true loss in Equation (2) and that both variants require fitting RTDs on each training instance, making the approach inapplicable if we, e.g., only have access to a few runs for each of a thousands of instances. Now, we show how NNs can be used to solve both of these problems.
4.2 Predictions with Neural Networks
NNs have recently been shown to achieve stateoftheart performance for many supervised machine learning problems as large data sets became available, e.g., in image classification and segmentation, speech processing and natural language processing. For a thorough introduction, we refer the interested reader to goodfellowmit16a (goodfellowmit16a). Here, we apply NNs to RTD prediction.
Background on Neural Networks.
NNs can approximate arbitrary functions by defining a mapping where are the weights to be learnt during training to approximate the function. In this work we use a fullyconnected feedforward network, which can be described as an acyclic graph that connects nonlinear transformations in a chain, from layer to layer. For example, a NN with two hidden layers that predicts for some input can be written as:^{1}^{1}1We ignore bias terms for simplicity of exposition.
(5) 
with denoting trainable network weights and
(the socalled activation function) being a nonlinear transformation applied to the weighted outputs of the
th layer. We use the activation function for to constrain all outputs to be positive.NNs are usually trained with stochastic gradient descent (SGD) methods using backpropagation to effectively obtain gradients of a taskspecific loss function for each weight.
Neural Networks for predicting RTDs.
We have one input neuron for each instance feature, and we have one output neuron for each distribution parameter. To this end, we assume that we know the bestfitting distribution family from the previous step of our pipeline.
We train our networks to directly minimize the NLLH of the predicted distribution parameters given our observed runtimes. Formally, for a given set of observed runtimes and instance features, we minimize the following loss function in an endtoend fashion:
(6) 
Here, denotes the values of the distribution parameters obtained in the output layer given an instantiation of the NN’s weights. This optimization process, which targets exactly our loss function of interest (Eq. (2)), allows to effectively predict all distribution parameters jointly. Since predicted combinations are judged directly by their resulting NLLH, the optimization process is driven to find combinations that work well together. This endtoend optimization process is also more general as it removes the need of fitting an RTD on each training instance and thereby enables using an arbitrary set of algorithm performance data for fitting the model.
DistNet: RTD predictions with NNs in practice.
Unfortunately, training an accurate NN in practice can be tricky and requires manual attention to many details, including the network architecture, training procedure, and other hyperparameter settings.
Specifically, to preprocess our runtime data , we performed the following steps:

We removed all (close to) constant features.

For each instance feature type, we imputed missing values (caused by limitations during feature computation) by the median of the known instance features.

We scaled the observed runtimes in a range of by dividing it by the maximal observed runtime across all instances. This also helps the NN training to converge faster.
For training DistNet, we considered the following aspects:

Our first networks tended to overfit the training data if the training data set was too small and the network too large. Therefore we chose a fairly small NN with hidden layers each with neurons. In preliminary experiments we found that larger networks tend to achieve slightly better performance on our largest datasets, but we decided to strive for simplicity.

We considered each runtime observation as an individual data sample.

We shuffled the runtime observations (as opposed to, e.g., using only data points from a single instance in each batch) and used a fairly small batch size of to reduce the correlation of the training data points in each batch.

Our loss function can have very large gradients because slightly suboptimal RTD parameters can lead to likelihoods close to zero (or a very large NLLH). Therefore, we used a fairly small initial learning rate of exponentially decaying to
and used gradient clipping
[Pascanu et al.2014] on top of it.
Besides that, we used common architectural and parameterization choices: tanh
as an activation function, SGD for training, batch normalization, and a
regularization of . We call the resulting neural network DistNet.5 Experiments
In our experiments, we study the following research questions:
 Q1

Which of the parametric RTD families we considered best describe the empirical RTDs of the SAT and AI planners we study?
 Q2

How do DistNet’s joint predictions of RTD parameters compare to those of popular random forest models?
 Q3

Can DistNet learn to predict entire RTDs based on training data that only contains a few observed runtimes for each training instance?
5.1 Experimental Setup
Scenario  #instances  #features  cutoff [sec] 

Claspfactoring  2000  102  5000 
SapsCVVAR  10011  46  60 
SpearQCP  8076  91  5000 
YalSATQCP  11747  91  5000 
SpearSWGCP  11182  76  5000 
YalSATSWGCP  11182  76  5000 
LPGZenotravel  3999  165  300 
We focus on predicting the RTDs of wellstudied algorithms, each evaluated on a different set of problem instances from two different domains:
 Claspfactoring

is based on the treebased CDCL solver Clasp [Gebser et al.2012] which we ran on SATencoded factorization problems instances.
 SapsCVVAR

is based on the local search SAT solver Saps [Hutter et al.2002]. The SAT instances are randomly generated with a varying clausevariable ratio.
 SpearSWGCP/YalSATSWGCP

are based on the treesearch SAT solver Spear [Babić and Hutter2007] and on the local search SAT solver YalSAT [Biere2014], a combination of different variants of ProbSAT [Balint and Schöning2012]. The instances are SATencoded small world graph coloring problems [Gent et al.1999].
 SpearQCP/YalSATQCP

are based on the same solvers as SpearSWGCP and YalSATSWGCP. The SAT instances encode quasigroup completion instances [Gomes and Selman1997].
 LPGZenotravel

is based on the local search AIplanning solver LPG [Gerevini and Serina2002]. The instances are from the zenotravel planning domain [Penberthy and Weld1994], which arise in a version of route planning.
To gather training data, we ran each algorithm (with default parameters) with different seeds on each instance^{4}^{4}4We removed instances for which no instance features could be computed and only considered instances which could always be solved within a cutoff limit.. This resulted in the 7 datasets shown in Table 2.
We used the opensource neural network library keras [Chollet et al.2015] for our NN, scikitlearn [Pedregosa et al.2011] for the RF implementation, and scipy [Jones et al.2001] for fitting the distributions. ^{5}^{5}5Code and data can be obtained from here:
http://www.ml4aad.org/distnet
5.2 Q1: Best RTD Families
Claspfactoring  SapsCVVAR  LPGZenotravel  
SpearQCP  YalSATQCP  SpearSWGCP  YalSATSWGCP 
Figure 3 shows some exemplary CDFs of our empirical RTDs; each line is the RTD on one of the instances. The different algorithms’ RTDs show different characteristics with most instances having a long right tail and a short left tail. The RTDs of Claspfactoring in contrast have a short left and right tail. Also, for some instances from SapsCVVAR and SpearSWGCP the runtimes were very short and similar, causing almost vertical CDFs.
Table 3 shows a quantitative evaluation of the different RTD families we considered. Next to the normalized NLLH (see Equation (4
)), we followed arbelaezictai16a (arbelaezictai16a) and evaluated the KolmogorovSmirnov (KS) test as a goodness of fit statistical test. The KSstatistic is based on the maximal distance between an empirical distribution and the cumulative distribution function of a reference distribution. To aggregate the test results across instances, we count how often the KStest rejected the nullhypothesis that our measured
are drawn from a reference RTD.KS: (%  

Claspfactoring  INV  LOG  N  EXP  INV  LOG  N  EXP 
0.35  0.35  0.29  0.29  12.0  10.2  15.0  100  
SapsCVVAR  LOG  INV  N  EXP  LOG  INV  N  EXP 
0.88  0.88  0.75  0.26  0.1  4.0  20.1  87.5  
SpearQCP  LOG  EXP  INV  N  LOG  EXP  INV  N 
1.20  1.14  1.10  0.41  1.1  22.6  52.1  99.3  
YalSATQCP  LOG  INV  EXP  N  LOG  INV  EXP  N 
0.78  0.78  0.66  0.32  0.0  6.8  46.5  80.1  
SpearSWGCP  LOG  INV  EXP  N  LOG  INV  EXP  N 
0.93  0.90  0.71  0.41  15.2  26.7  24.5  79.0  
YalSATSWGCP  LOG  INV  EXP  N  LOG  INV  EXP  N 
0.94  0.91  0.89  0.30  0.0  25.3  14.0  98.0  
LPGZenotravel  LOG  INV  N  EXP  LOG  INV  N  EXP 
0.90  0.90  0.62  0.08  12.7  20.2  79.2  100 
Overall, the best fitted distributions closely resembled the true empirical RTDs, with a rejection rate of the KStest of at most 15.2%. Hence, on most instances the best fitted distributions were not statistically significantly different from the empirical ones. For most scenarios the lognormal distribution (LOG) performed best, closely followed by the inverse Normal (INV). The Normal (N) and exponential (EXP) distributions performed worse for all scenarios. On
SpearSWGCP, the KStest showed the most statistically significant differences for the best fitting distribution since the RTDs for some instances only have a small variance (see Figure 3) and cannot be perfectly approximated by the distributions we considered. Still, these distributions achieved good NLLH values.5.3 Q2: Predicting RTDs
Next, we turn to the empirical evaluation of our DistNet and compare it to previous approaches. Since random forests (RFs) have been shown to perform very well for standard runtime prediction tasks [Hutter et al.2014, Hurley and O’Sullivan2015], we experimented with them in two variants: fitting a multioutput RF (mRF) and fitting multiple independent RFs, one for each distribution parameter (iRF). We trained DistNet as described in Section 4.2
and limit the training to take at most 1h or 1000 epochs, whichever was less. As a gold standard, we report the NLLH obtained by a maximum likelihood fit to the empirical RTD (”fitted” in Table
3).Scenario  dist  fitted  iRF  mRF  DistNet  

Claspfactoring  INV  train  0.35  0.26  0.28  0.24 
test  0.35  0.04  0.09  0.16  
LOG  train  0.35  0.30  0.30  0.24  
test  0.35  0.14  0.13  0.14  
SapsCVVAR  LOG  train  0.88  0.66  0.68  0.54 
test  0.88  0.99  0.29  0.52  
INV  train  0.88  0.46  0.57  0.54  
test  0.88  0.22  0.09  0.54  
SpearQCP  LOG  train  1.20  1.09  1.13  1.11 
test  1.20  1.00  0.96  1.10  
EXP  train  1.14  1.05  1.05  0.93  
test  1.14  0.88  0.88  0.91  
YalSATQCP  LOG  train  0.78  0.50  0.77  0.76 
test  0.78  0.49  0.74  0.75  
INV  train  0.78  0.68  0.77  0.74  
test  0.78  0.66  0.73  0.74  
SpearSWGCP  LOG  train  0.93  2.46  0.23  0.48 
test  0.93  0.82  0.26  0.47  
INV  train  0.90  3.60  3.32  0.33  
test  0.90  3.27  2.58  0.32  
YalSATSWGCP  LOG  train  0.94  0.81  0.88  0.81 
test  0.94  0.69  0.71  0.81  
INV  train  0.91  0.80  0.86  0.76  
test  0.91  0.68  0.69  0.76  
LPGZenotravel  LOG  train  0.90  0.89  0.89  0.85 
test  0.90  0.85  0.84  0.85  
INV  train  0.90  0.84  0.87  0.84  
test  0.90  0.72  0.80  0.84 
Table 4 shows the NLLH achieved using a fold crossvalidation, i.e., we split the instances into disjoint sets, train our models on all but one subset and measure the test performance on the left out subset. We report the average performance (see Eq.(4)) on train and test data across all splits for the two best fitting distributions from Table 3.
Overall our results show that it is possible to predict RTD parameters for unseen instances, and that DistNet performed best. For out of datasets, our models achieved a NLLH close to the gold standard of fitting the RTDs to the observed data. Also, for out of datasets both distribution families were similarly easy to predict for all models.
For the RFbased models, we observed slight overfitting for most scenarios. For DistNet, we only observed this on the smallest data set: Claspfactoring. On SpearSWGCP, both the iRF and mRF yielded poor performance for both distributions as they failed to predict distribution parameters for instances with a very short runtime and thus receive a high NLLH on these instances. In general the iRF yielded worse performance than mRF demonstrating that distribution parameter should be learned jointly. Overall, DistNet yielded the most robust results. It achieved the best test set predictions for all cases, sometimes with substantial improvements over the RF baselines.
5.4 Q3: DistNet on a Low Number of Observations
Finally, we evaluated the performance of DistNet wrt. the number of observed runtimes per instance. Fewer observations per instance result in smaller training data sets for DistNet, whereas the data set size for the mRF stays the same with the distribution parameters being computed on fewer samples. We evaluated DistNet and mRF in the same setup as before, using a fold crossvalidation, but repeating each evaluation times with a different set of sampled observations. Figure 4 reports the achieved NLLH for LOG as a function of the number of training samples for two representative scenarios. We observed similar results for all scenarios.
Overall, our results show that the predictive quality of mRF relies on the quality of the fitted distributions used as training data whereas DistNet can achieve better results as it directly learns from runtime observations. On LPGZenotravel, for which all models performed competitively when using observations (see Table 4), DistNet achieved a better performance with fewer data converging to a similar NLLH value as the RF when using all available observations.
On the larger dataset, YalSATQCP, DistNet converged with samples per instance yielding a predictive performance better than that of mRFs with 100 iterations. Collecting only 16 instead of 100 samples would speed up the computation by more than 6fold, which, e.g., for YalSATQCP would save almost 2,000 CPU hours.
LPGZenotravel  YalSATQCP  

6 Conclusion and Future Work
In this paper we showed that NNs can be used to jointly learn distribution parameters to predict RTDs. In contrast to previous RF models, we train our model on individual runtime observations, removing the need to first fit RTDs on all training instances. More importantly, our NN – which we dub DistNet – directly optimizes the loss function of interest in an endtoend fashion, and by doing so obtains better predictive performance than previouslyused RF models that do not directly optimize this loss function. Because of that our model can learn meaningful distribution parameters from only few observations per training instance and therefore requires only a fraction of training data compared to previous approaches.
Overall, our methodology allows for better RTD predictions and therefore may pave the way for improving many applications that currently rely mostly on mean predictions only, such as, e.g., algorithm selection and algorithm configuration.
Currently, our method assumes large homogeneous instance sets without censored observations. We consider further extensions as future work, such as handling censored observations (e.g., timeouts) in the loss function of our NN [Gagliolo and Schmidhuber2006a], using a mixture of models [Jacobs et al.1991]
to learn different distribution families, or studying nonparametric models (which can fit arbitrary distributions and thus do not require prior knowledge of the type of runtime distribution). Finally, in many applications the algorithm’s configuration is a further important dimension for predicting RTDs, and we therefore plan to handle this as an additional input in future versions of DistNet.
Acknowledgements
The authors acknowledge funding by the DFG (German Research Foundation) under Emmy Noether grant HU 1900/21 and support by the state of BadenWürttemberg through bwHPC and the DFG through grant no INST 39/9631 FUGG. K. Eggensperger additionally acknowledges funding by the State Graduate Funding Program of BadenWürttemberg.
References
 [Arbelaez et al.2016] A. Arbelaez, C. Truchet, and B. O’Sullivan. Learning sequential and parallel runtime distributions for randomized algorithms. In Proc. of ICTAI’16, pages 655–662, 2016.
 [Babić and Hutter2007] D. Babić and F. Hutter. Spear theorem prover, 2007. Solver description, SAT competition.
 [Balint and Schöning2012] A. Balint and U. Schöning. Choosing probability distributions for stochastic local search and the role of make versus break. In Proc. of SAT’12, pages 16–29, 2012.
 [Biere2014] A. Biere. Yet another local search solver and lingeling and friends entering the SAT competition 2014. In Proc. of SAT Competition 2014, pages 39–40, 2014.
 [Brewer1995] E. Brewer. Highlevel optimization via automated statistical modeling. In ACM SIGPLAN Notices, pages 80–91, 1995.
 [Chollet et al.2015] Chollet et al. Keras. https://github.com/fchollet/keras, 2015.
 [Eggensperger et al.2018] K. Eggensperger, M. Lindauer, H. Hoos, F. Hutter, and K LeytonBrown. Efficient benchmarking of algorithm configuration procedures via modelbased surrogates. Machine Learning, 107:15–41, 2018.
 [Fawcett et al.2014] C. Fawcett, M. Vallati, F. Hutter, J. Hoffmann, H. Hoos, and K. LeytonBrown. Improved features for runtime prediction of domainindependent planners. In Proc. of ICAPS’14, pages 355–359, 2014.
 [Fink1998] E. Fink. How to solve it automatically: Selection among problemsolving methods. In Proc. of ICANN’08, pages 128–136, 1998.
 [Frost et al.1997] D. Frost, I. Rish, and L. Vila. Summarizing CSP hardness with continuous probability distributions. In Proc. of AAAI’97, pages 327–333, 1997.
 [Gagliolo and Schmidhuber2005] M. Gagliolo and J. Schmidhuber. A neural network model for interproblem adaptive online time allocation. In Proc. of ICANN’05, pages 752–752, 2005.
 [Gagliolo and Schmidhuber2006a] M. Gagliolo and J. Schmidhuber. Impact of censored sampling on the performance of restart strategies. In Proc. of CP’06, pages 167–181, 2006.
 [Gagliolo and Schmidhuber2006b] M. Gagliolo and J. Schmidhuber. Learning dynamic algorithm portfolios. AMAI, 47(34):295–328, 2006.
 [Gebser et al.2012] M. Gebser, B. Kaufmann, and T. Schaub. Conflictdriven answer set solving: From theory to practice. AI, 187188:52–89, 2012.
 [Gent et al.1999] I. Gent, H. Hoos, P. Prosser, and T. Walsh. Morphing: Combining structure and randomness. In Proc. of AAAI’99, pages 654–660, 1999.
 [Gerevini and Serina2002] A. Gerevini and I. Serina. LPG: A planner based on local search for planning graphs with action costs. In Proc. of AIPS’02, pages 13–22, 2002.
 [Gomes and Selman1997] C. Gomes and B. Selman. Problem structure in the presence of perturbations. In Proc. of AAAI’97, pages 221–226, 1997.
 [Gomes and Selman2001] C. Gomes and B. Selman. Algorithm portfolios. AIJ, 126(12):43–62, 2001.
 [Gomes et al.2000] C. Gomes, B. Selman, N. Crato, and H. Kautz. Heavytailed phenomena in satisfiability and constraint satisfaction problems. JAR, 24:67–100, 2000.
 [Goodfellow et al.2016] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
 [Haim and Walsh2009] S. Haim and T. Walsh. Restart strategy selection using machine learning techniques. In Proc. of SAT’09, pages 312–325, 2009.
 [Hoos and Stützle2004] H. Hoos and T. Stützle. Stochastic Local Search: Foundations & Applications. Morgan Kaufmann Publishers Inc., 2004.
 [Hurley and O’Sullivan2015] B. Hurley and B. O’Sullivan. Statistical regimes and runtime prediction. In Proc. of IJCAI’15, pages 318–324, 2015.
 [Hutter et al.2002] F. Hutter, D. Tompkins, and H. Hoos. Scaling and probabilistic smoothing: Efficient dynamic local search for SAT. In Proc.of CP’02, pages 233–248, 2002.
 [Hutter et al.2006] F. Hutter, Y. Hamadi, H. Hoos, and K. LeytonBrown. Performance prediction and automated tuning of randomized and parametric algorithms. In Proc. of CP’06, pages 213–228, 2006.
 [Hutter et al.2011] F. Hutter, H. Hoos, and K. LeytonBrown. Sequential modelbased optimization for general algorithm configuration. In Proc. of LION’11, pages 507–523, 2011.
 [Hutter et al.2013] F. Hutter, H. Hoos, and K. LeytonBrown. Identifying key algorithm parameters and instance features using forward selection. In Proc. of LION’13, pages 364–381, 2013.
 [Hutter et al.2014] F. Hutter, L. Xu, H. Hoos, and K. LeytonBrown. Algorithm runtime prediction: Methods and evaluation. AIJ, 206:79–111, 2014.
 [Jacobs et al.1991] R. Jacobs, M. Jordan, S. Nowlan, and G. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.
 [Jones et al.2001] E. Jones, T. Oliphant, and P. Peterson et al. SciPy: Open source scientific tools for Python. http://www.scipy.org/, 2001.
 [LeytonBrown et al.2009] K. LeytonBrown, E. Nudelman, and Y. Shoham. Empirical hardness models: Methodology and a case study on combinatorial auctions. Journal of ACM, 56(4):1–52, 2009.
 [Loreggia et al.2016] A. Loreggia, Y. Malitsky, H. Samulowitz, and V. Saraswat. Deep learning for algorithm portfolios. In Proc. of AAAI’16, pages 1280–1286, 2016.
 [Luby et al.1993] M. Luby, A. Sinclair, and D. Zuckerman. Optimal speedup of las vegas algorithms. Inf. Process. Lett., pages 173–180, 1993.
 [Nudelman et al.2004] E. Nudelman, K. LeytonBrown, A. Devkar, Y. Shoham, and H. Hoos. Understanding random SAT: Beyond the clausestovariables ratio. In Proc. of CP’04, pages 438–452, 2004.

[Pascanu et al.2014]
R. Pascanu, T. Mikolov, and Y. Bengio.
On the difficulty of training recurrent neural networks.
In Proc. of ICML’13, pages 1310–1318, 2014.  [Pedregosa et al.2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikitlearn: Machine learning in Python. JMLR, 12:2825–2830, 2011.
 [Penberthy and Weld1994] J. Penberthy and D. Weld. Temporal planning with continuous change. In Proc. of AAAI’94, pages 1010–1015, 1994.
 [Roberts and Howe2007] Mark Roberts and Adele Howe. Learned models of performance for many planners. In ICAPS 2007 Workshop AI Planning and Learning, 2007.

[SmithMiles and Lopes2012]
K. SmithMiles and L. Lopes.
Measuring instance difficulty for combinatorial optimization problems.
Computers and Operations Research, 39(5):875–889, 2012.  [Xu et al.2008] L. Xu, F. Hutter, H. Hoos, and K. LeytonBrown. SATzilla: Portfoliobased algorithm selection for SAT. JAIR, 32:565–606, 2008.
Comments
There are no comments yet.