Because optimization algorithms are usually highly parameterized, algorithm parameter tuning/configuration is an important task. Given a distribution of problem instances, we need to find parameter configurations that optimize a predefined performance measure over the distribution, such as mean of optimality gap. For the last fifteen years, automated algorithm configuration has been extensively studied . General-purpose automated algorithm configuration tools such as SMAC  and irace  have been successfully applied in several studies.
In this work, we consider the parameter tuning problem of a multi-neighborhood local search algorithm , which consists of a large number of neighborhoods. The algorithm is the winner of the Verolog Solver Challenge 2014 . At each iteration, a neighbor solution is generated by a randomly chosen neighborhood with a probability defined by a weight value in the range of [0,1]. Weights of all neighborhoods are fixed before the algorithm runs, and are considered as algorithm parameters. Given a set of six (large) instances provided by the challenge, automated algorithm configuration tools can be used to tune the algorithm parameters. However, the large number of parameters (28 real parameters for the weight values and 2 integer parameters for the local search) might deteriorate the tuning tool’s efficiency, especially in our case where each run of the algorithm is not computationally cheap (600 seconds per run for each instance). A potential solution is to cluster neighborhoods into groups and assign a common weight value to each. It can help to reduce the algorithm configuration space, hoping to make use of available tuning budgets more efficiently. The key question raised while doing such a clustering is how to characterize each neighborhood behaviours over a set of instances and represent them as a feature vector. In this paper, we propose a method to do so. This method is problem-independent and does not depend on any specific local search. Moreover, it can be done during stages of algorithm development, e.g., testing, manual/automated tuning.
This paper is organized as follows. We describe the tuning problem in more detail in section 2. The method for characterizing neighborhoods’ behaviours and clustering them is explained in section 3. Section 4 shows the advantage of using clustering in automated parameter tuning and experimental results. Finally, section 5 gives conclusion and discussion on future work.
2 Parameter tuning for a multi-neighborhood local search algorithm
The algorithm considered in this work, which was developed by CODeS group’s members of the University of Leuven (Belgium) , tackles the Swap-body Vehicle Routing problem. It is an iterated local search  algorithm that uses late acceptance hill climbing  as the local search component. At each iteration of the late acceptance hill climbing, a neighborhood is randomly chosen from a large set of neighborhoods, and a neighbor solution is generated according to . The probability that a neighborhood is chosen is proportional to its weight value . These weight values are fixed during each algorithm’s run, and sum up to one. In addition, there are two integer parameters that control the late acceptance hill climbing: this local search component is stopped after a number of consecutive iterations without any improvement on the current solution, and the parameter represents the size of the saved memory.
The algorithm consists of 42 neighborhoods, which were generated from 18 neighborhood types. Some of them are specially designed for the Swap-body Vehicle Routing problem (e.g., Convert-to-sub-route) while the others are taken from the Vehicle Routing Problem literature (e.g., Cheapest-insertion). Some neighborhood types can be parameterized by their sizes. For example: the size of a Cheapest-insertion neighborhood is defined by the number of customers that will be removed and re-inserted back into the current solution. We can have a small Cheapest-insertion neighborhood with the size of 2, and a large Cheapest-insertion neighborhood with the size of 25.
Intuition can be used to reduce the number of weights to 28: some neighborhoods that belong to the same neighborhood type and have similar sizes can be grouped into one. The list of neighborhood types and their groups of sizes are listed in Table 1. Parameter tuning is done on six (large) problem instances (large_normal, large_with, large_without, new_normal, new_with, new_without) provided by the competition. An algorithm run on each instance takes 600 seconds. Note that the algorithm considered in this paper is actually not the same as the one that won the competition. The winning one is multi-threaded (4 independent parallel runs) while the one we use here is single-threaded. This is because the aim of our work is not to beat the winning algorithm, but to use this case study as a proof of concept for our characterization method.
|Cheapest-insertion||1; 2; 3; 4; 5|
|Remove-chain||1; 2; 3; 4|
|Ejection-chain||3; 4; 5|
3 Neighborhood characterization and clustering
Inspired by the idea from OSCAR , which is an automated approach for online selection of algorithm portfolio, we characterize each neighborhood ’s behaviours on an instance based on the following six observables:
Probabilities that improves, worsens or does nothing on a solution of , denoted as , , , where
Magnitudes of improvement and worsening, denoted as and
’s running time (used for tie-breaking, as explained in section 3.3)
The novelty of our method is that we will represent
using the estimated values of those observables on differentsolution quality regions, as they reflect changes of ’s behaviours according to the hardness of the solution that it is dealing with. An illustration of such changes of , and for four neighborhoods on an instance is visualized in Figure 1. The -axis represents solution quality (the larger the value, the better the corresponding solution is) and the -axis represents values of the three observables. In order to draw those plots, we divide the range of solution quality into intervals, collect necessary information during algorithm runs, and group every ten intervals into one. Details on how to collect information for such a visualization will be described in section 3.1. In Figure 1c, we can see that when the solution quality is low, i.e., the local search is in easy-to-improve region of the solution quality space, the Merge-route neighborhood has a very high probability of improving the solution it is tackling. This probability drastically decreases when the neighborhood reaches a good solution quality region, and the probability of worsening the current solution starts reaching one from that point. On the other hand, the Remove-route neighborhood in Figure 1d shows a similar behaviour in the low-solution quality region. However, in the good-solution quality region, the neighborhood tends to preserve the quality of the current solution rather than worsen it. Even neighborhoods belonging to the same neighborhood type can behave differently in different regions. As shown in Figure 1a and 1b, the small Cheapest-insertion neighborhood with size 2 has a much smaller probability of worsening a solution in the hard-to-improve region compared with the large Cheapest-insertion neighborhood with size 25.
In the rest of this section, we introduce four steps to characterize and cluster neighborhoods. Firstly, necessary information is collected during algorithm’s runs. Then solution quality regions are automatically identified. Next, collected information on each region is aggregated to build neighborhoods’ feature vectors. Finally, we carry on cluster analysis.
3.1 Collect necessary information during algorithm runs
In this part, we describe the procedure of collecting all necessary information for characterizing neighborhood behaviours. Given a problem instance, we assume that an upper bound and a lower bound of the optimal solution quality are available. Since these bounds do not need to be tight, this assumption is not hard to be satisfied. For example, the upper bound could be obtained from a random solution or a solution generated by some greedy algorithm and the lower bound could result from solving a linear programming relaxation of the problem. In the algorithm considered in this work, the initial solution for each instance is produced by creating one route for each customer. We take that solution’s value as the upper bound. A lower bound for each instance is provided by the authors of the algorithm, as the best solution obtained from running their best algorithm configuration (the multi-threaded version) in six hours.
We divide the range between the upper bound and the lower bound into a large number of small intervals (here we set it as ). Because higher quality solutions in general are harder to improve, we let the size of the intervals decrease exponentially. Each next interval has a size the size of the previous interval.
Now every time a neighborhood is applied on a solution that has quality value belonging to an interval , the following values are accumulatively collected for the pair of (, ):
: the number of times is applied,
, , : the numbers of times improves, does nothing, or worsens solutions, respectively,
, : sums of the amount of improvement and worsening,
: sum of ’s running time.
Since the collection of these values is independent of algorithm configuration, it can be done during any algorithm runs, such as during testing, manual parameter tuning, or automated algorithm tuning. The more runs there are, the better the estimated values of the observables. In this work, we collect them from running two algorithm configurations on all instances, with 10 runs per instance, so the total number of algorithm runs is 240. We use a little bit longer running time (900 seconds per run) to make sure that the collected information can cover hard parts of the solution quality space.
3.2 Identify solution quality regions as frames
Intervals are grouped into frames based on , the sum of all neighborhoods’ values on each interval. Figure 2 shows plots of on each instance. Note that because lower bounds on solution quality are not reached, intervals with zero at the end are removed. In this figure, there is a high peak in every plot, representing the interval where the algorithm stays most of the time. We thus conjecture that local optima or plateau should lie there. We can interpret the solution quality regions with low values before that peak as easy-to-reach and easy-to-escape, whereas regions around that peak as easy-to-reach and hard-to-escape and regions after that peak as hard-to-reach. The smaller peaks of two instances new_with and large_with should indicate second local optima or plateau. We propose Algorithm 1 for grouping intervals into regions (frames) that tries to reflect such an interpretation.
Figure 3 shows identified frames with , which is used in our experiments, on the six provided instances.
3.3 Characterize neighborhood behaviours as feature vectors by aggregating collected information into frames
For the first three observables, , and , we just simply sum the three values , and for all intervals belonging to the same frame. We then divide them by the sum of to get the ratios. For the other two observables and , aggregation is more complicated. We can not sum or values over intervals and get the average due to the fact that their values are incomparable among different intervals. For example, we cannot say that an amount of improvement of in the two intervals and are equal since hardness of solutions belonging to those is probably not the same. Therefore, we translate them into ranks before doing aggregation. For each interval, neighborhoods are ranked based on the averages of their corresponding , values. Because ties can happen, e.g., some neighborhoods might never make any improvement in the hard solution quality regions, the average value of in the corresponding interval is used for tie-breaking. Since the intervals are fine, the resulting ranked lists are possibly:
noisy: at some intervals, of some neighborhoods might be very small, so that their estimated values of and might be inaccurate.
partial: of some neighborhoods might be equal to zero at some intervals, i.e., we do not have information of those neighborhoods in those intervals.
Therefore, we aggregate them using the R package RobustRankAggreg - a robust ranking aggregation method  specially designed for similar situations in bioinformatics. Eventually, for each neighborhood, we have a feature vector composing of 150 components, which is a combination of 5 observables, 5 frames and 6 instances.
3.4 Cluster analysis on neighborhoods
The first three observables, , and , sum up to one. As a result, their corresponding vector components belong to a special class named compositional data. As explained in  “sample space for compositional vectors is radically different from the real Euclidean space associated with unconstrained data”, multivariate statistical methodology designed for unconstrained data could not be applied directly. To convert them back to the Euclidean space, we apply the isometric log-ratio transformation proposed in . After the transformation, since the three observables are reduced to two, each feature vector is now 120-dimensional. We can start doing cluster analysis on neighborhoods based on those vectors. Since the number of dimensions () is larger than the number of individuals (
), the clustering method High-Dimensional Data Clustering (HDDC)), which is implemented in the R package HDclassif , is used for cluster analysis. This method has two desirable properties: the ability of dealing with high-dimensional low-sample data, and the optimal number of clusters automatically decided based on Bayesian Information Criterion. In the end, 42 neighborhoods are grouped into 9 clusters:
Ejection-chain 3, 4, 5; Remove-chain 1, 2, 3, 6, 7, 8; Remove-sub-route-with-cheapest-insertion;
Cheapest-insertion 10, 15, 20, 25, 35, 50; Each-sequence-cheapest-insertion (2,5), (4,4), (5,2); Remove-chain 4
Cheapest-insertion 1, 2, 3, 4, 5
Ejection-chain 10, 15, 35; Remove-chain 5; Intra-route-two-opt
Ruin-recreate 2, 3
Convert-to-route; Remove-sub-route; Remove-route; Split-to-sub-route
It might be interesting to have a look at some of the resulting clusters. The two neighborhoods Merge-route and Remove-route behaves quite differently in the second-half region as shown in Figure 1, and they are indeed clustered into two different groups. By taking a look into the neighborhoods’ implementation, we know that Add-sub-route and Convert-to-sub-route have an extreme behaviour when compared to the others: they will add an additional cost into the current solution and worsen it most of the time (it can also be seen in plots of their observables, which are similar to the ones shown in Figure 1). We can say that the cluster analysis does recognize this extremeness, as the two neighborhoods are grouped into a separated cluster. In addition to reflecting knowledge that can be guessed by looking at the neighborhoods’ implementation, the cluster analysis also does some grouping that is not intuitive from the neighborhoods’ structure, e.g., the grouping of Ejection-chain 10,15,35 and Intra-route-two-opt.
4 Experimental results
Our hypothesis is that the proposed characterization method does reflect neighborhood behaviours on the given set of instances, so that the generated feature vectors should correctly represent the neighborhoods and the clusters we obtained are meaningful. To test this hypothesis, we applied the automated tuning tool SMAC  to two configuration scenarios: the first one, dubbed basic, uses the 28 groups of neighborhoods described in Table 1, the second one, dubbed clustered uses the 9 clusters of neighborhoods generated from our characterization method. We carried out 18 runs of SMAC on each scenario. Each one has a budget of 2000 algorithm runs (13.9 CPU days). Due to the large CPU time each SMAC run requires, we use the shared-model-mode offered by SMAC with 10 cores (walltime is 1.39 days), and take the configuration which has the best training performance as the final one. Mean of optimality gaps (in percentage) on the instance set is used as tuning performance measure. Optimality gap on each instance is calculated by:
where is provided by the algorithm’s authors, and is the best solution cost obtained after running the multi-threaded version of the algorithm on the corresponding instance in 6 hours.
The best algorithm configuration from each SMAC run is evaluated using test performance, which is the mean of optimality gaps obtained from 30 runs of the configuration on the instance set (5 runs/instance). Box-plots of the 18 SMAC runs on each scenario are shown in Figure 4, in which the clustered scenario offers advantage over the basic scenario. A paired t-test
paired t-testis conducted and gives a of , indicating statistical significance.
In the hyper-heuristic community, in particular the Selection Hyper-heuristic class, in which the aim is to manage a set of low-level heuristics during the search by selecting one of them at each iteration, the Simple Random (SR) heuristic selection mechanism is often used as a baseline. In our setting, SR is equivalent to the parameter configuration that has identical weights for all neighborhoods. It will be interesting to compare SR with the resulting configurations obtained from the off-line tuning: for each scenario, the 18 best tuned configurations are taken and the neighborhood weights inside them are set to identical. Their test performance values are shown as basic with identical weights and clustered with identical weights in Figure 5. The horizontal line represents test performance of the algorithm configuration in which neighborhood weights are identical and laList and itIW are set to values recommended by the algorithm’s authors. This configuration is also used as the default configuration for all SMAC runs mentioned above. We can see that the SR versions in both scenarios give worst test performance. A paired t-test is conducted for each scenario:
basic and basic with identical weights: p-value = 0.07464
clustered and clustered with identical weights: p-value = 0.000459
The p-value from the second t-test indicates that the neighborhoods’ weights do have influence on the algorithm performance. Those tests also reflect the hardness of tuning those weights (as the basic tuning fails to show significant improvement over the identical-weight configurations), and the advantage of clustered over basic.
5 Conclusion and future work
In this paper, we have proposed a systematic method to characterize neighborhood behaviours in a multi-neighborhood local search framework, where the probability of choosing a neighborhood at each iteration is chosen in an off-line manner. The characterization is based on the probabilities that a neighborhood will improve, worsen or do nothing on a solution, on the magnitudes of its improvement and worsening, and on its running time. We have observed that these characteristics change according to hardness of different regions in solution quality space. As a result, we design our method such that it tries to detect these regions based on collected information and represent neighborhood behaviours on them as feature vectors. Cluster analysis is then applied to form groups of similar neighborhoods. A tuning experiment with the automated algorithm configuration tool SMAC  shows that using these clusters gives a statistically significant improvement on test performances of the obtained algorithm configurations over the non-clustered version. It verifies the hypothesis that our characterization method is able to correctly reflect neighborhood behaviours on the given instance set. Since the information used in this method does not depend on a specific problem, the characterization and clustering procedure potentially can be applied in similar contexts. A first version of our method’s implementation has been available as a toolbox, and can be obtained by sending a request to the corresponding author. The toolbox receives log files containing necessary information collected during algorithm runs as input, and returns results of the cluster analysis, as well as box-plots and graphs for the visualization of observables and solution quality regions.
For future work, a multi-level tuning might be interesting. Firstly, a post-analysis on the importance of each cluster using fANOVA , which is an efficient approach to “quantify the effect of algorithm parameters”, can be applied. Then finer tuning on neighborhoods that belong to the most important clusters can be done. In addition, since our current method are only limited to a small number of instances, we are seeking for the possibility of an extension to a large set of instances. We might want to exploit problem-specific expert knowledge, e.g., instance features, in such a case.
This work is funded by COMEX (Project P7/36), a BELSPO/IAP Programme. We would like to thank Túlio Toffolo for his great support during the course of this research, Thomas Stützle and Jan Verwaeren for their valuable remarks. The computational resources and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by the Hercules Foundation and the Flemish Government – department EWI.
-  Holger H Hoos. Automated algorithm configuration and parameter tuning. In Autonomous search, pages 37–71. Springer, 2012.
-  Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization, pages 507–523. Springer, 2011.
-  Manuel López-Ibánez, Jérémie Dubois-Lacoste, Thomas Stützle, and Mauro Birattari. The irace package, iterated race for automatic algorithm configuration. Technical report, Citeseer, 2011.
-  Tony Wauters, Túlio Toffolo, Jan Christiaens, and Sam Van Malderen. The winning approach for the verolog solver challenge 2014: the swap-body vehicle routing problem. Proceedings of ORBEL29, 2015.
-  W Heid, G Hasle, and D Vigo. Verolog solver challenge 2014–vsc2014 problem description. VeRoLog (EURO Working Group on Vehicle Routing and Logistics Optimization) and PTV Group, pages 1–6, 2014.
-  Helena R Lourenço, Olivier C Martin, and Thomas Stützle. Iterated local search: Framework and applications. In Handbook of Metaheuristics, pages 363–397. Springer, 2010.
-  Edmund K Burke and Yuri Bykov. A late acceptance strategy in hill-climbing for exam timetabling problems. In PATAT 2008 Conference, Montreal, Canada, 2008.
-  B Mustafa Mısır, Stephanus Daniel Handoko, and Hoong Chuin Lau. Oscar: Online selection of algorithm portfolios with case study on memetic algorithms. Learning and Intelligent Optimization, page 59.
-  Raivo Kolde, Sven Laur, Priit Adler, and Jaak Vilo. Robust rank aggregation for gene list integration and meta-analysis. Bioinformatics, 28(4):573–580, 2012.
-  John Aitchison. A concise guide to compositional data analysis. In 2nd Compositional Data Analysis Workshop, CoDaWork’05,, 2005.
-  Juan José Egozcue, Vera Pawlowsky-Glahn, Glòria Mateu-Figueras, and Carles Barcelo-Vidal. Isometric logratio transformations for compositional data analysis. Mathematical Geology, 35(3):279–300, 2003.
-  Charles Bouveyron, Stéphane Girard, and Cordelia Schmid. High-dimensional data clustering. Computational Statistics & Data Analysis, 52(1):502–519, 2007.
-  Laurent Bergé, Charles Bouveyron, and Stéphane Girard. Hdclassif: An r package for model-based clustering and discriminant analysis of high-dimensional data. 2012.
-  Edmund K Burke, Michel Gendreau, Matthew Hyde, Graham Kendall, Gabriela Ochoa, Ender Özcan, and Rong Qu. Hyper-heuristics: A survey of the state of the art. Journal of the Operational Research Society, 64(12):1695–1724, 2013.
Frank Hutter, Holger Hoos, and Kevin Leyton-Brown.
An efficient approach for assessing hyperparameter importance.In
Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 754–762, 2014.