1 Dynamic Switching
The objective motivating this work is to create a solver that dynamically adjusts its search strategy, selecting the most appropriate heuristic for the subproblem at hand. At a high level, we want the solver to perform a standard branch and bound procedure, but before choosing the next branching variable and value, it analyzes the structure of the current subproblem using a set of representative features. Using this structural information, the solver can predict that a specific heuristic is likely better than the alternatives, and employ it to make the next decision. We refer to such a strategy as a Dynamic Approach for Switching Heuristics (DASH).
The specifics of DASH are described in Algorithm 1. Modeled after the ISAC approach [KaMaSeTi:10], DASH assumes that instances that have similar features share the same structure and so will yield to the same algorithm. We therefore employ clustering to identify these groups of instances. DASH is provided the current subproblem, the heuristic employed by the parent node, the centers of the known clusters, and the list of available heuristics. Because computing the features can be expensive, and because switching heuristics deeper in the search tree has a smaller impact on the quality of the search, DASH only considers switching the guiding heuristic up to a certain depth and only at predetermined intervals, choosing the parent's heuristic in all other cases. When a decision does need to be made, the approach computes the features of the provided subproblem and determines the nearest cluster based on the Euclidean distance. In theory, any distance metric can be used here, but in practice we found that Euclidean distance works well in the general case. Finally, DASH employs the heuristic that has been determined to be best for that cluster.
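The decision rule described above can be sketched as follows. This is a minimal illustration and not Algorithm 1 itself: the function names, the `features_fn` callback, and the reading of "interval" as a depth interval are assumptions, and the defaults `max_depth=10` and `interval=3` simply mirror the values reported later in the experiments.

```python
import math

def nearest_cluster(features, centers):
    """Return the index of the cluster center closest in Euclidean distance."""
    return min(range(len(centers)), key=lambda i: math.dist(features, centers[i]))

def dash_choose_heuristic(features_fn, depth, parent_heuristic, centers,
                          cluster_heuristic, max_depth=10, interval=3):
    """Choose the branching heuristic for the current node.

    features_fn       -- hypothetical callback computing the subproblem features
    cluster_heuristic -- mapping from cluster index to its assigned heuristic
    """
    # Below the depth limit or between checkpoints, inherit from the parent
    # and skip the feature computation entirely.
    if depth > max_depth or depth % interval != 0:
        return parent_heuristic
    features = features_fn()
    return cluster_heuristic[nearest_cluster(features, centers)]

# Toy usage with two clusters in a two-dimensional feature space.
centers = [[0.0, 0.0], [10.0, 10.0]]
cluster_heuristic = {0: "MF", 1: "PW"}
choice = dash_choose_heuristic(lambda: [9.0, 8.0], 3, "LF", centers, cluster_heuristic)
```

Passing the feature computation as a callback keeps it lazy, so its cost is only paid at the nodes where a switch is actually considered.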
As can be inferred from this algorithm, the key component that determines the success or failure of DASH is the correct assignment of heuristics to clusters. To train this, we follow a procedure similar to the one first described in [KaMaSe:12]. For each instance in the training set we compute an assortment of subproblems that are observed when using each of our heuristics. This extended problem set gives us a better overview of the types of subproblems DASH will encounter than the original training instances alone. Computing the features of the extended problem set, we cluster the instances. For this we employ g-means [HaEl:03], a general clustering approach that automatically determines the best number of clusters for the dataset in question. In particular, this clustering approach assumes that a good cluster is one that has a Gaussian distribution around the cluster center. Starting with all instances in a single cluster, g-means iteratively calls 2-means to split a cluster in two. If the new clusters are more Gaussian than the original, the split is accepted and the procedure continues. Once all the instances are clustered, the clusters with fewer instances than a certain threshold are absorbed by the nearest clusters.
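To make the "more Gaussian" criterion concrete, the sketch below shows one common way such a check is done: points are projected onto the line connecting the two candidate child centers and an Anderson-Darling normality statistic is computed on the projection. The function names, the small-sample correction factor, and the critical value used as the default are assumptions for illustration and are not taken from the implementation used in this work.

```python
import math
from statistics import NormalDist

def project(points, c1, c2):
    """Project points onto the direction connecting two candidate child centers."""
    v = [b - a for a, b in zip(c1, c2)]
    norm2 = sum(x * x for x in v)
    return [sum(p_i * v_i for p_i, v_i in zip(p, v)) / norm2 for p in points]

def anderson_darling(values):
    """Anderson-Darling normality statistic with a small-sample correction."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    z = sorted((x - mean) / std for x in values)

    def cdf(x):
        # Standard normal CDF, clamped so the logarithms below stay finite.
        p = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
        return min(max(p, 1e-12), 1.0 - 1e-12)

    s = sum((2 * i + 1) * (math.log(cdf(z[i])) + math.log(1.0 - cdf(z[n - 1 - i])))
            for i in range(n))
    a2 = -n - s / n
    return a2 * (1.0 + 4.0 / n - 25.0 / n ** 2)

def looks_gaussian(points, c1, c2, critical=1.8692):
    """True if the projected data passes the normality check (assumed threshold)."""
    return anderson_darling(project(points, c1, c2)) <= critical

# Deterministic toy data: Gaussian quantiles form one blob; shifting the two
# halves far apart produces a clearly bimodal set that should be split.
n = 200
quantiles = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
one_blob = [(x, 0.0) for x in quantiles]
two_blobs = ([(0.1 * x - 5.0, 0.0) for x in quantiles[:100]]
             + [(0.1 * x + 5.0, 0.0) for x in quantiles[100:]])
keep_single = looks_gaussian(one_blob, (-1.0, 0.0), (1.0, 0.0))
keep_bimodal = looks_gaussian(two_blobs, (-5.0, 0.0), (5.0, 0.0))
```

In this toy setting the single blob passes the check and is kept whole, while the bimodal set fails it, which is the situation in which a 2-means split would be accepted.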
Once all the subproblems are clustered, we have to determine which heuristic is best in which scenario. An important caveat, however, is that the decision to use a heuristic for a certain cluster also affects all other decisions. This is because DASH can switch heuristics several times, and the types of subproblems observed after applying one heuristic will likely differ from those observed after applying another. Therefore, we employ the parameter tuner GGA [AnSeTi:09] to simultaneously assign heuristics to all clusters, using only the original instances for training.
2 Experimental Setup
In order to set the stage for DASH, three things are necessary. First, we must have a descriptive feature set that can correctly distinguish between different classes of instances, but also do this with minimal overhead. Second, there must be a diverse set of heuristics each of which performs well on different kinds of instances. Finally, there must be a heterogeneous domain, with a large number of benchmark instances. We touch on all three of these components in this section.
We implement our feature computation and heuristics by extending the state-of-the-art MIP solver Cplex version 12.5 [Cplex]. Here, we only modify the built-in branching strategy by implementing a branch callback function based on Algorithm 1. Because all the tested approaches require this branch callback to be enabled, the comparability of the results is guaranteed. (Note that Cplex switches off certain heuristics as soon as branch callbacks, even empty ones, are used, so the entire search behavior could be different.) Finally, in order to obtain reliable results, we run each Cplex execution in single-core mode. The experiments were run on computers with dual quad-core Intel Xeon E5430 processors (2.66 GHz) and 12 GB of DDR-2 FB-DIMM 667 MHz memory.
2.1 Feature Space
The features have to capture as many aspects of the problems as possible without becoming too expensive to compute. To do this, we gather statistics about the problem definition of the remaining subproblem, a process similar to the one employed in [KaMaSe:12]. Specifically, we compute:
Percentage of variables in the subproblem;
Percentage of variables in the objective function of the subproblem;
Percentage of equality and inequality constraints;
Statistics (min, max, avg, std) of how many variables are in each constraint;
Statistics of the number of constraints in which each variable is used;
Depth in the branch and bound tree.
Wherever a feature concerns the problem variables, we compute the same feature separately for each variable type, i.e., continuous, integer, and binary. The resulting set is therefore composed of 40 features.
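As an illustration of how inexpensive such statistics are to gather, the following sketch computes a subset of the features listed above from a sparse description of a subproblem. The data layout (one `{variable: coefficient}` dictionary per constraint), the function names, and the choice of the original variable count as the denominator for the percentages are assumptions; the per-variable-type breakdown is omitted for brevity.

```python
import statistics

def summary(values):
    """Return (min, max, mean, population std) of a non-empty list."""
    return (min(values), max(values),
            statistics.fmean(values), statistics.pstdev(values))

def subproblem_features(constraints, senses, objective, n_original_vars, depth):
    """Compute a few structural features of a subproblem.

    constraints -- list of {variable: coefficient} rows (assumed non-empty)
    senses      -- list of '=', '<=' or '>=' with one entry per row
    objective   -- {variable: coefficient} for the objective function
    """
    active = set(objective)
    for row in constraints:
        active.update(row)
    # Number of constraints in which each remaining variable appears.
    usage = {v: 0 for v in active}
    for row in constraints:
        for v in row:
            usage[v] += 1
    n_eq = sum(1 for s in senses if s == "=")
    return {
        "pct_vars": 100.0 * len(active) / n_original_vars,
        "pct_obj_vars": 100.0 * len(objective) / n_original_vars,
        "pct_eq": 100.0 * n_eq / len(constraints),
        "pct_ineq": 100.0 * (len(constraints) - n_eq) / len(constraints),
        "vars_per_constraint": summary([len(row) for row in constraints]),
        "constraints_per_var": summary(list(usage.values())),
        "depth": depth,
    }

# Toy subproblem: 3 constraints over 4 of the 5 original variables.
features = subproblem_features(
    constraints=[{0: 1.0, 1: 1.0}, {1: 2.0, 2: 1.0, 3: 1.0}, {0: 1.0}],
    senses=["=", "<=", "<="],
    objective={0: 1.0, 2: 3.0},
    n_original_vars=5,
    depth=2,
)
```

Every quantity here is obtained in a single pass over the nonzero entries of the constraint matrix, which is what keeps the overhead low relative to solving the LP relaxation.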
2.2 Branching Heuristics
In order to realize and test our solving approach, we implemented a portfolio of six branching heuristics.
2.2.1 Most Fractional Rounding (MF)
One of the simplest MIP branching techniques is to select the variable whose value in the relaxed LP solution is most fractional, i.e., farthest from an integer, and to round it first. The reasoning behind this is to make decisions on the variables that the deterministic analysis is least certain about. This heuristic therefore strives to detect infeasibility as quickly as possible.
2.2.2 Less Fractional Rounding (LF)
As an alternative to MF, this technique selects the variable whose value in the relaxed LP solution is closest to an integer and rounds it first. This is done to gently nudge the deterministic reasoning in whatever direction it is currently pursuing, with the smallest chance of making a mistake.
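The two selection rules differ only in the direction of the comparison, as the sketch below shows. It covers variable selection only; the function names, the dictionary representation of the LP solution, and the integrality tolerance `tol` are assumptions made for the example.

```python
def fractionality(value):
    """Distance to the nearest integer: 0 is integral, 0.5 is most fractional."""
    return abs(value - round(value))

def fractional_candidates(lp_solution, tol=1e-6):
    """Map each variable with a non-integral LP value to its fractionality."""
    return {v: fractionality(x) for v, x in lp_solution.items()
            if fractionality(x) > tol}

def most_fractional(lp_solution):
    """MF: pick the variable farthest from an integer value."""
    cands = fractional_candidates(lp_solution)
    return max(cands, key=cands.get) if cands else None

def least_fractional(lp_solution):
    """LF: pick the non-integral variable closest to an integer value."""
    cands = fractional_candidates(lp_solution)
    return min(cands, key=cands.get) if cands else None

# Toy LP relaxation values; z is already integral and is never a candidate.
lp_solution = {"x": 0.5, "y": 2.9, "z": 3.0, "w": 1.3}
mf_choice = most_fractional(lp_solution)
lf_choice = least_fractional(lp_solution)
```

Here MF selects `x` (fractionality 0.5) and LF selects `y` (fractionality of about 0.1).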
2.2.3 Less Fractional And Highest Objective Rounding (LFHO)
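This heuristic is based on the same motivation as Less Fractional Rounding. For each subproblem we branch on the variable for which the pair p=(fr, -obj) is minimized (where fr is the fractionality and obj is the objective value). Comparing pairs lexicographically, this means that if we branch on a variable k in [1,n], the following property is guaranteed:
$$\forall i \in [1,n]:\quad \mathit{fr}_k < \mathit{fr}_i \;\lor\; \left(\mathit{fr}_k = \mathit{fr}_i \,\land\, \mathit{obj}_k \ge \mathit{obj}_i\right)$$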
2.2.4 Most Fractional And Highest Objective Rounding (MFHO)
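We use a modification of the previous approach, but this time we focus on the most fractional variables. For each subproblem we branch on the variable for which the pair p=(fr, obj) is maximized. Comparing pairs lexicographically, the guaranteed property for the chosen variable k in [1,n] is:
$$\forall i \in [1,n]:\quad \mathit{fr}_k > \mathit{fr}_i \;\lor\; \left(\mathit{fr}_k = \mathit{fr}_i \,\land\, \mathit{obj}_k \ge \mathit{obj}_i\right)$$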
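One natural way to read the pair comparisons in LFHO and MFHO is lexicographically: fractionality decides first, and the objective value only breaks ties. The sketch below takes that reading as an assumption; the function names and the candidate dictionary are illustrative.

```python
def lfho(candidates):
    """Less fractional, highest objective: minimize the pair (fr, -obj)."""
    return min(candidates, key=lambda v: (candidates[v][0], -candidates[v][1]))

def mfho(candidates):
    """Most fractional, highest objective: maximize the pair (fr, obj)."""
    return max(candidates, key=lambda v: (candidates[v][0], candidates[v][1]))

# Each candidate maps to (fr, obj). Variables a and b tie on fractionality,
# as do c and d, so the objective value decides within each pair.
candidates = {
    "a": (0.1, 2.0),
    "b": (0.1, 5.0),
    "c": (0.5, 1.0),
    "d": (0.5, 4.0),
}
lfho_choice = lfho(candidates)
mfho_choice = mfho(candidates)
```

Python compares tuples lexicographically, so the key functions encode the ordering directly: LFHO picks `b` and MFHO picks `d`, in both cases the tied candidate with the higher objective value.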
2.2.5 Pseudocost Branching Weighted Score (PW)
This heuristic is based on pseudocosts, numerical values that estimate the variation in objective value caused by rounding a variable up or down, called the up-pseudocost and down-pseudocost, respectively. The two pseudocosts of a variable can be combined in a score function that returns a single numeric value. This value is used to guide the branching: we choose the variable that maximizes the score. Further details can be found in [AcKoMa:04].
2.2.6 Pseudocost Branching Product Score (P)
This approach is based on the same idea as PW. The difference lies in the score function that is now the product of the two pseudocosts.
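The difference between the two pseudocost rules is easiest to see side by side. The sketch below uses a convex combination of the smaller and larger estimate for the weighted score, a form commonly associated with [AcKoMa:04], and a plain product for the second rule. The weight `mu`, the guard `eps`, and all function names are assumptions rather than the exact settings used in this work.

```python
def weighted_score(down, up, mu=1.0 / 6.0):
    """Weighted score: convex combination of the smaller and larger estimate."""
    return (1.0 - mu) * min(down, up) + mu * max(down, up)

def product_score(down, up, eps=1e-6):
    """Product score; eps keeps a zero estimate from erasing the other one."""
    return max(down, eps) * max(up, eps)

def select_by_score(pseudocosts, score):
    """Return the variable whose (down, up) estimates maximize the score."""
    return max(pseudocosts, key=lambda v: score(*pseudocosts[v]))

# Toy estimates as (down, up): x is lopsided, y is balanced.
pseudocosts = {"x": (1.0, 10.0), "y": (3.0, 3.0)}
pw_choice = select_by_score(pseudocosts, weighted_score)
p_choice = select_by_score(pseudocosts, product_score)
```

With these numbers the weighted score prefers the balanced variable `y` (3.0 against 2.5), while the product score prefers `x` (10.0 against 9.0), so the two rules can lead the search down different branches from identical pseudocost data.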
In order to obtain a solver that works well for a generic MIP problem, we collected instances from many different datasets: miplib2010 [MIPLIB2010], fc [At:01], lotSizing [AtMu:04], mik [At:03], nexp [AtNeSa:01], region [LePeSh:00], as well as pmedcapv, airland, genAssignment, scp, and SSCFLP, which were originally downloaded from [Saxena]. From an initial dataset of about 900 instances we removed those for which all our solvers timed out at 1,800 seconds. We then removed the easy instances, i.e., those solved entirely during Cplex presolving or in less than one second by every solver. This left a dataset of 341 instances with the desired properties, of which we randomly selected 180 for the training set and 161 for the testing set.
Table 1 shows the distribution of instances per cluster when we cluster our training data. Each row is normalized to sum to 100%. Thus, for Cluster 1, 25% of the instances are from the airland dataset. From this table we first observe that there are not enough clusters to separate the different datasets perfectly into unique clusters. Such a separation, however, is not what we would want to see, because we are more interested in capturing similarities between instances than in splitting benchmarks. Indeed, we observe that the region100 and region200 instances are grouped together. We also see that Cluster 4 logically groups the LotSizing and the SSCFLP instances together. Finally, we see that the instances from miplib, which are supposed to be an overview of all problem types, are spread across all clusters.
This clustering therefore demonstrates both that we have a diverse set of instances and that our features are representative enough to automatically identify interesting groupings.
3 Numerical Results
With the described methodology, the main question that needs to be addressed is whether switching heuristics can indeed be beneficial to the performance of the solver. To test this, for each of the instances in our test set, we ran each of the implemented heuristics without allowing any switching. We then also ran two versions of a solver that switched between heuristics uniformly at random. The first solver switched between all heuristics, while the second switched only among the four best heuristics. The results are summarized in Table 2.
What we observe is that neither of the random switching solvers performs very well by itself. However, the performance of the virtual best solver (VBS) that employs these new solvers shows that the performance can be improved beyond what is possible when always sticking to the same heuristic. (VBS is an oracle solver that, for every instance, always uses the strategy that results in the shortest runtime.) The question therefore becomes: if we can get improved performance just by switching between heuristics randomly, can we do even better if we do so intelligently?
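The virtual best solver is straightforward to compute from a table of per-instance runtimes, as sketched below. The function names are hypothetical and the runtimes are made-up numbers chosen for illustration; they are not results from this study.

```python
def virtual_best(runtimes):
    """For each instance, pick the solver with the shortest runtime.

    runtimes -- {solver: {instance: seconds}}, same instances for every solver
    Returns {instance: (best_solver, seconds)}.
    """
    instances = next(iter(runtimes.values())).keys()
    best = {}
    for inst in instances:
        solver = min(runtimes, key=lambda s: runtimes[s][inst])
        best[inst] = (solver, runtimes[solver][inst])
    return best

def mean_runtime(per_instance):
    """Average runtime over the instances of one solver."""
    return sum(per_instance.values()) / len(per_instance)

# Made-up runtimes in seconds for two fixed heuristics and a random switcher.
runtimes = {
    "MF": {"i1": 10.0, "i2": 50.0},
    "PW": {"i1": 30.0, "i2": 20.0},
    "random": {"i1": 8.0, "i2": 60.0},
}
vbs = virtual_best(runtimes)
vbs_mean = sum(seconds for _, seconds in vbs.values()) / len(vbs)
best_single_mean = min(mean_runtime(r) for r in runtimes.values())
```

In this toy table the random switcher has the worst average, yet it is the fastest solver on instance `i1`, so adding it to the portfolio lowers the VBS average below that of any single solver. This is the kind of effect the comparison above refers to.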
To answer this question, we must first set a few parameters of our solver. In particular, up to what depth should we allow our solver to switch heuristics, and at what interval? For this, we cluster the extended dataset that includes both the original training instances and the possible observed subproblems. A total of 10 clusters are formed. Figure 1 presents the feature space projected into two dimensions using Principal Component Analysis (PCA) [AbWi:10]. Here, the cluster boundaries are represented by the solid lines, and the best heuristic for each cluster is represented by a unique symbol at its center. In this figure, we also show the typical way in which features change as the problem is solved with a particular heuristic. The nodes are colored based on their depth in the tree, with (a) showing all the observed subproblems and (b) those of a single branch.
What this figure shows is that the features change gradually. This means that there is no need to check the features at every decision node. We therefore choose to check the subproblem features at every 3rd node. Similarly, this figure and those like it show that using a depth of 10 is reasonable, as in most cases the nodes do not span more than two clusters.
We use GGA to tune the parameters of DASH, computing the best heuristic for each cluster. We then present the results in Table 3 where we compare it to a vanilla ISAC approach that for a given instance chooses the single best heuristic and then does not allow any switching. What we observe is that DASH is able to perform much better than its more rigid counterpart. However, we do allow for the possibility that switching heuristics might not be the best strategy for every instance. We therefore also introduce DASH+, which first clusters the original instances using ISAC and then allows each cluster to independently decide if it wants to use dynamic heuristic switching.
Taking a lesson from [KrMa:11], which shows that the features are often not equally important, we tried to achieve better overall performance by adding a feature selection step. In this paper we use the information gain filtering technique, often used in decision trees. This method is based on computing the entropy of the data as a whole and for each class. We apply the feature filtering to ISAC and DASH+, referring to the results as ISAC_filt and DASH+filt, respectively, and obtain an improvement in both cases. In particular, the resulting solver DASH+filt performs considerably better than everything else.
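The information gain of a feature measures how much the entropy of the class labels drops once the data is split on that feature. The sketch below shows the computation for a single numeric feature and a single split point. The function names, the use of the best heuristic as the class label, and the fixed threshold are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels, threshold):
    """Entropy reduction from splitting a numeric feature at a threshold."""
    left = [y for x, y in zip(feature_values, labels) if x <= threshold]
    right = [y for x, y in zip(feature_values, labels) if x > threshold]
    n = len(labels)
    conditional = sum(len(part) / n * entropy(part)
                      for part in (left, right) if part)
    return entropy(labels) - conditional

# Toy data: the label is the (hypothetical) best heuristic per instance.
labels = ["MF", "MF", "PW", "PW"]
informative = [1.0, 2.0, 8.0, 9.0]    # separates the two classes perfectly
uninformative = [1.0, 8.0, 2.0, 9.0]  # each side of the split stays mixed
gain_good = information_gain(informative, labels, threshold=5.0)
gain_bad = information_gain(uninformative, labels, threshold=5.0)
```

A filter built on this score would rank all features by their gain and keep only the highest ranked ones; in the toy data the first feature earns the full one bit of gain and the second earns none.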
Finally, we show the performance of a virtual best solver that is allowed to use DASH. We observe that even though the current implementation cannot overtake VBS, future refinements to the portfolio techniques could achieve performance much better than techniques that rely purely on sticking to a single heuristic.
In this paper we introduce a Dynamic Approach for Switching Heuristics (DASH). Using MIP as the running example, we show how to automatically determine when a subproblem observed during a branch and bound search is significantly different from what has been observed before, and therefore warrants a change of tactics used while solving it. Employing a diverse set of instances we demonstrate that significant performance improvements are possible if a solver does not stick to using a single guiding heuristic.