Sample size calculations for the experimental comparison of multiple algorithms on multiple problem instances

by   Felipe Campelo, et al.
Aston University

This work presents a statistically principled method for estimating the required number of instances in the experimental comparison of multiple algorithms on a given problem class of interest. This approach generalises earlier results by allowing researchers to design experiments based on the desired best, worst, mean or median-case statistical power to detect differences between algorithms larger than a certain threshold. Holm's step-down procedure is used to maintain the overall significance level controlled at desired levels, without resulting in overly conservative experiments. This paper also presents an approach for sampling each algorithm on each instance, based on optimal sample size ratios that minimise the total required number of runs subject to a desired accuracy in the estimation of paired differences. A case study investigating the effect of 21 variants of a custom-tailored Simulated Annealing for a class of scheduling problems is used to illustrate the application of the proposed methods for sample size calculations in the experimental comparison of algorithms.



There are no comments yet.


page 1

page 2

page 3

page 4


Sample size estimation for power and accuracy in the experimental comparison of algorithms

Experimental comparisons of performance represent an important aspect of...

Assurance for sample size determination in reliability demonstration testing

Manufacturers are required to demonstrate products meet reliability targ...

Efficient and flexible simulation-based sample size determination for clinical trials with multiple design parameters

Simulation offers a simple and flexible way to estimate the power of a c...

Comparison of Classification Algorithms Towards Subject-Specific and Subject-Independent BCI

Motor imagery brain computer interface designs are considered difficult ...

Optimal Sample Size Planning for the Wilcoxon-Mann-Whitney-Test

There are many different proposed procedures for sample size planning fo...

Minimal sample size in balanced ANOVA models of crossed, nested and mixed classifications

We consider balanced one-, two- and three-way ANOVA models to test the h...

Reactive Sample Size for Heuristic Search in Simulation-based Optimization

In simulation-based optimization, the optimal setting of the input param...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Experimental comparison of algorithms has long been recognised as an essential aspect of research on meta-heuristics [2, 36, 47, 18]. Despite well-earned criticisms of some disconnect between theory and experimentation [35, 17] and a few methodological issues that still require adequate attention [24, 6], experimental evaluation remains a central aspect and a major component of the development and understanding of EAs.

This reliance on experimental assessment and comparison of algorithms is evidenced by the continuing effort of researchers in devising better experimental protocols for performance assessment and comparison of algorithms. While many of the most important points were presented as far back as the late 1990s [2, 47, 36], research into adequate protocols and tools for comparing algorithms has continued in the past two decades, with several statistical approaches being proposed and employed for comparing the performance of algorithms [18, 39, 59, 20, 60, 9, 11, 5, 7, 30, 29, 22, 16, 21, 8, 40, 33, 15, 13]. This increased prevalence of more statistically sound experiments in the field of optimisation heuristics can be seen as part of the transition of the area into what has been called the scientific period of research on metaheuristics [57].

Despite of this area-wide effort, a few important aspects of experimental comparisons of algorithms remain largely unexplored in the literature. One of those topics is the question of how to adequately determine the relevant sample sizes for the comparisons of algorithms - number of problem instances to use, and number of runs to employ for each algorithm on each instance. The standard approach has been that of maximizing the number of instances, limited only by the computational budget available [2, 30, 28, 22, 1], and of running arbitrarily-set numbers of repeated runs, usually 30 or 50. While it is indeed true that the sensitivity of comparisons increases with the number of instances, this does not mean that a large sample size can substitute a well-designed experiment [44, 46, 15]. Also, it is important to be aware that arbitrarily large sample sizes allow tests to detect even minuscule differences at arbitrarily strict significance levels, which may lead to the wrongful interpretation that effects of no practical consequence are strongly significant [46, 4] if certain methodological safeguards are not put into place when designing the experiment [15].

The dual issues of statistical power and sample size have been touched by a few authors in the past, albeit only superficially [38, 18, 19, 53, 4, 5, 7]. Until recently, the best works on the subject were those by Mauro Birattari [9, 10] and Chiarandini and Goegebeur [7, Ch. 10]. The former correctly advocated for a greater focus on the number of instances than on repetitions, showing that the optimal allocation of computational effort, in terms of accuracy in the estimation of mean performance for a given problem class, is to maximize the amount of test instances, running each algorithm a single time on each instance if needed. Birattari’s approach, however, was focussed only on accuracy of parameter estimation for the mean performance across a problem class, yielding very little information on the specific behaviour of each algorithm on each instance, as well as not considering questions of desired statistical power or sample size calculations. The work by Chiarandini and Goegebeur [7, Ch. 10] provided a good discussion on statistical power and sample size in the context of nested linear statistical models, as well as some guidelines on the choice of the number of instances and number of repeated runs based on the graphical analyses of power curves. It was, however, limited to nested models, and required manual inspection of power curves by the user, which may have precluded its broader adoption in the literature.

More recently, we have proposed a principled approach for calculating both the number of instances and number of repeated runs when comparing the performance of two algorithms for a problem class of interest [15]. That approach calculates the number of instances based on the desired sensitivity to detect some minimally relevant effect size (MRES), i.e., the smallest difference between algorithms that is considered to have some degree of practical relevance. Alternatively, the number of instances can be fixed a priori (e.g., when using standard benchmark sets) and the sensitivity curves can be derived instead. As for the number of repeated runs per instance, the approach proposed in [15] was based on minimising the number of algorithm runs required to achieved a the desired accuracy in the estimation of the paired differences in performance for each instance. While statistically sound, the fact that the results presented in that work applied only to comparisons of two algorithms limited their applicability to the general case of experimental comparisons of algorithms, which often aim to investigate the relative performances of multiple algorithms, or multiple variants of a given algorithmic framework.

In this paper we generalise the results from [15] to calculate the required sample sizes for the comparison of an arbitrary number of algorithms. The calculation of the number of instances is based on considerations regarding the test of multiple hypotheses that commonly occurs in these contexts, correcting the significance level of each test so as to maintain the familywise error rate controlled at a given desired level [55]. Optimality-based ratios of sample sizes are derived for sampling algorithms on each individual instance, both for the simple difference of means and for two types of percent differences. A simple sampling heuristic is also provided, which can easily generalise the sampling approach presented here for cases in which general performance statistics are of interest.222All methods presented in this paper are implemented in the form of R package CAISEr, available at [52].

The remainder of this paper is organised as follows: Section 2 explicitly defines the algorithm comparison problem considered in this work, states the hypotheses to be tested, and provides the rationale for the choices that justify the proposed approach for sample size calculation. Section 3 describes the proposed approach for sampling an arbitrary number of algorithms on any given problem instance, and presents the derivation of optimal sample size ratios for three common cases. Section 4 describes the general concepts behind the estimation of the required number of instances for an experiment to achieve desired statistical properties. Section 5 presents an example of application of the proposed methodology for the evaluation of 21 algorithmic variants of a state-of-the-art, custom-tailored simulated annealing approach [54] for a class of scheduling problems [58]. Finally, Section 6 presents final considerations and conclusions.

2 Problem Definition

Before proceeding to investigate the required sample sizes for the experimental comparison of algorithms, it is important to formally define the questions one is trying to answer when those comparisons are performed. While there are several different scientific questions that can be investigated by experimentation, arguably the most general case for experimental meta-heuristic comparisons is: given a finite subset of problem instances and a finite number of runs that can be performed by each algorithm being tested, what can be said regarding the relative performance of those algorithms (according to a given set of performance indicators) for the problem class from which the test instances were drawn? Notice that this question is broad enough to encompass the most common cases in the experimental comparison of algorithms, and even several comparisons of scientific relevance that are not routinely performed [24].

Aiming to highlight the specific aspects that are considered in the present work, we refine this problem definition as follows: Let represent a finite sample of problem instances drawn from a given problem class of interest, ; and let denote a set of algorithms333Each representing a complete instantiation of a given algorithmic framework, i.e., with both structure and parameter values fully specified that we want to compare. We define the algorithm comparison problem in the context of this work as that of obtaining a (partial) ordering of the algorithms according to their mean performance on the problem class of interest, based on the observed performance values across the set of test instances used. We assume here that (i) all algorithms can be run on the same sample of instances; (ii) any run of an algorithm returns some tentative solution, which can be used to estimate the performance of that algorithm for the problem instance; and (iii) the researcher is interested primarily in comparing the performance of the algorithms for the problem class , instead of for individual instances. Each of these three experimental assumptions can be associated with a specific aspect when comparing algorithms:

  • Assumption (i) indicates that the variability due to instance effects can be modelled out of our analysis by pairing or blocking [15, 50], which is the underlying approach of methods such as ANOVA with blocking, or Friedman’s test [20].

  • Assumption (ii) essentially prevents us from having to deal with the problem of missing values in our analysis, since every run will return some valid performance observation. The question on how to deal with missing values in the comparative analysis of algorithms - e.g., if a given algorithm fails to converge in a study where performance is measured by time-until-convergence - is a very relevant one, but it falls outside the scope of this particular work.

  • Assumption (iii) is associated with the fact that comparative testing of algorithms should, in the vast majority of cases, be focused on generalising the performance observed on limited test set to the wider class of problems . This contrasts with a somewhat common practice in the heuristics literature of performing individual Rank Sum tests on each instance and tallying up the wins/losses/ties, without any further inference being performed on these summary quantities. The problem with this approach for the comparison of algorithms on multiple problem instances has been recognised at least since 2006 [20], but despite of this the practice still persists in the literature.

Focusing on testing hypotheses regarding the (relative) performance of algorithms for a problem class of interest also determines two other aspects of the design and analysis of comparative experiments. The first is the issue of what is considered the effective sample size for these experiments. As argued in earlier works [15, 9, 7], the effective sample size to be considered when testing the hypotheses of interest is the number of instances, rather than the amount of repeated runs (or even worse, the product of these two quantities). Failure to account for this when performing inference results in pseudoreplication [37, 48, 43]

, that is, the (often implicit) use of artificially inflated degrees-of-freedom in statistical tests. This results in violation of the independence assumption underlying these tests, and results in actual confidence levels that can be arbitrarily inferior to the nominal ones.

The second aspect, which is a consequence of the first, is that the analysis of data generated by the experiments can be done using either a hierarchical model [31] or, alternatively, a block design applied on summarised observations, in which the performance of each algorithm on each instance is summarised as a single value, often the mean or median of multiple runs. This second approach is equivalent to the first under the assumption (iii) stated above, and results in the application of the well-known and widely-used methods for comparisons of the average performance of multiple algorithms on multiple problem instances: omnibus tests such as Complete Block Design (CBD) ANOVA or Friedman’s test [20]. Those tests are used to investigate whether at least one algorithm has an average performance different from at least one other. If that is the case, then multiple comparison procedures are employed to answer the more specific statistical questions of which algorithms differ from which in terms of their average performance [56, 50].

In the following, we detail the hypotheses of interest that are tested by these statistical procedures when performing comparative experiments with algorithms, and suggest that a greater focus on the multiple comparison procedures can provide experimenters with better tools for designing and analysing their experiments.

2.1 Test Hypotheses

Let denote the performance of algorithm on instance . Based on the definitions from the preceding section, we can decompose the performance according to the usual model of the complete block design [49]:


in which denotes the mean performance of algorithm for the problem class after the (additive) effect of each instance, , is blocked out; and is the residual relative to that particular observation, i.e., the term that accounts for all other unmodelled effects that may be present (e.g., uncertainties in the estimation of ). Notice that the algorithm mean can be further decomposed into a grand mean of all algorithms for the problem class of interest, , and the effect of algorithm on this grand mean, represented by . By construction, .

As mentioned earlier, an usual approach for comparing multiple optimisation algorithms on multiple problem instances (which is an analogous problem to that of testing multiple classifiers using multiple data sets

[20]) employs the usual 2-step approach provided in most introductory texts in statistics. It starts by using omnibus tests, such as CBD ANOVA or Friedman’s test, to investigate the existence of some effect by testing hypotheses of the form:


which is equivalent to testing whether all algorithms have the same mean value versus

the existence of at least one algorithm with a different mean performance for the problem class of interest. If the omnibus test returns a statistically significant result at some particular confidence level, pairwise comparisons are then performed as a second step of inference, to pinpoint the significant difference(s). These are usually performed using common tests for the difference of means of two matched populations, such as variations of the paired t-test for post-ANOVA analysis, or Wilcoxon’s Signed Ranks test (or the Binomial Sign Test) for post-Friedman analysis.

There are several different sets of pairwise comparisons that can be executed in this step, but the two most commonly used ones are all vs. all and the all vs. one. In the former, tests are performed on the paired difference of means (again, with instance as the pairing factor) for all pairs of algorithms :


in which represents the mean of the paired differences in performance between and . All vs. all pairwise comparisons are the default approach in the meta-heuristics literature, but in practice are only required when one is really interested in obtaining a complete “ordering” of all algorithms in a given experiment.

The second type of pairwise comparisons is the all vs. one, in which there is a reference algorithm (e.g. a new proposal one is interested in evaluating). Assuming that the first algorithm, , is always set as the reference one, all vs. one pairwise comparisons are defined as:


for all . This approach results in only hypotheses to be tested and, as we will discuss later, results in inferential procedures with higher statistical power or, if sample sizes are being estimated, in experiments requiring fewer instances to achieve a given desired power [46].444Statistical power (or sample sizes) can be further improved if the alternative hypotheses in (4) are directional,

which can be employed if the researcher is specifically interested in knowing whether or not the reference algorithm is superior (in terms of mean performance) to the others (i.e., if equality or inferiority are considered equally undesirable outcomes). Of course it makes no sense trying to use directional alternative hypotheses in the all vs. all case, since in that case there is no reference algorithm common to all tests.

It is important to highlight at this point that the ANOVA or Friedman tests are not strictly required when performing the comparison of multiple algorithms. From an inferential perspective, it is equally valid to perform only the pairwise comparisons, as long as the rejection threshold of each hypothesis tested is adequately adjusted to prevent the inflation of Type-I error rates resulting from multiple hypothesis testing


. While it is true that post-ANOVA pairwise tests can usually benefit from increased sensitivity (by using the more accurate estimate of the common residual variance that results from the ANOVA model), this improvement of statistical power is only achieved under the assumption of equality of variances, which is rarely true in algorithm comparisons. We argue that this potential (albeit occasional) disadvantage is more than compensated by a range of possible benefits that emerge when we focus on the pairwise comparisons:

  • Tests for the mean of paired differences are generally much simpler than their corresponding omnibus counterparts, and generally require fewer assumptions. For instance, unlike the CBD ANOVA, the paired t-test does not require equality of variances between algorithms, nor does it assume that the data from each algorithm is normally distributed – only the paired differences need be approximately normal, and not even that if the sample size is large enough. These tests also yield much more easily interpretable results, and are considerably simpler to design and analyse.

  • In the context of experimental comparison of algorithms one is commonly interested in relatively few tests: the number of algorithms compared is usually small, and in several cases tests can be performed using the all vs. one approach (e.g., when assessing the performance of a proposed method in comparison to several existing ones). This alleviates the main issue with testing multiple hypotheses, namely the loss of statistical power (or, alternatively, the increase in the sample size required to achieve a certain sensitivity) that results from adjusting the significance levels to control the Type-I error rate. [55].

  • From the perspective of estimating the number of required instances for algorithm comparisons, one of the most challenging quantities that need to be defined in the design stage of these experiments is a minimally relevant effect size (MRES) [15]

    . The definition of MRES in the context of omnibus tests is very counter-intuitive, since it must be defined in terms, e.g., of the ratio of between-groups to within-groups variances. While still challenging, the definition and interpretation of MRES in the context of comparing a pair of algorithms is incomparably simpler: it essentially involves defining the smallest difference in mean performance between two algorithms (either in terms of raw difference or normalised by the standard deviation) that would have some practical consequence in the context of the research being performed


Given the considerations above we argue that, while it is common practice to perform the experimental analysis as the 2-step procedure outlined earlier, it is often more practical to focus on the pairwise tests, particularly from the perspective of sample size calculations. In the following sections we detail how the two sample sizes involved in the experimental comparison of algorithms – number of instances, number of runs of each algorithm on each instance – can be estimated by focusing on the pairwise comparisons. This of course does not preclude the use of omnibus tests as part of the analysis, or of any other type of analytical tool for investigating and characterising the behaviour and performance of the algorithms being tested. In fact, sample sizes estimated by focusing on the pairwise comparisons result in omnibus tests with at least the same statistical power for which the pairwise tests were planned, assuming that their assumptions, which are often more restrictive, are satisfied. However, the use of omnibus tests becomes optional, and the analyst can choose to proceed directly with the pairwise tests knowing that the statistical properties of their experiment are guaranteed.

Finally, based on the statistical model from (1) we can define the two kinds of paired differences in performance, which we will examine in this work. The first one is the simple paired difference in performance, , which can be estimated from the data as


in which and are the mean (or median) observed performance of algorithms and on instance , computed from and runs, respectively. An alternative approach is to use percent differences, which can be estimated for the all vs. one case as


and for the all vs. all case as


in which is the estimate of the grand mean of all algorithms on instance ,


3 Calculating the number of repetitions for comparisons of multiple algorithms

3.1 Preliminary Statistical Concepts

Before we present the derivation of sample size formulas, there are a few statistical concepts that need to be clarified. More information and detailed discussions of these concepts are available in our previous work [15], and here we will only present a very brief review of these definitions.

The minimally relevant effect size (MRES), which is an essential concept for determining the smallest sample size in the comparison of algorithm, is defined in the context of this work as the smallest difference between two algorithms that the researcher considers as having some practical effect. There exists several different effect size estimators that are relevant for a variety of experimental questions [25], but in the context of the hypotheses considered in this work we focus on Cohen’s coefficient,


which is the (real) value of the mean of the paired differences555More generally, the denominator of Cohen’s

is the magnitude of the deviation between the actual value of a parameter and the value suggested under the null hypothesis. However, in this work we always assume the null hypothesis to suggest a difference of zero, hence the simpler definition.

normalised by , the (real) standard deviation of the paired differences. In the context of this work, the value in the numerator of (9) is estimated as the mean of the values calculated using (5) (for the simple paired difference) or (6)–(7) (for the percent paired differences).

Based on this definition, the MRES is defined as


i.e., as the smallest magnitude of the standardised mean of paired differences between two algorithms that would actually result in some effect of practical relevance.666Different areas have distinct standards of what constitutes small or large effect sizes in terms of [25], and domain-specific knowledge needs to be considered when determining . We provide some broad guidelines in our previous work [15], but a broader investigation of generally accepted MRES values in the experimental research on meta-heuristics is yet to be conducted.

Let denote the observation of performance of algorithm on the -th run on a given instance , and


denote the sample estimator of the mean performance of on that instance, based on independent runs. Under relatively mild assumptions, we know that the sampling distribution of means converges to a Normal distribution even for reasonably small sample sizes, with


where the

term is the squared standard error of the sample mean.

Given a single problem instance and the algorithms one wishes to compare, the approach for calculating the number of repetitions is similar to the one outlined in [15], namely finding the smallest total number of runs such that the standard errors of estimation – which can be interpreted as the measurement errors of the pairwise differences in performance – can be controlled at a predefined level. This can be expressed as an optimisation problem,


in which is the number of runs of algorithm on instance , is the standard error of estimation of the relevant statistic (e.g., simple or percent difference of means) for a pair of algorithms . The pairs of interest depend on the types of comparisons planned. While there can be a multitude of comparison types that can be performed, as said before, two comparisons are of particular interest in experimental algorithm: all vs. all and one vs. all. All pairs are of interest in the all vs. all case while, In the all vs. one case, if is defined as the reference algorithm, then the pairs of interest are .

It is worth noting that the most adequate formulation for this problem is an integer one, but in this work we employ a relaxed version of the formulation that accepts continuous values. Since this formulation is used only to derive reference values for the optimal ratio of sample sizes (which can be continuous even under the discrete sample size constraint) this relaxation does not result in any inconsistencies. Also notice that there is a second set of constraints that is omitted from formulation (13), namely , which are needed to guarantee that standard deviations can be estimated for each algorithm. We did not explicitly include these constraints since the proposed method assumes that all algorithms will be initially run a number times before iterative sample size estimations start, so these constraints will always be satisfied.

The specific calculation of the standard errors depends on the type of differences being considered: simple differentes or percent differences and, in the latter case, whether the planned comparisons are an all vs. one or all vs. all case. The results for the comparison of two algorithms were presented in our previous work [15], and in this section we generalise those results to the case of comparisons of multiple algorithms. We also introduce a heuristic that can be used to estimate the numbers of repetitions in the case of more general statistics that may be of interest in the context of experimental comparison of algorithms.

3.2 Derivation of optimal ratios

While it was possible to derive an analytic solution to the problem of determining the optimal ratio of sample sizes in the two algorithm case [15], a similar closed solution does not seem to be possible for the general problem (13) with . It is, however, possible to derive optimality-based ratios of sample sizes for any pair of algorithms, which allows us to propose a principled heuristic to generate adequate samples for a set of algorithms on any individual instance. The proposed heuristic iteratively generates observations of performance for each of the algorithms considered, so that the constraints can be satisfied with as few total algorithm runs as possible. This heuristic is described in Algorithm 1.

1:Instance ; Algorithms ; accuracy threshold ; initial sample size ; maximum sample size .
2:Generate initial samples of each algorithm on instance .
3:Estimate all of interest
4:while  and  do
6:     Perform one additional run of the algorithm that is contributing the most to on instance
7:     Estimate all of interest.
8:end while
9:return Vectors of observations generated for all algorithms.
Algorithm 1 Sample algorithms on a single instance.

The procedure detailed in Algorithm 1 is general enough to accommodate any statistic one may wish to use for quantifying the pairwise differences of performance between algorithms, as long as the standard errors and their sensitivities to changes in can be estimated (which can be done using bootstrap, if analytical expressions are not available). In the case where analytical expressions for the optimal ratio of sample sizes between two algorithms are available (see below), the determination of the algorithm that is contributing the most to on a given instance (line 5 of the algorithm) can be done based on these optimal ratios: if the observed ratio is smaller than the optimal one, run the algorithm in the numerator; otherwise, run the one in the denominator.

Another point worth mentioning is that the procedure outlined in Algorithm 1 is essentially a greedy approach to reduce the worst-case standard error, which means that even if it is interrupted by the computational budget constraint the resulting standard errors of estimation will be the smallest ones achievable. This, in turn, means that the residual variance due to these estimation uncertainties will be as small as possible, which in turn increases the sensitivity of the experiment to smaller differences in performance by reducing, even if slightly, the denominator term in (9).

The derivation of the standard errors and optimal ratios of sample sizes, for the simple and percent differences of means, both for the all vs. one and all vs. all cases, are presented below. Notice that while we can derive optimal sample size ratios for all paired differences of interest, Algorithm 1 will result in ratios that can deviate (sometimes substantially so) from the optimal reference values. This is a consequence of the fact that ratios of sample sizes are not independent quantities when multiple algorithms are compared. In practice, pairs that present the worst-case standard error () more often will tend to have ratios of sample sizes closer to the optimal values derived below, with the other pairs deviating from optimality due to the algorithm focusing the sampling on the pairs that are associated with . This, however, does not reduce the validity and usefulness of the optimality derivations (as they are needed to decide which algorithm from the selected pair should receive a new observation), nor does it affect the quality of the results provided by Algorithm 1.

Finally, for the sake of clarity we will omit the instance index from all derivations in the next section, but the reader is advised to keep in mind that the calculations in the remained of this section are related to a single given instance.

3.2.1 Simple differences of means

In the case of the simple differences of means, the statistic of interest for any pair of algorithms is


which is estimated using the sample quantities


Given (12), the sampling distribution of will be


with squared standard error


For any pair of algorithms , the optimal sample size ratio for a given instance can be found using the formulation in (13):


This formulation has a known analytical solution [15] in terms of the ratio of sample sizes:


which can be rewritten in terms of sample estimators as


with being the sample standard deviations of the observations of algorithms , respectively. In this case the decision rule for which algorithm should receive a new observation (line 5 of Algorithm 1) can be expressed as: let be the two algorithms that define . Then:


It is worthwhile to notice that this result is independent on whether one is interested in all vs. all or all vs. one comparisons, the only difference being which pairs generate constraints in (13).

3.2.2 Percent differences of means

For the percent differences of means777Strictly speaking the difference discussed here is the proportional (per unit) difference, which must be multiplied by a factor of to be actually expressed as percent. This, however, has no effect in the derivations that follow, and is merely a matter of linear scaling. we need to distinguish between the two types of comparisons discussed in this work.

All vs. one:

For all vs. one comparisons, assuming is the reference algorithm, the statistic of interest is given as:


which can be interpreted as the percent mean loss in performance of when compared to . This is estimated using


which is distributed according to one plus the ratio of two normal variables , with

If we can assume that all algorithms involved return observations that are strictly positive (which is quite common, e.g., in several families of problems in operations research as well as for several common performance indicators in multiobjective optimisation), the squared standard error of (23) can be calculated as [15, 26, 27]:



The KKT conditions for the problem formulation in (18) state that, at the optimal point the following should be true:


The gradient of the objective function is trivially derived as , and the partial derivatives of are:


The first row of (25) can then be expanded as:


which means that at the optimal point.888Equations (27) also mean that the Lagrange multiplier must be strictly greater than zero, which in turn means that is active at the optimal point, as expected from a sample size minimisation perspective. Isolating the ratio of sample sizes and simplifying finally yields:


which can be estimated from the sample using:


Given this expression for , the same rule expressed in (21) can be used in Line 5 of Algorithm 1, simply replacing by .

All vs. all:

For all vs. all comparison using percent differences the question of which mean to use as the normalising term in the denominator must be decided. Assuming that no specific preference is given to any of the algorithms in this kind of comparison, we suggest using the grand mean instead of any specific mean , i.e.:


which can now be interpreted as the difference between the means of and , in proportion to the grand mean. This is estimated using



is an estimator of the grand mean . This estimator is unbiased regardless of unequal sample sizes,


and has a squared standard error of:


Under the same mild assumptions under which , we have that the sampling distribution of this estimator is progressively closer to a normal distribution as the sample sizes are increased. These considerations allow us to calculate the standard error of (31) in the same manner as (24), using the simplified formulation of Fieller’s standard error [26, 27], i.e.:



with defined as in (30). This new expression for the standard error leads to (26) being replaced by:


which, when inserted into the KKT conditions (25) yields:


and, consequently,


which is identical to the optimal ratio for the simple differences of means, and can be similarly estimated using the sample standard deviations. Once again, the same decision rule from (21) can be used.

4 Calculating the number of required instances for comparisons of multiple algorithms

In this section we deal with the calculation of the number of instances, , required to obtain an experiment with predefined statistical properties. In a previous work [15]

we developed sample size formulas for the comparison of two algorithms on a problem class of interest. Since we have reduced the multiple algorithm comparison problem to a series of pairwise comparisons (at least for the purposes of sampling), most of the theoretical justifications remain the same, and the only actual difference in the calculation of the required number of instances is the need to adjust the significance levels so as to maintain the overall probability of Type-I errors (false positives) controlled at the desired level


The calculation of the number of instances for comparing two algorithms is based on the definition of a few desired statistical properties for the test [15]. More specifically, we want the test to have a predefined statistical power, , to detect differences equal to or greater than a minimally relevant effect size , at a predefined significance level . This led to the definition of the smallest required number of instances for the comparison of two algorithms (based on the paired t-test, for a two-sided alternative hypothesis) as:


in which is the

-quantile of the central t distribution with

degrees of freedom, and is the -quantile of the non-central t distribution with degrees of freedom and noncentrality parameter . The required sample sizes can be further reduced in the all vs. one comparisons if one-sided alternative hypotheses can be used, in which case we simply replace by in (38).999The derivation of these results and a review of the statistical concepts associated with it are available in [15]. In that work we also showed how these results can be easily extended to the design of experiments based on nonparametric alternatives to the paired t-test, such as Wilcoxon’s Signed Ranks test or the Binomial Sign test, by using the asymptotic relative efficiency of these tests [50, 56]. The same generalisations apply to the current work.

Since our proposed protocol for sample size calculation is based on designing the experiments from the perspective of the pairwise tests that will need to be performed, the sample size formulas outlined above remain valid, and we only need to adjust the significance level to account for the testing of multiple hypotheses. As mentioned previously, we have two common cases in the experimental comparison of algorithms: all vs. all comparisons (3), which result in tests; and all vs. one comparisons (4), in which case only comparisons are performed.

As discussed by Juliet Shaffer in her review of multiple hypothesis testing [55], “when many hypotheses are tested, and each test has a specified Type I error probability, the probability that at least some Type I errors are committed increases, often sharply, with the number of hypotheses”. This is quite easy to illustrate if we consider that each test has a base probability of falsely rejecting a true null hypothesis given by its significance level . If comparisons are performed, and assuming that (i) the comparisons are independent, and (ii) all null hypotheses are true, the overall probability of observing one or more false positives (commonly called the family-wise error rate, or FWER) can be calculated as

As an example, if we consider a somewhat typical case of all vs. all comparisons of 5 algorithms (which would generate hypotheses) with a per-comparison , this would result in an , i.e., a chance of around of falsely rejecting at least one of the null hypotheses.

To prevent this inflation of the FWER, a wide variety of techniques for multiple hypothesis testing are available in the literature. Possibly the most widely known is the Bonferroni correction [23, 55], which can maintain the FWER strictly under a desired level by using, for each comparison, a reduced significance level (or, equivalently, by multiplying all p-values by and comparing against the original ). This correction is known to be overly conservative, leading to substantial reductions of the statistical power of each comparison, or to substantially higher sample sizes being required to achieve the same power. However, its simplicity often makes it quite appealing, particularly when few hypotheses need to be tested, in which case the penalisation applied to is not too extreme. If the Bonferroni correction is selected to control the FWER in a given experiment, the calculation of the number of instances can be performed by simply dividing the value of in (38) by the number of planned comparisons ( for the all vs. one case, or for all vs. all). Figure 1 illustrates an example of the increase in the required sample size for Bonferroni-corrected tests as the number of comparisons is increased.

Figure 1: Example of required sample sizes for Bonferroni-corrected tests in a common case with and , considering a two-sided alternative, in comparison with the sample sizes required for obtaining the same mean power under Holm’s correction. For reference, fifteen comparisons would represent all vs. all comparisons involving six algorithms.

A uniformly more powerful alternative that also controls the FWER at a predefined level is Holm’s step-down method [34, 55]. Instead of testing all hypotheses using the same corrected significance level, Holm’s method employs a sequential correction approach. First, the p-values for all tests are computed. The hypotheses and their corresponding p-values are then ranked in increasing order of p-values, such that . Each of these ordered hypotheses is then tested at a different, increasingly stricter significance level,


in which is the rank of the hypothesis being tested. If is the largest value of for which a hypothesis would be rejected at the corresponding level, then Holm’s step-down method leads to the rejection of all null hypotheses with ranks . Holm’s method is also known to maintain the FWER under the desired value [55], but it leads to significantly less conservative tests than the Bonferroni correction. However, while the calculation of the number of instances is quite straightforward for the Bonferroni correction, it requires some clarification when Holm’s method is employed. This is because each test is performed using a different significance level, which leads to individual tests with heterogeneous power levels under a constant sample size.

One possible approach, which is found in the statistical literature, is to calculate the sample size based on the least favourable condition. We can calculate the sample size such that the test with the most strict significance level will have at least the desired power , which guarantees that the statistical power for all other comparisons will be greater than . A simple examination of (39) shows that the most strict test will occur for , in which case , i.e., the same corrected significance level generated by the Bonferroni correction. In this case, the same simple adjustment can be made for the formulas in (38), namely dividing the value of by . A related possibility, which generalises this one, is to calculate the sample size so that some predefined number of tests can have a statistical power of at least . This can also be easily set up by determining a number of comparisons that should have power levels above the predefined threshold, and then divide by in (38).

A second possibility, which is also relatively straightforward, is to design the experiment so that the mean (or median) power of the tests is maintained at the nominal level. The median case is roughly equivalent to setting and applying the method outlined above. Estimating sample size for achieving a mean power of is more challenging from an analytic perspective, but can be done iteratively without much effort using Algorithm 2. In this pseudocode denotes the calculation of the required sample size for performing a hypothesis test (e.g., using a paired t-test) with significance level , power , MRES and type of alternative hypothesis , as defined in (38). Similarly, calculates the power of a test procedure under the same parameters, assuming a sample size of . Notice that since power increases monotonically with , the final sample size returned by the procedure outlined in Algorithm 2 is guaranteed to be smaller than the result based on the least favourable condition.

1:Desired mean power ; desired FWER ; type of alternative hypotheses (one or two-sided) ; number of comparisons ()
3: Using (38)
4:while  do
6:     for  do
7:          See [15] for details.
8:     end for
9:      Calculate mean power
10:end while
Algorithm 2 Estimate sample size for mean power in Holm’s procedure.

Figure 2 illustrates the resulting power profile of an experiment designed for mean power using the procedure outlined in Algorithm 2. Two interesting aspects of this planning are immediately clear. First, the worst-case power appears to stabilize at about , which is not much lower than the desired , and would still be considered an acceptable power level in most applications. The second one is that the resulting sample sizes are, as expected, lower than those required if the Bonferroni correction were to be used.

Figure 2: Example of the resulting power of each comparison when designing experiments focused on mean power using Holm’s method, in a common case with and , considering a two-sided alternative. The lines illustrate the mean, best and worst-case power values, and the gray dots represent the power of each comparison. The labels following the mean-power line inform the sample sizes required for each value of . Notice that the worst-case power seems to stabilize around in this particular case.

Based on these considerations, we can offer some suggestions as to how to proceed calculating the required number of instances () for the comparison of multiple algorithms using multiple problem instances:

  • It is more efficient, often considerably so, to use Holm’s correction instead of Bonferroni’s, given that the former requires smaller sample sizes to achieve similar power characteristics (see Figure 1).

  • In general it seems reasonable to design experiments based on average power for Holm-corrected tests, given that the resulting worst-case power will not be much lower than the nominal value , and the number of instances is substantially lower than what would be required for Bonferroni-corrected tests (see Figure 2).

  • If worst-case power must be guaranteed, the number of instances can be estimated using the formulas for the Bonferroni case, but the actual analysis should still be conducted using Holm’s method, which in this case will result in better statistical power for all comparisons.

  • Whenever possible, all vs. one comparisons should be used instead of all vs. all, since they result in fewer hypotheses being tested and, consequently, smaller sample sizes (or larger power if sample sizes are constant).

  • If one-sided alternative hypotheses make sense in the experimental context of all vs. one comparisons they should also be preferred, since this also improves efficiency (in terms of requiring fewer instances to achieve a predefined power).

It is worth repeating here an important point also highlighted in our previous work [15]: the sample sizes calculated with the methodology presented in this section represent the smallest sample size required for a given experiment to present desired statistical properties. This can be particularly useful when, e.g., designing new benchmark sets, sampling a portion of an existing (possibly much larger) set of test functions as part of algorithm development, or designing experiments in computationally expensive contexts. If more instances are available in a particular experiment they can of course be used, which will result in experiments with even higher statistical power and can enable the execution of finer analyses, e.g., to investigate the effect of dimension or other instance characteristics on performance. However, even if that is the case, we argue that the proposed methodology can still be useful for a number of reasons:

  • It can be used by researchers to obtain a more solid understanding of the statistical properties of their experiments (e.g., by examining power effect size curves at a given sample size).

  • By defining a MRES a priori, the proposed approach provides a sanity check for researchers in overpowered experiments, with can result in arbitrarily-low p-values even for minuscule differences of no practical consequence [46]. That is, it provides a practical relevance aspect, to contrast with the statistical significance of the results.

  • Even if the calculation of the number of instances is considered unnecessary in the context of a given experiment, the methodology for estimating the number of repeated runs of each algorithm on each instance can still be used to provide a statistically principled way to generate the data.

In the next section we provide an example of application of the proposed methodology for investigating the contribution of different neighbourhood structures for a scheduling problem. Please notice, however, that the main objective of the experiment is to demonstrate the capabilities of the proposed methodology for defining the sample sizes, and not necessarily to answer specific questions regarding the algorithms used in the example.

5 Application Example

5.1 Problem Description

In the unrelated parallel machine scheduling problem (UPMSP) with sequence dependent setup times [32, 42, 58] a number of jobs, , needs to be processed by a number of machines, , minimising the completion time of the last job to leave the system, i.e., the makespan. This scheduling problem also presents a few other specificities which are not relevant for the current discussion, but are presented in detail in the relevant literature [58, 54].

So far the best algorithm for the solution of this problem seems to be a finely-tuned Simulated Annealing (SA) proposed by Santos et al. in 2016 [54]. Santos’ SA was tested on a benchmark set originally introduced by Vallada and Ruiz [58], which is composed of 200 tuning instances with jobs and machines. For each pair, eight instances with different characteristics (e.g., distribution of setup times) are present. A larger set, composed of 1000 instances, is also available to test hypotheses derived using the tuning set.

One of the core aspects of Santos’ SA is the use of six neighbourhood structures to explore the performance landscape of the problem, namely Shift, Two-shift, Task Move, Swap, Direct Swap, and Switch. At each iteration one of these perturbation functions is selected randomly and generates a new candidate The authors do not provide any discussion related to how much each of these perturbation functions affects the performance of the algorithm. However, preliminary results [45, 51] suggest that the six neighbourhood structures have considerably different effects on the algorithm, with Task move having a much more critical influence than the others.

5.2 Experimental questions

To further investigate the effect of the different neighbourhoods on the performance of the algorithm, we designed an experiment in which the full algorithm (i.e., equipped with all six neighbourhood structures) was compared against variants with perturbation functions suppressed from the pool of available movements. We focussed on removing at most two perturbations from the pool, resulting in 21 algorithm variants (6 without a single perturbation, and 15 without two perturbations) which enabled the quantification of contributions of neighbourhood structures to the algorithm’s ability to navigate the performance landscape of this particular problem class.

Given that we wanted to investigate the effects of removing these neighbourhood functions from the full algorithm, an all vs. one design was be the most appropriate, resulting in a total of 21 hypotheses to be tested. Also, the variability in instance hardness suggested the use of percent differences (see Section 3.2.2), which quantify differences in a scale that is usually more consistent across heterogeneous instances.

5.3 Experimental parameters

We set the MRES at a (reasonably high) value of , i.e., we were interested in detecting variations in the mean of percent differences of at least half a standard deviation. We decided to design our experiment with a mean power , under a familywise significance level using Holm’s correction. To prevent excess variability in the estimates of within-instance differences from inflating the overall residual variance of the experiment, we set the desired accuracy for the estimates as , i.e., a error in the estimates of paired percent differences on each instance. This relatively good accuracy can also be useful if we later want to further investigate the differences on individual instances, or even instance subgroups.

The method described in Section 4 yielded instances as the smallest amount necessary to obtain an experiment with the desired properties.101010The full methodology for calculating the number of instances and algorithm runs is implemented in the R package CAISEr version 1.0.5, available at The Java implementation of Santos’ algorithm, as well as the instances used, can be retrieved online from At this point we could have opted to use any number of instances greater than or equal to this value, up to the full available set of 200 [58]. We opted to use only the suggested value of 57 not only because it was enough to guarantee the desired statistical properties for our experiment, but also to maintain a reserve of “untouched” tuning instances, which could be useful for further testing of eventual proposals of algorithmic improvement.111111This is also a way to safeguard against overfitting our algorithms to the benchmark set, which is a common problem in algorithm design for problems with standard benchmark sets, such as the one being considered here [6, 35, 36].

The algorithms (full algorithm plus the 21 variants) were sampled on each instance under a maximum budget of runs per instance, using an initial sampling of runs/algorithm/instance.

5.4 Experimental results

Figure 3 shows the standard errors obtained for all algorithm pairs, as well as the total sample sizes generated, for each instance used in this experiment. It is very clear that the proposed approach for calculating the number of runs was able to control the standard errors at the desired level of using, in most cases, a fraction of the total available computational budget. In only three instances was the available budget insufficient to bring all standard errors under the desired threshold, but even in these cases the resulting standard errors were controlled at relatively low values. These three particular instances all had jobs (one with and two with machines), which suggests further investigation into the behaviour of the methods tested on this particular subset of instance sizes.

Figure 3: Standard errors of estimation of the percent differences in performance between the full algorithm and each of the 21 variants, for all instances tested. The numbers at the top indicate the total number of runs performed on each instance, and the labels on the x-axis provide the instance identifiers. Notice that only in three cases the allocated budget was insufficient to reduce all standard errors below the preset limit of .
Figure 4: Distribution of , i.e., of the number of runs for each algorithm variant on each instance. The majority of runs were allocated either to the full algorithm - which was involved in all comparisons - or to variants that had the Task Move (TSK) suppressed, suggesting a strong effect of this perturbation function.

Figure 4 shows the distribution of sample sizes for each algorithm involved in this experiment. As expected, the full algorithm received a reasonably large number of observations for the majority of instances, since it was involved in all comparisons. Another feature that becomes quite clear in this figure is that all versions in which the Task Move (TSK) neighbourhood function was suppressed also received much more observations, which suggests this neighbourhood as strongly influential and corroborates preliminary results obtained for Santos’ SA on the UPMSP [51, 45].

The results of the pairwise comparisons are shown in Figure 5

, which provides the confidence intervals (adjusted for a familywise error rate of

) for all comparisons between the full algorithm and the suppressed variants. As indicated by the relative importance attributed to the methods by the sampling method, the suppression of the TSK neighbourhood function led to strong degradations of performance, with mean percent losses around or more. Also of interest in this table is the last column, which shows that the desired statistical properties of the experiment - particularly the ability to detect effects at or above approximately - were actually obtained in the designed experiment.

Figure 5: Point estimates and confidence intervals (95% joint confidence level) for the mean of paired percent differences of performance for the 21 comparisons of algorithm variants against the full simulated annealing. Red intervals with a diamond marker indicate statistically significant results, while green intervals with circle markers indicate those where the null hypotheses were not rejected. Apart from the variants without the Task Move perturbation function (TSK), all others resulted only in minor performance degradation, if any.

Of possibly equal interest in terms of understanding the contributing factors to algorithm performance is the fact that the algorithm did not experience significant performance losses after the removal of some of its neighbourhood structures. Table 1 presents the results of the t-tests121212Graphical analysis of residuals indicated that the sampling distributions of means in all cases were sufficiently close to normality, allowing the use of t-tests for inference. on the mean of paired percent differences of performance. Examining the bottom rows of this table, it shows that the suppression of Swap (SWP), Switch (SWT) and Shift (SHF), individually as well as in pairs, resulted in no significant differences in performance. The width of the associated confidence intervals also suggests that any effects that may have gone undetected must be rather small.

Comparison p-value Confidence interval
Full no-TSK
Full no-2SH
Stop rejecting
Full no-DSW
Full no-SHF
Full no-SWT
Full no-SWP
Table 1: Summary of the 21 comparisons against the Full algorithm, using Holm’s step-down procedure.

To further explore these initial insights a follow-up experiment was performed. The objective of this second analysis was to obtain a more precise estimate of the effect (or lack thereof) of removing these three neighbourhood functions, individually as well as in pairs, from the pool of movements available to the algorithm. Further, we also added the variant generated by simultaneously removing these three functions, which was not present in the first experiment. This means that this follow-up experiment involved 7 variants being compared against the Full algorithm. The full -instance test set was employed for this follow-up experiment, which provides a mean power of to detect differences as small as . This was calculated by deriving the curve shown in Figure 6, using the same approach described in Section 4.

Figure 6: Power curve for the follow-up experiment, obtained by setting and iterating over effect sizes to calculate the corresponding mean power of the experiment to detect different values of .

Table 2 and Figure 7 summarise the results obtained for the follow-up experiment. Notice that even with the increased sensitivity of the tests to the mean of percent differences for the variants that had the SWP, SHF and SWT neighbourhood functions removed from the pool of possible movements the tests failed to detect any statistically significant differences. This result, coupled with the narrow confidence intervals, indicates that any effect that these neighbourhood functions may be having on the performance of the algorithm are probably very small, if they exist at all. Based on these results, algorithm designers and developers could expand this exploration even further by, for instance, (i) focussing on the structures that are being explored by the remaining neighbourhood functions, so as to gain better understanding of the problem’s landscape, or (ii) perform further exploration into instance subsets, to investigate whether the systematic lack of effect observed in this experiment also occurs, e.g., when conditioning the results on problem size. These further explorations are, however, outside the scope of the current paper.

Comparison p-value Confidence interval
Full no-SWP
Full no-SHF
Full no-SWT
Table 2: Summary of the 7 follow-up comparisons against the Full algorithm. None of the results was statistically significant at the joint confidence level.
Figure 7: Joint confidence intervals and sample sizes in the follow-up experiment. Notice that no comparison yielded statistically significant results, and that the narrow confidence intervals suggest that even if the neighbourhoods tested have some effect on the algorithm performance it is likely to be of minor consequence.

6 Conclusions

In this work we expanded methodology first introduced in [15] to enable sample size calculations for comparative experiments involving an arbitrary number of algorithms for the solution of a problem class of interest. The two aspects of sample size estimation for the comparative performance experiments, namely the number of instances and the amount of runs allocated to each algorithm on each instance, were addressed in a statistically principled manner.

The number of instances is determined based on the desired sensitivity (in terms of statistical power) for the detection of effects larger than a predefined minimally relevant effect size (MRES), under a given family-wise confidence level. Holm’s step-down procedure was employed to control the rate of false positives while enabling the design of experiments at desired levels of power for the best, worst, median or mean case. The proposed methodology is based on the assumptions that underlie the t-tests, but the results can be easily generalised to the most common non-parametric tests (e.g., Wilcoxon signed-ranks test) based on asymptotic relative efficiency, as discussed previously in [15].

The proposed iterative sampling method used to determine the number of runs of each algorithm on each instance is based on the interpretation of the standard error of estimation as a measure of precision, and generating samples aimed at minimising the total amount of runs necessary to obtain estimates with standard errors below predefined thresholds. Optimal sample size ratios were derived for simple and percent differences between the mean of all pairs of algorithms on any given instance, and a sampling heuristic was proposed that enables generalisation for other statistics that may be of eventual interest to researchers, such as the variance, algorithm rankings etc.

An example of application was provided in Section 5, using the publicly available R package CAISEr [14], which provides an easy to use implementation of the proposed methodology. The example focussed on the comparison of a state-of-the-art heuristic for the solution of a class of scheduling problems against 21 algorithm variants generated by suppressing different neighbourhood functions used to generate candidate solutions. The preliminary results indicated not only the most relevant neighbourhood function but also the apparent lack of effect of three of the six ones commonly employed. The first part of the experiment illustrated the ability of the proposed method to sample the algorithms so as to control the standard errors at each instance under the desired accuracy threshold. It also showcased the compliance between the MRES used for designing the experiment, and the effect sizes that the hypothesis tests were able to detect as statistically significant.

The results obtained in the first part of the experiment were used to design a follow-up investigation, in which the effects of the three neighbourhood structures flagged in the initial part as not statistically significant were further investigated. This follow-up experiment used the full available instance set, and was used to showcase the use of the proposed methodology in a fixed sample size situation. The results obtained not only corroborated those obtained in the first part, but further reinforced the tentative conclusion that the effect of using the three neighbourhood functions as part of the algorithm must be minor at best, which suggests a few possibilities of exploration for algorithm designers and developers.

6.1 Limitations and Possibilities

As in our previous work upon which the present paper was based, it is important to reinforce at that the proposed methodology is not the definitive way to compare algorithms. For instance, convergence analysis, algorithm reliability, and performance analysis conditional on problem characteristics are research questions that require different methodologies. The proposed methodology is simply an additional tool in the research arsenal, one that can provide answers to several common questions in the experimental comparison of algorithms.

One of the most straightforward extensions of the work presented in this paper is the incorporation of sequential analysis approaches [12, 3] to enable further reductions in the required number of instances, which is possible in cases where the actual effect sizes are substantially larger than the MRES defined in the experimental design. Extending the sample size calculations performed here to Bayesian alternatives [13]

is also relatively straightforward, and may allow the aggregation of existing knowledge into the comparison of algorithms, in the form of prior distributions. Bayesian methods may also allow for an easier incorporation of sequential analysis into the sample size calculation methodologies - since the incremental aggregation of observations in Bayesian inference does not require further significance corrections

[41] - and may even enable the incorporation of existing published evidence, which would be another step towards the development of meta-analytical tools for algorithmic research.


  • [1] del Amo, I.G., Pelta, D.A., González, J.R., Masegosa, A.D.: An algorithm comparison for dynamic optimization problems. Applied Soft Computing 12(10), 3176–3192 (2012)
  • [2] Barr, R.S., Golden, B.L., Kelly, J.P., Resende, M.G.C., Stewart, W.R.: Designing and reporting on computational experiments with heuristic methods. Journal of Heuristics 1(1), 9–32 (1995)
  • [3] Bartroff, J., Lai, T., Shih, M.C.: Sequential Experimentation in Clinical Trials: Design and Analysis. Springer (2013)
  • [4]

    Bartz-Beielstein, T.: New Experimentalism Applied to Evolutionary Computation.

    Ph.D. thesis, Universität Dortmund, Germany (2005)
  • [5] Bartz-Beielstein, T.: Experimental Research in Evolutionary Computation. Springer (2006)
  • [6] Bartz-Beielstein, T.: How to create generalizable results. In: J. Kacprzyk, W. Pedrycz (eds.) Handbook of Computational Intelligence. Springer (2015)
  • [7] Bartz-Beielstein, T., Chiarandini, M., Paquete, L., Preuss, M.: Experimental Methods for the Analysis of Optimization Algorithms. Springer (2010)
  • [8] Benavoli, A., Corani, G., Mangili, F., Zaffalon, M., Ruggeri, F.: A bayesian wilcoxon signed-rank test based on the dirichlet process.

    In: 30th International conference on machine learning, pp. 1026–1034 (2014)

  • [9] Birattari, M.: On the estimation of the expected performance of a metaheuristic on a class of instances: how many instances, how many runs? Tech. Rep. IRIDIA/2004-001, Université Libre de Bruxelles, Belgium (2004)
  • [10] Birattari, M.: Tuning Metaheuristics – A Machine Learning Perspective. Springer Berlin Heidelberg (2009)
  • [11] Birattari, M., Dorigo, M.: How to assess and report the performance of a stochastic algorithm on a benchmark problem: Mean or best result on a number of runs? Optimization Letters 1, 309–311 (2007)
  • [12] Botella, J., Ximénez, C., Revuelta, J., Suero, M.: Optimization of sample size in controlled experiments: The CLAST rule. Behavior Research Methods 38(1), 65–76 (2006)
  • [13] Calvo, B., Shir, O.M., Ceberio, J., Doerr, C., Wang, H., Bäck, T., Lozano, J.A.: Bayesian performance analysis for black-box optimization benchmarking. In: Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’19, pp. 1789–1797. ACM (2019)
  • [14] Campelo, F.: CAISEr: Comparison of Algorithms with Iterative Sample Size Estimation (2019). URL Package version 1.0.13
  • [15] Campelo, F., Takahashi, F.: Sample size estimation for power and accuracy in the experimental comparison of algorithms. Journal of Heuristics 25(2), 305–338 (2019)
  • [16]

    Carrano, E.G., Wanner, E.F., Takahashi, R.H.C.: A multicriteria statistical based comparison methodology for evaluating evolutionary algorithms.

    IEEE Transactions on Evolutionary Computation 15(6), 848–870 (2011)
  • [17] Chimani, M., Klein, K.: Algorithm engineering: Concepts and practice. In: Experimental Methods for the Analysis of Optimization Algorithms, pp. 131–158. Springer Berlin Heidelberg (2010)
  • [18] Coffin, M., Saltzman, M.J.: Statistical analysis of computational tests of algorithms and heuristics. INFORMS Journal on Computing 12(1), 24–44 (2000)
  • [19]

    Czarn, A., MacNish, C., Vijayan, K., Turlach, B.: Statistical exploratory analysis of genetic algorithms: the importance of interaction.

    In: Proceedings of the 2004 IEEE Congress on Evolutionary Computation. Institute of Electrical & Electronics Engineers (IEEE) (2004)
  • [20] Demšar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, 1–30 (2006)
  • [21] Derrac, J., García, S., Hui, S., Suganthan, P.N., Herrera, F.: Analyzing convergence performance of evolutionary algorithms: A statistical approach. Information Sciences 289, 41–58 (2014)
  • [22] Derrac, J., García, S., Molina, D., Herrera, F.: A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm and Evolutionary Computation 1(1), 3–18 (2011)
  • [23] Dunn, O.J.: Multiple comparisons among means. Journal of the American Statistical Association 56(293), 52–64 (1961)
  • [24] Eiben, A., Jelasity, M.: A critical note on experimental research methodology in EC. In: Proceedings of the 2002 IEEECongress on Evolutionary Computation. Institute of Electrical & Electronics Engineers (IEEE) (2002)
  • [25] Ellis, P.D.: The Essential Guide to Effect Sizes, 1st edn. Cambridge University Press (2010)
  • [26] Fieller, E.C.: Some problems in interval estimation. Journal of the Royal Statistical Society. Series B (Methodological) 16(2), 175–185 (1954)
  • [27] Franz, V.: Ratios: A short guide to confidence limits and proper use (2007).
  • [28] García, S., Fernández, A., Luengo, J., Herrera, F.: A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability. Soft Computing 13(10), 959–977 (2009)
  • [29] García, S., Fernández, A., Luengo, J., Herrera, F.: Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Information Sciences 180(10), 2044–2064 (2010)
  • [30] García, S., Molina, D., Lozano, M., Herrera, F.: A study on the use of non-parametric tests for analyzing the evolutionary algorithms’ behaviour: a case study on the CEC’2005 Special session on real parameter optimization. Journal of Heuristics 15(6), 617–644 (2008)
  • [31] Gelman, A., Hill, J.: Data analysis using regression and multilevel/hierarchical models. Cambridge university press (2006)
  • [32] Graham, R.L., Lawler, E.L., Lenstra, J.K., Rinnooy Kan, A.H.G.: Optimization and approximation in deterministic sequencing and scheduling: a survey. Annals of Discrete Mathematics 5, 287–326 (1979)
  • [33] Hansen, N., Tǔsar, T., Mersmann, O., Auger, A., Brockoff, D.: COCO: The experimental procedure (2016). URL
  • [34] Holm, S.: A simple sequentially rejective multiple test procedure. Scandinavian journal of statistics 6(2), 65–70 (1979)
  • [35] Hooker, J.N.: Needed: An empirical science of algorithms. Operations Research 42(2), 201–212 (1994)
  • [36] Hooker, J.N.: Testing heuristics: We have it all wrong. Journal of Heuristics 1(1), 33–42 (1996)
  • [37] Hurlbert, S.H.: Pseudoreplication and the design of ecological field experiments. Ecological Monographs 54(2), 187–211 (1984)
  • [38] Jain, R.K.: The Art of Computer Systems Performance Analysis. John Wiley and Sons Ltd (1991)
  • [39] Johnson, D.: A theoretician’s guide to the experimental analysis of algorithms. In: M. Goldwasser, D. Johnson, C. McGeoch (eds.) Data Structures, Near Neighbor Searches, and Methodology: Fifth and Sixth DIMACS Implementation Challenges, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, vol. 59, pp. 215–250. American Mathematical Society (2002)
  • [40] Krohling, R.A., Lourenzutti, R., Campos, M.: Ranking and comparing evolutionary algorithms with hellinger-TOPSIS. Applied Soft Computing 37, 217–226 (2015)
  • [41] Kruschke, J.K.: Doing Bayesian Data Analysis: A Tutorial with R and BUGS, 1st edn. Academic Press, Inc. (2010)
  • [42] Lawler, E.L., Lenstra, J.K., Rinnooy Kan, A.H., Shmoys, D.B.: Sequencing and scheduling: Algorithms and complexity. In: Handbooks in Operations Research and Management Science, vol. 4, chap. 9, pp. 445–522. Elsevier (1993)
  • [43] Lazic, S.E.: The problem of pseudoreplication in neuroscientific studies: is it affecting your analysis? BMC Neuroscience 11(5), 397–407 (2010)
  • [44] Lenth, R.V.: Some practical guidelines for effective sample size determination. The American Statistician 55(3), 187–193 (2001)
  • [45] Maravilha, A.L., Pereira, L.M., Campelo, F.: Statistical characterization of neighborhood structures for the unrelated parallel machine problem with sequence-dependent setup times. In preparation
  • [46] Mathews, P.: Sample Size Calculations: Practical Methods for Engineers and Scientists, 1st edn. Matthews Malnar & Bailey Inc. (2010)
  • [47] McGeoch, C.C.: Feature article—toward an experimental method for algorithm simulation. INFORMS Journal on Computing 8(1), 1–15 (1996)
  • [48] Millar, R., Anderson, M.: Remedies for pseudoreplication. Fisheries Research 70, 397–407 (2004)
  • [49] Montgomery, D.C.: Design and Analysis of Experiments, 8th edn. John Wiley & Sons (2013)
  • [50] Montgomery, D.C., Runger, G.C.: Applied Statistics and Probability for Engineers, 6th edn. Wiley (2013)
  • [51] Pereira, L.M.: Análise de Estruturas de Vizinhança para o Problema de Sequenciamento de Máquinas Paralelas Não Relacionadas com Tempos de Preparação . Master’s thesis, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil (2019). URL In Portuguese
  • [52] R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2017). URL
  • [53] Ridge, E.: Design of Experiments for the Tuning of Optimisation Algorithms. Ph.D. thesis, The University of York, UK (2007)
  • [54] Santos, H.G., Toffolo, T.A., Silva, C.L., Berghe, G.V.: Analysis of stochastic local search methods for the unrelated parallel machine scheduling problem. International Transactions in Operational Research 00 (2016)
  • [55] Shaffer, J.P.: Multiple hypothesis testing. Annual review of psychology 46(1), 561–584 (1995)
  • [56] Sheskin, D.J.: Handbook of Parametric and Nonparametric Statistical Procedures. Taylor & Francis (2011)
  • [57] Sörensen, K., Sevaux, M., Glover, F.: A history of metaheuristics. In: Handbook of Heuristics, pp. 1–18. Springer International Publishing (2018)
  • [58] Vallada, E., Ruiz, R.: A genetic algorithm for the unrelated parallel machine scheduling problem with sequence dependent setup times. European Journal of Operational Research 211(3), 612–622 (2011)
  • [59] Yuan, B., Gallagher, M.: Statistical racing techniques for improved empirical evaluation of evolutionary algorithms. Parallel Problem Solving From Nature - PPSN VIII 3242, 172–181 (2004)
  • [60] Yuan, B., Gallagher, M.: An improved small-sample statistical test for comparing the success rates of evolutionary algorithms. In: Proceedings of the 11th Annual conference on Genetic and evolutionary computation - GECCO09. Association for Computing Machinery (ACM) (2009)