1 Introduction
Experiments are commonplace in SE [sjoberg2005survey, stol2018abc, kitchenham2015evidence]. Still, two main shortcomings usually have an impact on their suitability for evaluating the effectiveness of SE technologies [kitchenham2015evidence]: (1) sample sizes are usually small [dybaa2006systematic], and (2) results are only generalizable to the configuration of the experimental settings [wohlin2012experimentation].
With the aim of increasing the reliability and generalizability of individual experiment results, SE researchers are working on building groups of experiments by means of replication (conducting groups of replications) [munoz2010family, canfora2005family, mouchawrab2011assessing, kosar2012program, abrahao2013assessing]. By collaborating with each other (e.g., sharing experimental material, and assisting each other during the design, execution and/or analysis phase of experiments, etc.), researchers are able to increase the sample size, as well as evaluate the effects of the treatments under different settings. This should increase the reliability of results and their generalizability to different contexts and populations [basili1999building].
Groups of replications provide some advantages for evaluating the effectiveness of SE treatments [cooper2009relative, stewart2002ipd, lyman2005strengths, debray2015get, biondi2016umbrella]: (1) access to raw data provides for the use of consistent preprocessing and analysis techniques to analyze each experiment, thereby increasing the reliability of joint conclusions; (2) researchers conducting groups of replications may limit the changes made across the replications in order to increase the internal validity of joint conclusions; (3) joint conclusions are not affected by the detrimental effects of publication bias, as groups of replications do not rely on already published results; (4) consistent measurement instruments can be used across replications in order to measure participant characteristics with identical methods and scales and, possibly, stratify the results according to such characteristics.
According to a systematic mapping study (SMS) that we undertook [adrisms], five techniques are being applied to aggregate groups of SE replications (listed from most to least used): narrative synthesis, aggregated data, mega-trial individual participant data, stratified individual participant data, and aggregation of p-values. According to the literature of mature experimental disciplines like medicine and pharmacology [brown2014applied, whitehead2002meta], some aggregation techniques are more suitable than others depending on the characteristics of the group of replications. We observed similar findings when applying the aggregation techniques to analyze a stereotypical group of SE replications [santos2018comparing] (a small group of replications with small and dissimilar sample sizes, opportunistic participant recruitment, different types of subjects, identical experimental designs and response variable operationalizations, and heterogeneous results [adrisms]). The aggregation technique applied had a large impact on the reliability of joint conclusions [santos2018comparing]. In view of this, the aim of this study is to answer the following main research question:

How should groups of SE replications be aggregated?
To answer this question, we performed a literature review in mature experimental disciplines (medicine and pharmacology) to learn about the techniques applied to aggregate replication results. Along the way, we also noticed some (more profound than expected) differences between groups of replications in SE and medicine. This led to a subsequent literature review of studies with similar circumstances to SE in the fields of medicine, social research, educational research and econometrics. In view of the results, we tailored a procedure, with a set of embedded guidelines, to facilitate the aggregation of results in groups of SE replications. We apply the proposed procedure to analyze a group of replications in order to illustrate its use.
This article extends our prior work (i.e., [adrisms, santos2018comparing]) in several ways: (1) by identifying differences between the characteristics of groups of replications in the fields of SE and medicine and how such differences may impact the aggregation techniques used to analyze groups of SE replications; (2) by proposing a step-by-step procedure to analyze groups of SE replications that takes into account such differences and the typical limitations regarding joint data analysis of groups of SE replications; (3) by providing a hands-on tutorial with mathematical formulae and R code snippets to analyze a stereotypical group of SE replications; (4) by providing a discussion and further pointers to references indicating how to analyze groups of replications with different experimental designs.
The takeaway messages of this research are:

Random-effects models should be preferred over fixed-effects models, especially because many variables may impact SE experiment results, changes are frequent across SE replications, and heterogeneous results are commonplace. Differences between groups of replications in medicine and SE make it inappropriate to directly apply medical guidelines to the analysis of groups of SE replications. In particular, the application of fixed-effects models and traditional statistical thresholds (e.g., the traditional p-value threshold of 0.05) in order to detect heterogeneity and moderators does not appear to provide guarantees in SE.

Avoid narrative synthesis [popay2006guidance], aggregation of p-values [borenstein2011introduction] and mega-trial individual participant data (IPDMT) [field2012discovering], and use aggregated data (AD) [kitchenham2004procedures] and stratified individual participant data (IPDS) [simmonds2005meta] in tandem instead. AD and IPDS appear to be the most suitable techniques for analyzing groups of SE replications. AD provides intuitive visualizations to convey joint results and straightforward statistics to quantify heterogeneity. IPDS increases the interpretability of joint results and offers greater statistical flexibility.

Strive to identify both experiment-level moderators and participant-level moderators (moderators are variables that cause an effect to differ across contexts [krein2016multi]). AD and IPDS appear to be good at identifying experiment-level moderators. IPDS appears to be preferable for identifying participant-level moderators.

Use the following four-step procedure to analyze a stereotypical group of SE replications: (1) describe the characteristics of the participants using appropriate descriptive statistics and visualizations; (2) use consistent statistical techniques to preprocess, describe and analyze the data of each replication; (3) select suitable aggregation techniques to provide joint conclusions; and (4) conduct exploratory analyses to identify experiment-level moderators (characteristics of the experiments that may be impacting the results, such as the programming language or experimental session length of the experiments) and participant-level moderators (characteristics of the participants that may be impacting the results, such as participant programming or Java experience).
The paper is organized as follows. In Section 2 we provide the background of this study. In Section 3 we outline the research method that we followed to elaborate the proposed analysis procedure, and present the stereotypical group of replications that we will use to illustrate its application. In Section 4 we show the differences between groups of replications in medicine and SE. Then, in Section 5 we outline the most common limitations with regard to joint data analysis of groups of SE replications. In Section 6 we provide an overview of the four-step analysis procedure that we propose to analyze groups of SE replications. In Sections 7, 8, 9 and 10 we detail each of the steps of the procedure that we apply to the illustrative group of replications. We outline the threats to validity of this study in Section 11, and provide further pointers for the analysis of groups of SE replications with different experimental designs in Section 12. We relate our research to other SE research in Section 13. Section 14 states our conclusions.
2 Background
2.1 Replication
The relevance of replication has been widely acknowledged in SE [shull2008role, kitchenham2008role]. Replication has been coupled in SE with the concept of applying a similar experimental procedure to the one applied in a previous baseline experiment on a different sample of participants to generate new raw data [da2014replication, bezerra2015replication].
However, two different concepts need to be set apart: replication and reproducibility—or reproduction of results. Researchers who want to reproduce results apply the same analysis procedure followed by the original experimenters to the original raw data with the aim of getting the same results [amann2013software]. Therefore, reproduction has to do with reanalysis of raw data from the baseline experiment [amann2013software]. Replications generate new raw data —and results— that can be later combined with the outcomes of other replications to provide joint conclusions. This research focuses on replication.
Different types of replications can be conducted. According to Gomez et al. [gomez2014understanding], replication types vary along a continuum: from exact replications, following exactly the same experimental configurations as their baseline experiments, to conceptual replications, where the only thing that the replications have in common are the baseline experiment research questions and objectives. Somewhere in between these two extremes lie other replication types, where different elements of the baseline experiment configurations remain unchanged [baldassarre2014replication].
Laboratory packages were proposed to ease replication across research groups and institutions [shull2004knowledge]. Laboratory packages contain relevant information needed to replicate an experiment [solari2017content]. With a laboratory package, an external group of researchers can reproduce the settings of a baseline experiment, and gather new raw data from a different sample of participants. In addition to sharing laboratory packages, experimenters conducting groups of replications may also collaborate with each other through face-to-face or Internet meetings to plan, design, execute and/or analyze their experiments [juristo2013communication]. This close collaboration may increase the chances of getting similar results across the replications, as experimenter interaction is expected to assure more similar experimental procedures. This should increase the reliability of joint conclusions [borenstein2011introduction].
Despite even the best efforts to conduct exact replications, conflicting results may still pile up [borenstein2011introduction, cumming2013understanding]. It is then that first-hand knowledge of experiment configurations and participant characteristics plays a central role. If such information is known, it is easier to hypothesize on the variables that may be behind divergent results. In turn, this is useful for hypothesizing on experiment-level or participant-level moderators that may be influencing the results. It is this flexibility that leads this research to focus on groups of replications.
2.2 Groups of Replications
We conducted an SMS with the aim of learning what aggregation techniques are being used to analyze groups of SE replications [adrisms]. We identified a total of 39 groups of replications that share certain characteristics:

They are either conducted by individual researchers or by groups of researchers working in close collaboration across one or multiple research groups, universities and/or institutions. As such, researchers have access to the raw data of all the replications.

They are formed opportunistically. In other words, a priori plans are not typically set for building groups of replications; each replication comes into being individually without a defined protocol at the inception of the group. As a consequence, replications are aggregated, generally after having been published individually, to either increase the reliability of the findings or to elicit moderators (e.g., assessing how the technologies perform for different types of subjects).

Most groups of replications are composed of three to five replications evaluating the performance of a binary treatment (e.g., Method A vs. Method B) on a continuous outcome of interest (e.g., productivity measured as LOC per hour). Replications are usually small (out of consistency with other SE authors [kitchenham2016robust], small sample size refers throughout this article to experiments involving fewer than 30 subjects), have dissimilar sample sizes, evaluate the performance of the treatments for different types of subjects (e.g., professionals vs. students), have identical experimental designs and response variable operationalizations (therefore, internal and construct threats to validity are not mitigated, nor can new ones be identified), and provide heterogeneous results.
2.3 Aggregation Techniques
Five aggregation techniques have been used in groups of SE replications [adrisms]: narrative synthesis, AD, IPDMT, IPDS and aggregation of p-values. Thirty-five percent of the groups of replications use more than one aggregation technique. However, the techniques usually serve different purposes: one for providing joint conclusions, and a different one for eliciting moderators. Thus, the groups of SE replications never compare the results achieved with different aggregation techniques for the same objective.
In the following, we review the aggregation techniques used in groups of SE replications, starting with the most popular and ending with the least popular (the respective usage percentages of the aggregation techniques sum to more than 100% because 16 groups of replications used more than one aggregation technique). To do this, we rely on the results of the SMS that we undertook [adrisms].
Narrative synthesis was used to analyze 46% (18 out of 39) of the groups of replications [adrisms]. In narrative synthesis (also known as semi-quantitative aggregation [borenstein2011introduction]), replication results (in either p-value or effect size terms) are combined textually to provide a summary of results. For example, it is common in SE to analyze each replication individually using a t-test or a Wilcoxon test and to then provide a textual summary of results of the replications as follows: "…while the results are statistically significant/large/negative in experiments X, Y and Z, they are not in experiment M. This difference of results could have been caused by H, N or K moderator variable…".
AD, commonly known as meta-analysis of effect sizes in SE [kitchenham2004procedures], was used to analyze 38% (15 out of 39) of the groups of replications [adrisms]. In AD, all replication effect sizes are first computed from summary statistics (such as means, variances or sample sizes) or from experiment statistical test results, and then combined by means of a meta-analysis model [borenstein2011introduction]. Two different types of meta-analysis models can be fitted: fixed-effects models or random-effects models [borenstein2011introduction]. Fixed-effects models assume that all the experiments estimate a common population effect size and, thus, differences across experiment results arise due to the natural variation of results (i.e., due to the different participant samples in the experiments). On the other hand, random-effects models assume that differences across experiment results arise not just from the natural variation of results, but also from a real heterogeneity of effects. In other words, random-effects models estimate a distribution of population effect sizes rather than a common population effect size.
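The contrast between the two model types can be made concrete with a minimal sketch. It assumes inverse-variance weighting for the fixed-effects model and the DerSimonian-Laird estimator of the between-experiment variance for the random-effects model; the text above does not name a specific estimator, so that choice, the function names and the input values are illustrative assumptions rather than the procedure used in this study.

```python
import math

def fixed_effect(effects, variances):
    """Inverse-variance fixed-effects pooled estimate and its standard error."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return pooled, se

def random_effects_dl(effects, variances):
    """Random-effects estimate using the DerSimonian-Laird tau^2 (an assumed choice).

    Returns (pooled estimate, standard error, tau^2, Cochran's Q, I^2 in percent).
    """
    weights = [1.0 / v for v in variances]
    pooled_fe, _ = fixed_effect(effects, variances)
    # Cochran's Q: weighted squared deviations from the fixed-effects estimate.
    q = sum(w * (y - pooled_fe) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = sum(weights) - sum(w * w for w in weights) / sum(weights)
    tau2 = max(0.0, (q - df) / c)  # between-experiment variance, truncated at 0
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    # Random-effects weights add tau^2 to each experiment's sampling variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2, q, i2

# Hypothetical effect sizes and sampling variances for four replications.
effects = [0.10, 0.50, -0.20, 0.40]
variances = [0.20, 0.10, 0.15, 0.03]
print(fixed_effect(effects, variances))
print(random_effects_dl(effects, variances))
```

When the estimated between-experiment variance is zero, the two estimates coincide; as it grows, the random-effects weights become more equal across experiments and the standard error of the joint estimate widens.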
IPDMT was used to analyze 33% (13 out of 39) of the groups of replications [adrisms]. In IPDMT, the raw data of all the experiments are analyzed jointly as if the raw data came from one big experiment. Since IPDMT depends on the availability of raw data, researchers typically first analyze each replication individually [field2012discovering] to later perform IPDMT by pooling and analyzing the raw data of all the replications applying the same statistical test that was used for performing the individual analyses.
IPDS was used to analyze 15% (6 out of 39) of the groups of replications [adrisms]. In IPDS all experiment raw data are analyzed jointly by acknowledging the experiment where the raw data come from. As in AD, two types of IPDS models can be fitted: fixed-effects models and random-effects models. Commonly used fixed-effects models are linear regression models (e.g., ANOVA) with two factors: Experiment and Treatment [whitehead2002meta]. Commonly used random-effects models are linear mixed models with two factors: Experiment and Treatment [whitehead2002meta].
Aggregation of p-values was used to analyze 7% (3 out of 39) of the groups of replications [adrisms]. In aggregation of p-values, one-sided p-values from all replications are pooled together by means of a statistical method such as Fisher's or Stouffer's method [borenstein2011introduction]. Note that p-values can be either available directly (i.e., previously reported) or computed by researchers (i.e., the raw data available from each replication are first analyzed to calculate the p-values).
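As a minimal sketch of how one-sided p-values can be pooled, the following implements the textbook forms of Fisher's and Stouffer's methods (unweighted) using only the Python standard library. The function names and the input p-values are hypothetical, and each p-value is assumed to lie strictly between 0 and 1.

```python
import math
from statistics import NormalDist

def fisher_combined(p_values):
    """Fisher's method: X = -2 * sum(ln p_i) follows a chi-square with 2k df under H0."""
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # X / 2
    # For even degrees of freedom (2k), the chi-square survival function is
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total

def stouffer_combined(p_values):
    """Stouffer's method: Z = sum(z_i) / sqrt(k), where z_i = Phi^{-1}(1 - p_i)."""
    norm = NormalDist()
    z = sum(norm.inv_cdf(1.0 - p) for p in p_values) / math.sqrt(len(p_values))
    return 1.0 - norm.cdf(z)

# Hypothetical one-sided p-values from four replications.
p_values = [0.04, 0.20, 0.45, 0.08]
print(fisher_combined(p_values), stouffer_combined(p_values))
```

Both functions return a single joint p-value and nothing else, which illustrates why this technique conveys less information than an effect size estimate with a measure of heterogeneity.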
3 Research Method
We began our research by studying the recommendations and guidelines provided in medicine and pharmacology to analyze and report groups of replications (i.e., multicenter clinical trials [friedman1998fundamentals]). We resorted to the medical and pharmacological literature because of their longstanding experimental tradition and because SE researchers have previously looked to these disciplines for advice on how to analyze individual experiments [wohlin2012experimentation, juristo2013basics], how to conduct systematic literature reviews [kitchenham2004procedures] or how to conceptualize new research paradigms, such as evidence-based software engineering [kitchenham2004evidence].
In particular, we began by studying the recommendations and guidelines promoted by the Cochrane Association [higgins2008cochrane], the US Food and Drug Administration [anello2005multicentre], the guidelines for analyzing multicenter clinical trials (MCTs) provided by the International Conference on Harmonization [lewis1999statistical], the PRISMA-IPD statement of the EQUATOR Network framework [stewart2015preferred], and the CONSORT statement for reporting randomized controlled trials [schulz2010consort]. These guidelines for analyzing MCTs are mature and have been widely used. Some have been in use for over 20 years [lewis1999statistical] and others have been referenced thousands of times [schulz2010consort].
The above guidelines contain a number of terms (e.g., multicenter, treatment-by-center interaction, etc.) and concepts (e.g., individual participant data, aggregated data, etc.) that we used to drive a subsequent literature review. Throughout the literature review, we came across numerous references to the statistical techniques that can be used to analyze MCTs [debray2015get, fisher2011critical, simmonds2005meta, pincus2011methodological, feaster2011modeling], meta-analysis [borenstein2011introduction, whitehead2002meta, chen2013applied], hierarchical linear models [brown2014applied, hox2010multilevel, finch2014multilevel, luke2004multilevel], and study protocols [de2017rational].
After studying the above references, we identified some differences between MCTs and groups of SE replications. According to the meta-analysis and hierarchical linear models literature that we examined, these differences had statistical consequences with respect to results aggregation. This led to another literature review where we discovered studies evaluating data under circumstances more typical of SE: a small number of replications with small and unbalanced sample sizes. We found a number of studies from medicine [chu2011comparing, pickering2007analysis], social research [maas2005sufficient], educational research [mcneish2016effect] and econometrics [bell2015explaining] studying exactly this. In Section 4 we outline the differences between groups of SE replications and MCTs, and the statistical consequences of such differences.
After this second literature review, where we learned about the statistical consequences of the differences between groups of SE replications and MCTs, we revisited the groups of replications that we identified during the SMS reported in [adrisms]. We compiled a list of four common limitations with regard to joint data analysis in groups of SE replications. We outline these limitations in Section 5.
We developed guidelines to tackle these limitations. We created a four-step analysis procedure with an identical structure to those commonly followed in medicine [higgins2008cochrane, anello2005multicentre, stewart2015preferred, schulz2010consort, lewis1999statistical]. We embedded the guidelines within Steps 3 and 4 of the analysis procedure that we propose (i.e., providing joint conclusions and identifying moderator effects, respectively). Before going any further, however, we should clarify that we do not aim to propose an all-encompassing cookbook procedure to analyze all groups of SE replications. Our procedure can be seen as a set of minimum criteria needed to analyze a stereotypical group of SE replications (i.e., with the characteristics outlined in Section 2). In Section 6 we discuss the steps of the proposed analysis procedure, the objectives of each step, the medical guidelines recommending the respective steps, and how we adapted each step to SE by acknowledging differences between MCTs and groups of SE replications, and their common limitations with regard to joint data analysis.
Finally, we outline each of the steps of the proposed procedure and illustrate its use by applying it to a representative group of SE replications.
The chosen group of replications focuses on test-driven development (TDD). One research question drives the group of replications: How does TDD affect quality compared to iterative test-last (ITL)?
The main independent variable across all replications is the development approach, with TDD and ITL as treatments. ITL is defined as the reverse-order approach of TDD (following Erdogmus et al. [erdogmus2005effectiveness]).
All experiments have an identical experimental design: an AB repeated-measures design [wohlin2012experimentation] (where subjects first apply ITL, and then TDD at a later date). The dependent variable within the group of replications is quality. We measured quality as the percentage of test cases that successfully pass from a battery of test cases that we built for measuring participant solutions. Specifically, we measured quality as follows:
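The expression below is a sketch consistent with the verbal definition above (the percentage of test cases in the battery that pass); the symbol name QLTY and the exact notation are assumptions made for illustration.

```latex
\[
\mathit{QLTY} \;=\; \frac{\#\,\text{test cases that pass}}{\#\,\text{test cases in the battery}} \times 100
\]
```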
A total of four replications were run: three at a multinational online security products company (i.e., F-Secure H, F-Secure K and F-Secure O), and one at UPV, a Spanish university. Six, 11, 7 and 33 subjects participated in each replication, respectively.
The characteristics of this group of replications are typical in SE: a small number of replications (i.e., four replications, the median number of replications within SE groups) evaluating the performance of a binary treatment (i.e., ITL vs. TDD) on a continuous outcome of interest, with identical response variables and experimental designs, small and dissimilar sample sizes, different types of subjects (i.e., professionals and students) with common knowledge (i.e., weeklong training on slicing, unit testing, ITL and TDD) and development culture (test last), using the same environment (i.e., Java, JUnit, Eclipse) and showing heterogeneous results. Although the data are measured on a percentage scale (i.e., 0 to 100%), the approach taken throughout this study is to consider the data as continuous, as the total number of test cases is large (i.e., greater than 30 [crawley2012r]).
4 Differences between Groups of SE Replications and MCTs
Table I: Differences between MCTs and groups of SE replications, and their statistical consequences.
Multicenter clinical trials (MCTs) | Groups of SE replications | Statistical consequence
✓ Identical experimental configurations | ✗ Opportunistic changes across replications | Risk of heterogeneity
✓ Rigid & random participant selection criteria | ✗ Convenience sampling | Risk of heterogeneity
✓ Balanced & adequate sample sizes | ✗ Unbalanced & small sample sizes | Low precision & power of fixed effects
✓ Appropriate overall sample size | ✗ Small overall sample size | Inability to detect moderators
After studying the guidelines and recommendations on analyzing MCTs in medicine and pharmacology, we were skeptical about their direct application for analyzing groups of SE replications due to some relevant differences between MCTs and groups of SE replications.
For example, MCTs use controlled experiments (participants are randomly assigned to treatments), while quasi-experiments (where assignment to treatments is non-random) are common in SE. Quasi-experimental designs usually create less compelling support for counterfactual inferences [william2002experimental]. Quasi-experimental control groups may differ from the treatment condition in many systematic (non-random) ways other than the presence of the treatment. Many of these differences could be alternative explanations for the observed effect and should be ruled out by researchers in order to get a more valid estimate of the treatment effect. By contrast, with random assignment, researchers do not have to think about all these alternative explanations.
Additionally, MCTs tend to have detailed protocols specifying the experimental settings under which all the experiments are to be run and the set of procedures that are to be strictly adhered to during the execution of the experiments [whitehead2002meta, bero1995cochrane, lewis1999statistical]. By contrast, groups of SE replications are usually created ad hoc [adrisms]. In fact, changes are usually made opportunistically across the replications, which are then aggregated to either provide joint results or to investigate moderators. Unfortunately, the changes typically made across SE replications may result in an unexpectedly large variation of results (i.e., statistical heterogeneity of results [borenstein2011introduction]). In practical terms, if statistical heterogeneity materializes, then this is taken as evidence that the treatments may be performing differently across the experiments [borenstein2011introduction]. It may be misleading in this case to apply fixed-effects models to aggregate the results [whitehead2002meta, borenstein2011introduction], as is typically done in medicine [whitehead2002meta, anello2005multicentre, phillips2003e9]. This is because, unlike random-effects models, fixed-effects models provide a common effect rather than a distribution of effects as a joint conclusion [whitehead2002meta, borenstein2011introduction]. The joint conclusions of fixed-effects models may be especially misleading if results reverse across the experiments [whitehead2002meta, borenstein2011introduction], as the averaged effect may not, ultimately, be representative of all the experiments.
We put the absence of protocols in groups of SE replications down to experimental research in SE being less mature than in medicine. Therefore, we regard this as a temporary difference since we expect SE researchers to be convinced by the advantages of developing standardized protocols (e.g., increased internal validity of results [petitti2000meta, friedman1998fundamentals]) and adopt them when conducting groups of replications.
Besides, stringent and random selection criteria are typically set to recruit the participants in MCT experiments designed to assess the efficacy of new treatments, like specific blood pressure parameters, lack of comorbid conditions, etc. [bero1995cochrane, lewis1999statistical, schulz2010consort]. This ensures consistent results across sites and helps to minimize the risk of confounding effects impacting results [whitehead2002meta]. By contrast, SE replications rarely set stringent selection criteria for recruiting participants. Instead, participants are usually recruited using convenience sampling. Unfortunately, the different characteristics of the participants across the experiments may result in statistical heterogeneity. Once again, this is an obstacle to the application of fixed-effects models for analyzing groups of SE replications. We think that there are two grounds for the absence of strict recruiting criteria in groups of SE replications. First, SE experimental research is less mature and has not yet developed standardized measurement instruments to classify, and include or exclude, participants in SE experiments [falessi2017empirical]. Second, there are differences between the domains of SE and medicine, where SE researchers rarely have the luxury of dismissing participants or an ample array of potential participants. Since we do not expect this to change in the short term, we consider this difference as permanent.
Also, MCTs commonly undertake a planning phase where both participant-level and experiment-level sample sizes are calculated [bero1995cochrane, anello2005multicentre, schulz2010consort]. Participant-level sample sizes define how many subjects are needed, whereas experiment-level sample sizes define how many replications are needed if it is only plausible to allocate X subjects to each experiment. This ensures balanced sample sizes across the experiments and proper statistical power for detecting true population effect sizes. By contrast, sample size estimation phases are rarely undertaken in SE (only one group of SE replications [laitenberger2001internally] provided any sample size requirements calculation [adrisms]). Instead, a small number of replications with small and dissimilar (i.e., convenient) sample sizes are usually run and then aggregated. This sample size estimation phase is feasible within the broader population to which medical interventions apply, as opposed to SE experiments where the population is more restricted. This more constrained sampling frame may prevent groups of SE replications from satisfying statistical power requirements. This places several limitations on the use of fixed-effects models to analyze groups of SE replications. First, fixed-effects models fit many parameters. For example, parameter estimates may be potentially biased in ANOVA models including an experiment factor, where a different parameter is fitted for each experiment [whitehead2002meta], due to experiment-level sample size limitations. Second, groups of SE replications tend to have dissimilar sample sizes. This may prevent fixed-effects models from achieving the statistical power to detect true treatment effects [chu2011comparing, localio2001adjustments]. We think that the failure in groups of SE replications to pay attention to sample size calculations may be due both to SE experimental research being less mature than medicine and to different participant recruitment opportunities between the SE and medicine domains. Therefore, we regard it as a permanent difference.
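The two planning quantities mentioned above (how many subjects are needed, and how many replications are needed when only X subjects can be allocated to each experiment) can be sketched as follows. The sketch assumes a two-sided comparison of two independent groups with a standardized effect size and uses the standard normal approximation; it ignores within-subject designs and between-experiment heterogeneity, so it should be read as a rough illustration rather than the calculation performed in any of the cited guidelines. All function names and input values are hypothetical.

```python
import math
from statistics import NormalDist

def participants_per_group(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided, two-group test.

    n = 2 * ((z_{1 - alpha/2} + z_{power}) / d)^2, with d the standardized effect size.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_power = z.inv_cdf(power)
    return math.ceil(2.0 * ((z_alpha + z_power) / effect_size) ** 2)

def replications_needed(total_participants, participants_per_replication):
    """Number of replications needed if each one can only host a fixed number of subjects."""
    return math.ceil(total_participants / participants_per_replication)

# Hypothetical planning scenario: medium effect (d = 0.5), 20 subjects per replication.
n = participants_per_group(0.5)
print(n, replications_needed(2 * n, 20))
```

Under these assumptions, a medium effect requires 63 subjects per group, which corresponds to seven replications of 20 subjects each; this is well above the three to five small replications that typically make up a group in SE.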
Finally, the small sample sizes and number of replications in groups of SE replications also impact the detectability of moderators [kraemer2000pitfalls, whitehead2002meta, fisher2011critical]. In particular, larger sample sizes are usually required to detect moderators than to detect treatment effects [kraemer2000pitfalls]. Therefore, it may not be feasible in groups of SE replications to get p-values lower than 0.05 in order to claim that there are statistically significant moderator effects. This is especially worrying in the case of experiment-level moderators, as, in most cases, only a few data points are available for moderator detection. It may not be feasible to identify moderators in groups of SE replications unless statistical significance thresholds are adapted (e.g., by increasing them from 0.05 to 0.10 [whitehead2002meta]). Unfortunately, this comes at the cost of a larger proportion of statistical errors [whitehead2002meta, quinn2002experimental]. In our opinion, the inability of groups of SE replications to detect moderators may be due both to SE experimental research being less mature than medicine (as, after all, moderators could be identified if sample size calculations were made [ensor2018simulation]) and to differences between SE and medicine (again, in terms of resources). Therefore, we regard it as a permanent difference.
Table I summarizes the differences between MCTs and groups of SE experiments, and the statistical consequences of such differences for joint data analysis.
5 Limitations of Groups of SE Replications
We designed an analysis procedure that is identical to the steps followed in medicine and pharmacology to analyze and report MCTs [higgins2008cochrane, anello2005multicentre, lewis1999statistical, stewart2015preferred, schulz2010consort, de2017rational]. We adapted this procedure to groups of SE replications taking into account their typical characteristics in order to overcome common limitations with regard to joint data analysis. After revisiting the groups of SE replications that we identified in our SMS [adrisms], we came up with a list of four major limitations regarding joint data analysis practices. In the following, they are reviewed one by one.
Limitation 1: Fifty-three percent of the groups of SE replications use either narrative synthesis or aggregation of p-values to aggregate replication results [adrisms]. Even though we agree with the use of narrative synthesis and aggregation of p-values when the raw data and summary statistics are unavailable or when response variables are incompatible [popay2006guidance, rodgers2009testing], we are skeptical about their use when the raw data are available and the replications have identical designs and response variables [adrisms]. Ultimately, access to the raw data may offer the possibility of providing more informative joint conclusions than just a textual summary of results (narrative synthesis) or a joint p-value (aggregation of p-values).
Limitation 2: Thirty-three percent of the groups of SE replications were analyzed by means of IPDMT [adrisms]. This technique may provide misleading results if participants are more similar within replications than between replications (e.g., when the replications are either run with professionals or with students), or if sample sizes are unbalanced across the treatments and/or replications (e.g., if the replications have different sample sizes and there are missing data).
Limitation 3: Thirty-eight percent of the groups of SE replications were analyzed by means of AD with standardized effect sizes (such as Cohen’s d or the Pearson correlation) [adrisms]. Even though AD with standardized effect sizes can be used to aggregate experiment results in systematic literature reviews (as access to summary statistics or to standardized effect sizes may be guaranteed), we question its use alone when the raw data are available and replications have identical response variable operationalizations (see Footnote 8). This is because standardized effect sizes overlook the response variable scales, and, thus, may affect the interpretability of joint conclusions. For instance, how relevant is a joint Cohen’s d of 0.3? On the contrary, if the replications have identical response variable operationalizations and access to the raw data is guaranteed, as is typically the case in groups of SE replications [adrisms], it may be possible to apply IPDS [stewart2002ipd, lyman2005strengths, debray2015get] and, thus, interpret results in natural units. This practice has already been applied in SE [runeson2011comparative, krein2016multi, ricca2014assessing] and can lead to more informative joint conclusions.
Footnote 8: Groups of SE replications seldom justify the selected aggregation technique [adrisms]. Nevertheless, there appears to be evidence that AD and stratified IPD tend to provide similar results with regard to the provision of joint conclusions [lyman2005strengths, smith2011individual]. The identification of participant-level moderators, however, is a different matter, where stratified IPD comes out on top [fisher2017meta].
Limitation 4: SE researchers rarely acknowledge the limitations of the exploratory analyses that they undertake for identifying moderators. Besides, they usually identify moderators textually (e.g., "as the results are 'statistically significant/positive' in Experiment 1 and not in Experiment 2, this difference between the results could be due to moderator variable X" [adrisms]).
6 Procedure for Analyzing Groups of Replications
We propose the adoption of a four-step procedure to analyze groups of SE replications.
Step 1. Describe the participants. We propose to start by describing the participants of the replications. The objectives of this step are not only to describe the population to which the results should be generalized, but also to suggest plausible sources of heterogeneity that may arise when providing joint conclusions [lewis1999statistical, schulz2010consort]. This step can be further broken down into two main activities:

As is typical in MCTs, we propose to start by providing summary statistics to describe the characteristics of the participants [lewis1999statistical, schulz2010consort].
Step 2. Analyze individual replications. We propose to preprocess, describe and analyze the data of each replication with consistent statistical techniques. The objectives of this step are to provide descriptive statistics to ease the incorporation of results into prospective studies (e.g., by facilitating the recalculation of effect sizes [borenstein2011introduction]), identify patterns across replication results, and ensure that statistical heterogeneity is not introduced by the different methods used to analyze the replications [silberzahn2018many]. This step can be further broken down into three main activities:

As in MCTs [lewis1999statistical, stewart2015preferred, schulz2010consort, de2017rational], we propose to provide summary statistics and visualizations (e.g., box plots or violin plots) to describe the data of each replication and use consistent preprocessing steps to remove outliers or replace missing data [schafer2002missing, little2012prevention].
We adapt this step to SE by rounding out the summary statistics and box plots with a profile plot [alasuutari2008sage] showing the mean of the treatments across the replications (see Figure 3).

We analyze each replication with consistent analyses (e.g., t-test, ANOVA, etc. [lewis1999statistical, schulz2010consort, de2017rational]). This ensures consistency of results across the replications and eases the integration of results in later phases.
Step 3. Aggregate the results. Following analysis guidelines for MCTs [lewis1999statistical, schulz2010consort, stewart2015preferred], the results of the individually analyzed replications are aggregated to arrive at joint conclusions. The objective of this step is to increase the reliability of joint conclusions. We adapt this step to SE by proposing three guidelines, each specifically tailored to address Limitations 1 to 3 discussed in Section 5:

Guideline 1 draws upon arguments from groups of data analysis experts in mature experimental disciplines [bero1995cochrane, anello2005multicentre, lewis1999statistical, stewart2015preferred] and the latest recommendations provided by statistical reformers and associations [wasserstein2016asa, cumming2013understanding, mcelreath2015statistical] to suggest avoiding the use of narrative synthesis and aggregation of p-values to provide joint conclusions.

Guideline 2 recommends avoiding IPDMT by default. Identical advice has been already provided in mature experimental disciplines such as medicine and pharmacology [kraemer2000pitfalls, abo2013individual, feaster2011modeling, stewart2015preferred, kahan2013analysis].

Guideline 3 draws on arguments from various resources regarding linear mixed models [brown2014applied, hox2010multilevel], references comparing the performance of IPDS and AD [stewart2002ipd], and articles comparing the performance of various IPDS models [burke2017meta] to encourage the use of both AD and IPDS in tandem to analyze groups of SE replications. A similar recommendation has already been provided in mature experimental disciplines such as medicine and pharmacology [tierney2015individual, smith2011individual].
We also adapt this step to SE by proposing the use of random-effects models (rather than the fixed-effects models typically used in MCTs [whitehead2002meta, anello2005multicentre]) in the two activities into which this step is divided: apply AD and apply IPDS. Similar advice has also been given under similar circumstances in other disciplines such as the social sciences [borenstein2011introduction, greco2013meta, clark2015should].
Step 4. Conduct exploratory analyses. As in MCTs [schulz2010consort, stewart2015preferred, lau1998summing, tierney2015individual, de2017rational], exploratory analyses should be conducted after providing joint conclusions. The objective of this step is to identify experiment-level and participant-level moderators that may be behind the statistical heterogeneity commonly present in groups of SE replications. We adapt this step to SE by developing three new guidelines to address Limitation 4 discussed in Section 5. To do this, we rely on the recommendations provided in references on data analysis in the social sciences, biology, and medicine:

Guideline 4 provides guidance on how to identify experiment-level moderators by means of AD and IPDS [lau1998summing, quinn2002experimental, abo2013individual, fisher2011critical, cooper2009relative, higgins2001meta].

Guideline 5 provides guidance on how to identify participant-level moderators by means of IPDS [lau1998summing, quinn2002experimental, abo2013individual, fisher2011critical, fisher2017meta].

Guideline 6 outlines the limitations of exploratory analyses [schulz2010consort, stewart2015preferred, lau1998summing, tierney2015individual, quinn2002experimental, whitehead2002meta, cumming2013understanding].
We also adapt the procedure for identifying moderators to SE by suggesting an increase in the statistical significance threshold from 0.05 to 0.10, at the cost of a larger proportion of statistical errors [whitehead2002meta]. We also suggest that less attention be paid to p-values, evaluating instead the relevance of moderator effect sizes and their respective 95% CIs. These recommendations account for the typically low number of replications and small sample sizes of groups of SE replications, which are an obstacle to moderator detection [adrisms]. These two adaptations are applied in the first two activities of this step (identify experiment-level moderators and identify participant-level moderators) and acknowledged in the last activity (acknowledge limitations of exploratory analyses).
Table II summarizes the steps of the procedure that we propose for analyzing groups of SE replications, the objectives of each step, and their associated activities. Table III links each of the proposed steps with the guidelines from medicine recommending the respective step, and the adaptation that we made for SE.
Step  Objectives  Activity 
1. Describe participants   Inform about the population under assessment  1.1. Provide summary statistics 
 Hypothesize on possible sources of heterogeneity  1.2. Provide profile plot  
2. Analyze individual replications   Ease incorporation of results into prospective studies  2.1. Provide summary statistics and visualizations 
 Identify patterns across replication results  2.2. Provide profile plot  
 Avoid heterogeneity due to different analysis procedures  2.3. Perform consistent individual analyses  
3. Aggregate results   Maximize informativeness of joint conclusions  3.1. Apply AD 
3.2. Apply IPDS  
4. Conduct exploratory analyses   Identify experiment-level moderators  4.1. Identify experiment-level moderators 
 Identify participant-level moderators  4.2. Identify participant-level moderators  
4.3. Acknowledge limitations of exploratory analyses 
Step  Recommended in…  Adaptation to SE  Adapted from… 
1. Describe participants  [lewis1999statistical, schulz2010consort]  ✓ Provide profile plot  [alasuutari2008sage] 
2. Analyze individual replications  [lewis1999statistical, stewart2015preferred, schulz2010consort, de2017rational]  ✓ Provide profile plot  [alasuutari2008sage] 
3. Aggregate results  [anello2005multicentre, lewis1999statistical, stewart2015preferred, schulz2010consort, feaster2011modeling]  ✓ Avoid narrative synthesis & aggregation of p-values  [cumming2013understanding, bero1995cochrane, wasserstein2016asa, mcelreath2015statistical] 
✓ Avoid IPDMT  [kraemer2000pitfalls, abo2013individual, kahan2013analysis]  
✓ Use AD & IPDS in tandem  [stewart2002ipd, brown2014applied, hox2010multilevel, burke2017meta, tierney2015individual, smith2011individual]  
✓ Use random-effects models  [whitehead2002meta]  
4. Conduct exploratory analyses  [schulz2010consort, stewart2015preferred, lau1998summing, tierney2015individual, de2017rational, fisher2011critical]  ✓ Use AD & IPDS to assess experiment-level moderators  [cooper2009relative, quinn2002experimental, abo2013individual, higgins2001meta] 
✓ Use IPDS to identify participant-level moderators  [quinn2002experimental, abo2013individual, fisher2017meta]  
✓ Acknowledge limitations of exploratory analyses  [whitehead2002meta, cumming2013understanding, quinn2002experimental]  
✓ Increase statistical threshold  [adrisms, whitehead2002meta]  
✓ Evaluate effect size and 95% CI  [adrisms] 
In the next four sections we outline the four steps of the procedure that we propose for analyzing groups of SE replications. For illustrative purposes, we apply the respective step to analyze the stereotypical group of replications described in Section 3. As Step 1 (i.e., describe the participants) and Step 2 (i.e., analyze individual replications) require no further explanation, we merely apply the steps to analyze the illustrative group of replications in Sections 7 and 8. As Step 3 (i.e., aggregate the results) and Step 4 (i.e., conduct exploratory analysis) embed a set of guidelines that require further explanation, we first develop the guidelines and then go on to illustrate their application to the group of replications in Sections 9 and 10.
7 Step 1: Describe the participants
Activity 1.1. Provide summary statistics. The provision of summary statistics of the characteristics of the participants offers information about the population to which the results are to be generalized.
Example. Table IV shows the means and standard deviations of participant programming, Java, unit testing and JUnit experience (measured by self-assessment as inexperienced, novice, intermediate and expert) across the replications. We do not find any clear patterns for averaged participant-level characteristics (e.g., averaged experience levels for FSecure O and FSecure H participants alternate between programming and Java) across replications.
Experiment  Prog.  Java  Unit  JUnit 
FSecure H  3.67 (0.52)  2.33 (1.21)  2.17 (0.98)  2.17 (1.17) 
FSecure K  2.91 (0.70)  1.82 (0.87)  1.64 (0.5)  1.27 (0.47) 
FSecure O  3.29 (0.76)  2.71 (1.11)  2.71 (0.76)  2 (0.82) 
UPV  2.36 (0.57)  1.88 (0.60)  1.04 (0.20)  1 (0) 
Activity 1.2. Provide profile plot. The provision of a profile plot showing the mean experience of the participants across replications may enhance the understandability of the summary statistics, help to convey the variability of participant characteristics across replications, ease the identification of patterns in the characteristics of the participants across experiments, and help to identify potential sources of heterogeneity.
Example. Figure 1 shows the profile plot of the illustrative group of replications. Figure 1 indicates that there is an observable decreasing trend in averaged participant experience across replications: participants have relatively more experience with programming and Java than with unit testing or JUnit across all the replications. There appears to be a noticeable heterogeneity of averaged participant experience. This may result in statistical heterogeneity when providing joint conclusions.
Summary of example. A heterogeneous group of developers participates in the group of replications, with the most senior developers at FSecure H and FSecure O, and the most junior at UPV.
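As a rough textual stand-in for the profile plot, the means reported in Table IV can be checked programmatically. The sketch below copies the four mean values per replication from Table IV; averaging the four experience means into a single "overall" score is our own simplifying assumption, used only to rank the sites, and is not a measure defined in the procedure.

```python
from statistics import mean

# Mean self-assessed experience per replication, copied from Table IV
# (order: programming, Java, unit testing, JUnit).
table_iv_means = {
    "FSecure H": [3.67, 2.33, 2.17, 2.17],
    "FSecure K": [2.91, 1.82, 1.64, 1.27],
    "FSecure O": [3.29, 2.71, 2.71, 2.00],
    "UPV": [2.36, 1.88, 1.04, 1.00],
}

# Decreasing trend: experience never increases when moving from
# programming to Java to unit testing to JUnit.
decreasing_trend = {
    site: all(a >= b for a, b in zip(means, means[1:]))
    for site, means in table_iv_means.items()
}

# Hypothetical summary score: unweighted average of the four means.
overall = {site: mean(means) for site, means in table_iv_means.items()}
ranking = sorted(overall, key=overall.get, reverse=True)

for site in ranking:
    print(f"{site}: overall={overall[site]:.2f}, decreasing={decreasing_trend[site]}")
```

With the Table IV values, every replication shows the decreasing trend, and the ranking places FSecure O and FSecure H first and UPV last, in line with the summary of the example.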
8 Step 2: Analyze individual replications
Activity 2.1. Provide summary statistics and visualizations. The summary statistics and box plots provide information about the distribution of the data, facilitate the incorporation of results into prospective studies, and minimize the heterogeneity of results due to the application of different analysis techniques.
Example. Table V shows the descriptive statistics for QLTY with ITL and TDD across all replications. The respective box plots—and violin plots—are shown in Figure 2.
Experiment  Treat.  N  Mean  Corr  SD  Median 
FSecure H  ITL  0.59  
TDD  
FSecure K  ITL  0.42  
TDD  
FSecure O  ITL  0.52  
TDD  
UPV  ITL  0.47  
TDD 
As Figure 2 shows, TDD appears to outperform ITL in all replications. However, while the difference in performance between ITL and TDD is small for FSecure H and FSecure K, the difference appears to be larger for FSecure O and UPV. Noticeable—and similar—correlations appear to have materialized among the QLTY scores of the participants across replications (i.e., correlations around 0.5). This will result in greater statistical power (i.e., smaller effect size variances) when analyzing each replication and calculating their respective effect sizes [cumming2013understanding, morris2002combining]. Finally, some data points (at the bottom of the distributions) for FSecure O or UPV may be considered outliers. However, due to the already small sample sizes of the replications and missing data for UPV (two participants have data for TDD only, and another two have none), we do not remove any potential outlier from the data analysis.
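The link between the correlation and statistical power can be made explicit with a short worked formula. It is a sketch under the assumption of a repeated-measures design with $n$ participants, standard deviations $\sigma_{ITL}$ and $\sigma_{TDD}$, and a correlation $r$ between the two scores of each participant; the symbols are ours and do not appear in the tables.

```latex
\[
\operatorname{Var}(\bar{d})
  = \frac{\sigma_{ITL}^{2} + \sigma_{TDD}^{2} - 2\, r\, \sigma_{ITL}\, \sigma_{TDD}}{n}
\]
% Special case with equal standard deviations (\sigma_{ITL} = \sigma_{TDD} = \sigma):
\[
\operatorname{Var}(\bar{d}) = \frac{2\sigma^{2}(1 - r)}{n},
\qquad
r = 0.5 \;\Rightarrow\; \operatorname{Var}(\bar{d}) = \frac{\sigma^{2}}{n}
\]
```

Under the equal-variance assumption, a correlation of around 0.5 halves the variance of the mean difference relative to uncorrelated scores ($2\sigma^{2}/n$), which is why positive within-participant correlations translate into smaller effect size variances.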
Activity 2.2. Provide profile plot. A profile plot to complement the descriptive statistics provides a bird’s-eye view of the data and helps identify patterns in results.
Example. A profile plot showing the mean QLTY score per treatment across replications is provided in Figure 3.
As Figure 3 shows, TDD appears to outperform ITL across all replications as observed in the violin plot above. The extent to which TDD outperforms ITL varies widely across replications (see the different slopes of the lines). The different slopes indicate that there may be heterogeneity in the group of replications. Besides, there is no apparent pattern between ITL and TDD mean QLTY scores: the larger improvements with TDD over ITL (see the lines with the highest slopes) are achieved in the replications with the lowest and highest ITL mean scores (i.e., UPV and FSecure O replications, respectively). By chance, they are the replications with the most novice and senior developers, respectively. Thus, we cannot hypothesize, in principle, on any moderator that may be impacting results.
Activity 2.3. Perform consistent individual analyses. An analysis of the replications with consistent statistical methods ensures that differences across experiment results are not due to the use of different analysis procedures.
Example. Since all replications have an identical AB repeated-measures experimental design [wohlin2012experimentation], we analyze each of them with a dependent t-test [field2013discovering]. Table VI shows the results of the individual t-tests performed.
Experiment  Estimate  95% CI  p-value 
FSecure H  9.52  (-19.58, 38.62)  0.483 
FSecure K  13.26  (-7.26, 33.77)  0.193 
FSecure O  52.91  (30.44, 75.39)  0.001 
UPV  42.31  (29.02, 55.62)  0.001 
As Table VI shows, TDD outperforms ITL in all replications. Besides, the difference in performance between TDD and ITL is large and statistically significant for FSecure O and UPV but not for FSecure H and FSecure K.
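The core of the dependent t-test applied to each replication can be sketched as follows. The scores below are hypothetical and are not data from any of the replications; the function name is ours. The sketch stops at the t statistic and its degrees of freedom, because the p-value and the 95% CI additionally require the t distribution, which a statistics package (for instance, `scipy.stats.ttest_rel`) would normally supply.

```python
import math
from statistics import mean, stdev


def dependent_t(control_scores, treatment_scores):
    """Paired (dependent) t statistic for a repeated-measures design.

    Each position in the two lists belongs to the same participant.
    Returns the mean difference, its standard error, t, and the
    degrees of freedom.
    """
    diffs = [t - c for c, t in zip(control_scores, treatment_scores)]
    n = len(diffs)
    mean_diff = mean(diffs)
    std_error = stdev(diffs) / math.sqrt(n)  # sample SD of the differences
    return mean_diff, std_error, mean_diff / std_error, n - 1


# Hypothetical QLTY scores of six participants under ITL and TDD.
itl = [20, 35, 10, 40, 25, 30]
tdd = [30, 50, 15, 70, 45, 40]

estimate, std_error, t_stat, df = dependent_t(itl, tdd)
print(f"estimate={estimate:.2f}, SE={std_error:.2f}, t={t_stat:.2f}, df={df}")
```

Applying the same function to every replication is one way of ensuring that differences across results are not introduced by the analysis method itself.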
Summary of example. TDD outperforms ITL in all replications. However, the extent to which TDD outperforms ITL seems largely dependent upon site.
9 Step 3: Aggregate the results
In Sections 9.1, 9.2 and 9.3, we outline the three guidelines that we propose to overcome the most common limitations of groups of SE replications when providing joint conclusions. Each description includes the application of the guideline to the illustrative group of replications.
9.1 Avoid narrative synthesis and aggregation of p-values
9.1.1 Perils of narrative synthesis and aggregation of p-values
Although p-values are commonly used to evaluate the statistical significance of results, numerous criticisms have been made with respect to their inappropriate use across the sciences [nickerson2000null, cohen1994earth]. The dichotomization of evidence arising from the indiscriminate use of statistical thresholds (such as 0.05 [cohen1994earth]), and the inability of p-values to convey the relevance of results (because they confound sample size and effect size [nickerson2000null]) are just two well-known criticisms of p-values [nickerson2000null]. As effect sizes and 95% CIs can also be used to assess the statistical significance of results (i.e., if the 95% CI of the effect size does not cross 0, then results are statistically significant), some authors have suggested that effect sizes and 95% CIs should be used instead of p-values [wasserstein2016asa, cumming2013understanding, mcelreath2015statistical].
Bearing this in mind, neither narrative synthesis nor aggregation of p-values appears to be suitable for providing joint conclusions in groups of SE replications: narrative synthesis yields neither an effect size nor a p-value (it merely provides a textual summary of results), whereas aggregation of p-values provides a joint p-value but not an effect size.
Narrative synthesis and aggregation of p-values have another shortcoming: narrative synthesis weights each replication subjectively, while aggregation of p-values weights each replication within the joint conclusion identically [borenstein2011introduction]. Both types of weighting may be undesirable in groups of replications with different sample sizes. For example, it may be preferable for larger (in principle, more precise) replications to have a greater weight than small replications within the joint conclusion, for industrial (in principle, more representative) replications to have a greater weight than academic replications, or for higher quality experiments to have a greater weight within the joint conclusion.
Finally, narrative synthesis has another relevant shortcoming when providing joint conclusions. Very often, individually nonsignificant results lead to a statistically significant joint result when they are combined [borenstein2011introduction]. However, the joint conclusion of narrative synthesis would be nonsignificant if there are more nonsignificant results than significant results (as nonsignificant is the winner [borenstein2011introduction]). Narrative synthesis has been known since the 1980s to have low statistical power [hedges1980vote, borenstein2011introduction].
9.1.2 Application to the illustrative group of replications
Narrative synthesis is simply applied by providing a textual summary of results of the replications (i.e., their effect sizes and p-values [borenstein2011introduction]).
Applying narrative synthesis to the illustrative group of replications yields a summary such as the following: "…even though TDD outperformed ITL in all the replications, the extent of such outperformance was largely dependent upon site. Besides, the difference in performance between TDD and ITL was statistically significant only for FSecure O and UPV. Thus, conflicting results materialized in terms of statistical significance: two replications provided nonsignificant results, while two others provided significant results. As an identical number of replications point in opposite directions (i.e., nonsignificant vs. significant), no final claims can be made about the statistical significance of results. More replications are needed to argue the statistical significance and practical relevance of results."
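The tallying logic that underlies this kind of narrative conclusion can be sketched in a few lines. The p-values are copied as printed in Table VI; the function name, the 0.05 threshold as a default argument, and the "inconclusive" label for ties are our own illustrative choices rather than part of any formal narrative synthesis method.

```python
# p-values of the individual analyses, as printed in Table VI.
p_values = {
    "FSecure H": 0.483,
    "FSecure K": 0.193,
    "FSecure O": 0.001,
    "UPV": 0.001,
}


def vote_count(p_values, alpha=0.05):
    """Tally significant vs. nonsignificant replications.

    Note what is ignored: effect sizes, sample sizes, and precision.
    Every replication counts as one vote.
    """
    significant = sum(1 for p in p_values.values() if p < alpha)
    nonsignificant = len(p_values) - significant
    if significant > nonsignificant:
        verdict = "significant"
    elif nonsignificant > significant:
        verdict = "nonsignificant"
    else:
        verdict = "inconclusive"
    return significant, nonsignificant, verdict


print(vote_count(p_values))
```

The tie between two significant and two nonsignificant replications is what prevents narrative synthesis from reaching a joint conclusion here, even though all four estimates point in the same direction.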
Aggregation of p-values procedures typically involve [borenstein2011introduction, whitehead2002meta]: (1) the individual analysis of each replication with a one-sided statistical test; and (2) the later combination of the resulting p-values by a statistical technique like Fisher’s method.
We first analyzed each replication independently by means of a one-sided dependent t-test [field2013discovering]. Then, we used Fisher’s method [borenstein2011introduction] to pool together the p-values of all the replications. The result is a statistically significant difference between TDD and ITL as a joint conclusion (χ²=47.13; df=8; p<0.001). Thus, the difference in performance between TDD and ITL is statistically significant in at least one replication. However, this was already known before aggregating the results (as FSecure O and UPV’s results were already statistically significant).
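Fisher's method itself is compact enough to sketch. The statistic is minus twice the sum of the log p-values, referred to a chi-square distribution with twice as many degrees of freedom as there are p-values; because the degrees of freedom are always even, the tail probability has a closed form and needs no statistics library. The one-sided p-values below are hypothetical placeholders, not the values obtained in the replications.

```python
import math


def fisher_combined(p_values):
    """Combine independent one-sided p-values with Fisher's method.

    Returns (chi-square statistic, degrees of freedom, joint p-value).
    """
    k = len(p_values)
    chi2 = -2.0 * sum(math.log(p) for p in p_values)
    half = chi2 / 2.0
    # Chi-square survival function for even df = 2k (closed form).
    joint_p = math.exp(-half) * sum(half**i / math.factorial(i) for i in range(k))
    return chi2, 2 * k, joint_p


# Hypothetical one-sided p-values for four replications.
one_sided = [0.24, 0.10, 0.001, 0.0005]
chi2, df, joint_p = fisher_combined(one_sided)
print(f"chi2={chi2:.2f}, df={df}, joint p={joint_p:.2e}")
```

Note that the output contains a joint p-value but no joint effect size, and that each replication enters with the same weight regardless of its sample size.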
Summary of example. Neither narrative synthesis nor aggregation of p-values provides informative joint conclusions. Narrative synthesis fails to provide a joint effect size or p-value and is not able to provide final claims since there are two significant results versus two nonsignificant results in our example. Aggregation of p-values fails because it provides a joint conclusion that was already known before results aggregation.
9.1.3 Guideline 1: Avoid narrative synthesis and aggregation of p-values
Avoid narrative synthesis and aggregation of p-values to provide joint conclusions.
What impact may this guideline have on the findings of joint analyses of groups of SE replications? More informative joint conclusions could have been obtained for 53% of the groups of replications (i.e., groups that applied either narrative synthesis or aggregation of p-values, see Section 5). Not applying weak aggregation techniques should enhance the findings of groups of SE replications.
9.2 Avoid IPDMT
9.2.1 Perils of IPDMT
IPDMT should be avoided on two grounds. First, it may be underpowered compared to an identical IPDS model including a factor accounting for the experiment [chu2011comparing, kahan2013assessing]. In other words, IPDMT may provide a statistically nonsignificant result when it should be statistically significant. Second, IPDMT may provide biased results [kraemer2000pitfalls, kwok2008analyzing] when data are unbalanced across treatments and replications (which may be the case in groups of replications with missing data and different sample sizes) and subjects are more similar within, than between, replications (which may be the case when either professionals or students participate in the replications). Here we illustrate the perils of IPDMT with an intuitive extreme example where it provides a biased result. Like Kraemer’s example to illustrate the perils of IPDMT [kraemer2000pitfalls], we produce our example by means of simulation [cumming2013understanding].
In particular, let us simulate two hypothetical replications comparing the performance of two technologies (e.g., Technology A vs. Technology B) on a continuous outcome of interest (e.g., quality). For simplicity’s sake, let us suppose that the replications have an identical experimental design: an AB between-subjects design (i.e., a design where each participant is assigned to either Technology A or B). It is straightforward to simulate a group of replications with such characteristics using random draws from the data distributions that simulate the quality scores achieved with Technologies A and B across the replications [cumming2013understanding]. Each random draw will represent the quality score achieved by a hypothetical (i.e., simulated) participant. SE data may follow a myriad of data distributions [kitchenham2016robust]. For illustrative purposes, we simulate the performance of Technologies A and B with normal distributions [cumming2013understanding], although many other data distributions could have been used with the same results [cumming2013understanding]. Table VII shows the normal distributions that we use to simulate the performance of Technologies A and B across the replications, and the sample sizes of each of the groups (i.e., the number of participants assigned to either Technology A or B) across the replications.
Technology A  Technology B  
Exp. 1  QLTY  
Sample Size  90  10  
Exp. 2  QLTY  
Sample Size  10  90 
As Table VII shows, we aim to simulate two highly unbalanced replications (i.e., 90 subjects assigned to Technology A, and 10 to Technology B in Experiment 1, and vice versa in Experiment 2). Additionally, we aim to simulate a circumstance where the mean difference in performance between Technologies B and A is expected to be around 10 in both replications (i.e., 30 - 20 in Experiment 1 and 70 - 60 in Experiment 2), and the participants are more similar within, than between, replications (as they achieve either much larger or much smaller scores with either Technology A or B depending upon the replication in which they participate). These are the exact circumstances under which IPDMT provides biased results.
As the difference in performance between Technologies B and A in both replications is around 10, we would expect the difference in performance to be similar for joint results.
We analyzed the data with both IPDS and IPDMT (i.e., ANOVA models that did and did not include Experiment as a factor, respectively). IPDS provides an estimate close to the expected (). IPDMT provides an estimate that deviates from the expected (). This is because IPDMT does not take into account the experiment that is the source of the data and, instead, assumes that all the data come from a single "big" experiment. As such, IPDMT is unaware that most subjects contributing towards the mean quality score with Technology A (90/100) come from Experiment 1 (with mean scores of 20), whereas most subjects contributing towards Technology B (90/100) come from Experiment 2 (with mean scores of 70). As a result, the imbalance of subjects across the treatments and the dissimilarities of participant scores across the replications distort IPDMT results (i.e., by providing a much larger difference than expected). The larger the imbalance across treatments and/or sample sizes across the replications, the more biased IPDMT results will be.
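The simulation can be reproduced in spirit with the sketch below. The means (20 and 30 in Experiment 1, 60 and 70 in Experiment 2) and the group sizes follow the description of Table VII; the standard deviation of 5, the random seed, and all variable names are assumptions of ours, since the simulated distributions are not fully specified here. The IPDS estimate is computed as the treatment effect of an additive model with Treatment and Experiment as factors, which for this design equals a weighted average of the within-experiment differences.

```python
import random
from statistics import mean

random.seed(1)   # arbitrary seed, for reproducibility only
ASSUMED_SD = 5.0  # assumption: the simulated SD is not given in the text

# (mean, sample size) per technology and experiment, following Table VII.
design = {
    "Exp. 1": {"A": (20, 90), "B": (30, 10)},
    "Exp. 2": {"A": (60, 10), "B": (70, 90)},
}
data = {
    exp: {tech: [random.gauss(mu, ASSUMED_SD) for _ in range(n)]
          for tech, (mu, n) in cells.items()}
    for exp, cells in design.items()
}

# IPDMT: pool everything and ignore the experiment of origin.
all_a = [score for cells in data.values() for score in cells["A"]]
all_b = [score for cells in data.values() for score in cells["B"]]
ipdmt_estimate = mean(all_b) - mean(all_a)

# IPDS: treatment effect adjusted for Experiment (additive model), i.e. a
# weighted average of within-experiment differences.
numerator = denominator = 0.0
for cells in data.values():
    n_a, n_b = len(cells["A"]), len(cells["B"])
    weight = n_a * n_b / (n_a + n_b)
    numerator += weight * (mean(cells["B"]) - mean(cells["A"]))
    denominator += weight
ipds_estimate = numerator / denominator

print(f"IPDMT estimate: {ipdmt_estimate:.1f}")
print(f"IPDS estimate:  {ipds_estimate:.1f}")
```

In expectation, the pooled means are 24 for Technology A and 66 for Technology B, so IPDMT points to a difference of about 42, whereas the stratified estimate stays near the true within-experiment difference of 10.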
9.2.2 Application to the illustrative group of replications
To apply IPDMT, the raw data of all replications are pooled together as if they came from one big experiment and then the same statistical model as used to analyze each experiment individually is applied [kraemer2000pitfalls, abo2013individual, feaster2011modeling].
We apply a dependent t-test to analyze the raw data of all the replications together. IPDMT provides a joint estimate equal to and a p-value<0.001. Thus, the difference in performance between TDD and ITL is large (insofar as QLTY ranges from 0 to 100) and statistically significant.
To apply IPDS, raw data from all replications are pooled together, and then two factors (Treatment and Experiment) are used [kraemer2000pitfalls, abo2013individual, feaster2011modeling]. In other words, a linear regression model (e.g., ANOVA) that takes into account the source of the raw data is fitted.
We applied an ANOVA with Treatment and Experiment as factors to analyze the raw data of the illustrative group of replications. IPDS provides a joint estimate equal to and a p-value<0.001. Thus, the difference in performance between TDD and ITL is large (insofar as QLTY ranges from 0 to 100) and statistically significant.
So, both IPDMT and IPDS provide similar results in the illustrative group of replications (i.e., a large and statistically significant result). This is because data are perfectly balanced within replications (as the replications are AB repeated-measures designs), missing data have a relatively low impact on results (as just a few subjects have missing data for UPV), and subject scores are similar across the replications (note that the mean scores for ITL are clustered around 25; see Figure 3, Section 8). However, this cannot be guaranteed in all circumstances.
9.2.3 Guideline 2: Avoid IPDMT
Avoid IPDMT due to its potential to provide biased or underpowered results.
What impact may this guideline have on the findings of joint analyses of groups of SE replications? Thirty-three percent of the groups of replications (i.e., where IPDMT was applied to provide joint conclusions, see Section 5) could have arrived at less biased and less underpowered joint conclusions. Not applying IPDMT should enhance the findings of groups of SE replications.
9.3 Use AD and IPDS in tandem for joint analysis
9.3.1 Benefits of using AD plus IPDS
AD and IPDS are complementary in some respects. While AD provides certain advantages for analyzing groups of SE replications, IPDS provides others. Some of the advantages of AD over IPDS are:

AD can be used to analyze groups of replications with different response variables (e.g., by computing standardized effect sizes such as Cohen’s d [borenstein2011introduction]). On the contrary, IPDS can only be applied whenever identical response variable scales are used [whitehead2002meta, stewart2002ipd]. Thus, if response variables change across the replications, AD may be the only available option.

AD provides intuitive visual summaries of results (i.e., forest plots) that have been commonly used in SE to synthesize the findings of experiments gathered by means of systematic literature reviews [kitchenham2004procedures]. On the contrary, less standardized visualizations are available for IPDS (e.g., 95% CI plots, error bars, etc. [cumming2013understanding]). Thus, the appeal and familiarity of forest plots in SE is a plus for AD over IPDS.

AD is useful for interpreting the heterogeneity of results with straightforward statistics and tests (e.g., the $I^2$ statistic and the $Q$ test [borenstein2011introduction]). Besides, rules of thumb are also available for interpreting the $I^2$ statistic (i.e., 25%, 50% and 75% for small, medium, and large heterogeneity, respectively [borenstein2011introduction]). On the contrary, IPDS may require either: (1) the standard deviation of results to be contrasted against the joint result; or (2) fixed-effects models (such as ANOVA) to be used with Treatment by Experiment interaction terms to be able to claim that results are heterogeneous when the interaction is statistically significant [whitehead2002meta]. However, we do not encourage the latter procedure in groups of SE replications: as such groups are typically small, this method of heterogeneity detection would be underpowered [whitehead2002meta].
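The heterogeneity statistics mentioned in the last item can be sketched as follows. This is a minimal illustration, assuming the usual inverse-variance definitions of Cochran's Q and of I² as the percentage of Q in excess of its degrees of freedom; the function name and the input values are hypothetical and are not the effect sizes of the illustrative group of replications.

```python
def heterogeneity(effects, variances):
    """Return Cochran's Q, its degrees of freedom, and I^2 (in percent)."""
    weights = [1.0 / v for v in variances]  # inverse-variance weights
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    q = sum(w * (d - pooled) ** 2 for w, d in zip(weights, effects))
    df = len(effects) - 1
    # I^2 is truncated at 0 when Q is smaller than its degrees of freedom.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, df, i2


# Hypothetical standardized effect sizes and variances of four replications.
effects = [0.2, 0.5, 0.8, 1.4]
variances = [0.04, 0.09, 0.06, 0.05]
q, df, i2 = heterogeneity(effects, variances)
print(f"Q = {q:.2f} on {df} degrees of freedom, I^2 = {i2:.1f}%")
```

The p-value of the Q test would be obtained by comparing Q against a chi-square distribution with `df` degrees of freedom, which is omitted here to keep the sketch free of external libraries.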
On the other hand, IPDS has some advantages over AD:

IPDS can simultaneously assess the difference in performance between the treatment and the control group (like AD), as well as the performance of the control group in order to weight their relative size in natural units. For example, if the difference in performance between the means of the treatment group and the control group is equal to 20, and the mean performance of the control group is equal to 20, then the treatment doubles the performance of the control. On the contrary, AD commonly relies on standardized effect sizes (e.g., Cohen’s d [borenstein2011introduction]) to convey the difference in performance between the treatment and control. This may affect the interpretability of results: how relevant is a Cohen’s d of 0.3?

IPDS can simultaneously assess the effect of multiple factors and their interactions on results (e.g., the effects of the treatments, the tasks, and their interaction in ANOVA models [wohlin2012experimentation]). On the contrary, AD is commonly used to perform pairwise comparisons between treatments (e.g., Treatment A vs. Treatment B [borenstein2011introduction]). Thus, IPDS is more flexible than AD for analyzing groups of replications when multiple factors are of interest or the results depend upon interaction terms (e.g., when the effects of the treatment reverse depending upon the task being developed).

Some IPDS models such as LMMs can be used to analyze groups of replications with missing data, provided that the data can be assumed to be missing at random [brown2014applied]. On the contrary, the calculation of effect sizes (and their respective variances) using AD rests on the assumption of complete observations (otherwise, it would not be possible to compute the variances of some effect sizes for repeated-measures designs [borenstein2011introduction]). If there are missing data, researchers performing AD to calculate effect sizes and their respective variances may have to either exclude participants with missing data or rely on advanced imputation techniques for their inclusion (see Section 12). Thus, if there are dropouts or protocol deviators across the replications, IPDS models such as LMMs may come in handy [twisk2013multiple].
Therefore, the application of both techniques in tandem takes advantage of the strengths of each one. Regarding the type of model to be used, both AD and IPDS are statistical procedures that deliver a weighted average of experiment results as a joint conclusion [whitehead2002meta, borenstein2011introduction]. Assuming a common variance across all replications, the weight (or contribution) of each experiment towards the joint conclusion depends either on the sample size of the experiment, if a fixed-effects model is used, or on both the sample size of the experiment and the statistical heterogeneity of results (i.e., the variation of results that cannot be explained by natural variation), if a random-effects model is used [borenstein2011introduction, brown2014applied]. Besides, if results are more heterogeneous, the weights of all the experiments within the joint conclusion will be more alike. Intuitively, as each experiment may be estimating a potentially different effect size when there is a large heterogeneity of results, smaller experiments are still informative about the distribution of effect sizes (as their effect sizes are also feasible). In turn, both small and large experiments tend to be regarded as being more equally informative in random-effects models (even though larger experiments still have a slightly greater weight within the joint conclusion [borenstein2011introduction]).
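The way heterogeneity evens out the weights can be illustrated with a short sketch. It assumes inverse-variance weighting, where the fixed-effects weight of an experiment is 1/v and the random-effects weight is 1/(v + tau2), with tau2 denoting the between-experiment variance. The function name, the variances and the value of tau2 are hypothetical.

```python
def relative_weights(variances, tau2=0.0):
    """Normalized inverse-variance weights; tau2 = 0 gives fixed-effects weights."""
    raw = [1.0 / (v + tau2) for v in variances]
    total = sum(raw)
    return [w / total for w in raw]


# Hypothetical effect size variances: a large experiment (small variance)
# and a small experiment (large variance).
variances = [0.02, 0.20]

fixed = relative_weights(variances)                 # no heterogeneity assumed
random_ = relative_weights(variances, tau2=0.30)    # large heterogeneity assumed

print("fixed-effects weights: ", [round(w, 2) for w in fixed])
print("random-effects weights:", [round(w, 2) for w in random_])
```

Under these assumed values, the large experiment dominates the fixed-effects average, while in the random-effects average its weight is closer to that of the small experiment, although still greater.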
We recommend relying by default on random-effects models to provide joint conclusions for three reasons: the heterogeneity of results is commonplace in SE experiments [juristo2012replication, sjoberg2007future, hayes1999research, miller1999can, hannay2009effectiveness]; many factors may have an impact on SE experiment results [basili1999building]; and experimental changes, opportunistic recruitment of participants and different types of subjects (e.g., professionals vs. students) are typical in groups of SE replications (see Section 2).
Specifically, if using AD we suggest the use of random-effects meta-analysis models [borenstein2011introduction]. If using IPDS, we suggest the use of linear mixed models (LMMs) [brown2014applied].
9.3.2 Application to the illustrative group of replications
Activity 3.1. Apply AD. The application of AD requires calculating the effect sizes (and corresponding variances) of the replications from their summary statistics. They are then pooled using a random-effects meta-analysis model [borenstein2011introduction].
Example. First, we calculate the Cohen’s d (and corresponding variance) of each replication from its summary statistics (i.e., sample sizes, means, standard deviations, and correlations between ITL and TDD [borenstein2011introduction]). As four subjects at UPV have missing data, their data have to be discarded, since they do not provide the complete observations needed to compute the correlation between ITL and TDD [borenstein2011introduction]. Then, we pool together all Cohen’s ds using a random-effects meta-analysis model [borenstein2011introduction]. Figure 4 shows the forest plot of the meta-analysis.
As Figure 4 shows, TDD outperforms ITL in all replications. Additionally, the joint effect size () is large—according to rules of thumb [borenstein2011introduction]—and statistically significant (as the 95% CI does not cross 0). Besides, there is, according to rules of thumb, a medium () heterogeneity of results [borenstein2011introduction]. Thus, moderators should be identified to explain the detected heterogeneity of results.
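As a rough sketch of the first step of Activity 3.1, the function below computes Cohen's d and its variance for one repeated-measures replication from its summary statistics. It assumes the matched-groups formulas commonly given in introductory meta-analysis texts, in which the standard deviation of the differences is rescaled by the correlation between the two conditions. The function name and the input values are hypothetical and do not correspond to any replication of the illustrative group.

```python
import math


def cohens_d_within(n, mean_a, mean_b, sd_a, sd_b, r):
    """Cohen's d, its variance and a 95% CI for a within-subjects comparison.

    n: number of subjects with complete observations
    r: correlation between the scores under treatments A and B
    """
    sd_diff = math.sqrt(sd_a ** 2 + sd_b ** 2 - 2 * r * sd_a * sd_b)
    sd_within = sd_diff / math.sqrt(2 * (1 - r))
    d = (mean_a - mean_b) / sd_within
    var_d = (1 / n + d ** 2 / (2 * n)) * 2 * (1 - r)
    half_width = 1.96 * math.sqrt(var_d)
    return d, var_d, (d - half_width, d + half_width)


# Hypothetical summary statistics for one replication.
d, var_d, ci = cohens_d_within(n=20, mean_a=60.0, mean_b=50.0,
                               sd_a=10.0, sd_b=10.0, r=0.5)
print(f"d = {d:.2f}, variance = {var_d:.3f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

The correlation r can only be computed from subjects observed under both treatments, which is why participants with missing data have to be discarded before applying AD.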
Activity 3.2. Apply IPDS. IPDS is straightforward to apply. It is sufficient to fit an LMM with two factors, Treatment and Experiment, considering Treatment as a random effect across the experiments [whitehead2002meta, brown2014applied].
Example. To analyze the illustrative group of replications with IPDS, we pool the raw data of all the replications together and then analyze them using an LMM. Table VIII shows the results of the LMM.
Factor  Estimate  95% CI  p-value
ITL  27.44  (13.08, 41.79)  0.001
TDD  56.27  (22.12, 90.42)  0.001
TDD vs. ITL (difference)  28.83  (9.72, 47.93)  0.004
SD of the TDD vs. ITL difference across replications  16.09
As Table VIII shows, the difference in performance between TDD and ITL is relevant (28.83) and statistically significant (p = 0.004). Looking at the difference in performance between TDD and ITL (i.e., 28.83) and the effect of the control approach (i.e., ITL, 27.44), we reach the conclusion that TDD doubles the performance of ITL (i.e., 56.27/27.44 ≈ 2.05). Unlike AD, participants with missing data have been included to provide joint conclusions [brown2014applied]. Finally, the standard deviation of the differences between TDD and ITL across the replications (i.e., 16.09) is relatively large compared with the overall difference (i.e., 28.83): 16.09/28.83 = 0.56. This suggests that there is heterogeneity. Thus, moderator effects should be identified to explain the observed heterogeneity of results.
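The arithmetic behind this interpretation in natural units can be reproduced with a few lines. The sketch below only uses the estimates reported in Table VIII; the variable names are ours.

```python
# Estimates reported in Table VIII (QLTY ranges from 0 to 100).
itl_mean = 27.44          # performance with the control approach (ITL)
tdd_mean = 56.27          # performance with TDD
difference = 28.83        # difference between TDD and ITL
sd_of_difference = 16.09  # SD of the difference across replications

# Size of the effect relative to the control approach.
ratio = tdd_mean / itl_mean
relative_gain = difference / itl_mean

# Spread of the effect across replications relative to its overall size.
heterogeneity_ratio = sd_of_difference / difference

print(f"TDD/ITL ratio: {ratio:.2f}")
print(f"relative gain over ITL: {relative_gain:.0%}")
print(f"SD of difference / difference: {heterogeneity_ratio:.2f}")
```

Expressing the joint result as a ratio to the control mean is only meaningful because the response variable is measured on the same scale in all replications.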
Summary of example. AD indicated that there is a large—and statistically significant—joint effect size with medium heterogeneity. Also, AD was able to visualize that TDD outperformed ITL across all the replications. IPDS showed that TDD doubled the performance of ITL. It also meant that we could include missing data when providing joint conclusions and confirm the statistical significance of results observed with AD.
9.3.3 Guideline 3: Use AD and IPDS

Use AD and IPDS in tandem to provide joint results. Use AD because of its intuitive visualizations (i.e., forest plots) and straightforward heterogeneity statistics. Use IPDS because of its ability to convey joint results in natural units, and its flexibility for analyzing replications with missing data.

Use random-effects models by default to provide joint conclusions. In particular, use LMMs [brown2014applied] for IPDS, and random-effects meta-analysis models [borenstein2011introduction] for AD.
What impact may this guideline have on the findings of joint analyses of groups of SE replications? Adherence to this guideline may have potentially resulted in more intuitive joint conclusions for 38% of the groups of replications (i.e., groups that only applied AD with standardized effect sizes to provide joint conclusions, see Section 5).
10 Step 4: Conduct Exploratory Analyses
In Sections 10.1, 10.2 and 10.3, we outline the three guidelines that we propose to overcome the most common limitations of groups of SE replications for identifying moderators. Each description includes the application of the guideline to the illustrative group of replications.
10.1 Use AD plus IPD in tandem to identify experiment-level moderators
The identification of experiment-level moderators increases knowledge of software development. We suggest applying both AD and IPDS in tandem to identify experiment-level moderators. The benefits of using AD plus IPDS in tandem have already been discussed in Section 9.3.1. Therefore, we will not repeat our arguments here.
10.1.1 Application to the illustrative group of replications
Activity 4.1. Identify experiment-level moderators. To identify experiment-level moderators with AD, perform either a subgroup meta-analysis for categorical moderators or a meta-regression for continuous moderators. To identify experiment-level moderators with IPDS, fit LMMs with interaction terms [whitehead2002meta, brown2014applied].
Example. We run an AD subgroup meta-analysis to assess the effect of the type of subject (i.e., professionals vs. students) on results. Figure 5 shows the forest plot for the subgroup meta-analysis that we performed. Table IX shows the result of the subgroup meta-analysis.
Group  N  Estimate  95% CI  $I^2$
Professionals  3  0.77  (-0.13, 1.68)  70.8%
Students  1  1.24  (0.72, 1.76)  0
Difference  n/a  0.47  (-0.58, 1.52)  n/a
As Table IX shows, both professionals (d = 0.77) and students (d = 1.24) perform better with TDD than with ITL. The difference in performance between students and professionals (0.47) is also relevant: it is medium according to rules of thumb [borenstein2011introduction]. In other words, students appear to benefit more from TDD than professionals. However, despite the relevance of the moderator effect, its 95% CI is wide: it ranges from a medium and negative effect (i.e., -0.58) to a large and positive effect (i.e., 1.52). The moderator effect is therefore not statistically significant, which is due to the small number of replications analyzed. Thus, more replications are needed to increase the precision of experiment-level moderator effects.
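The width of the interval for the moderator effect follows from the precision of the two subgroup estimates. As a minimal sketch, the code below recovers the standard error of each subgroup from its 95% CI and combines them into a CI for the difference. It assumes normal-theory intervals (z = 1.96), independent subgroups, and a professionals' interval of (-0.13, 1.68), i.e., roughly symmetric around 0.77; the function names are ours.

```python
import math

Z = 1.96  # normal quantile for a 95% CI


def se_from_ci(lower, upper, z=Z):
    """Standard error implied by a symmetric normal-theory CI."""
    return (upper - lower) / (2 * z)


def subgroup_difference(est_a, ci_a, est_b, ci_b, z=Z):
    """Difference between two independent subgroup estimates with its 95% CI."""
    se = math.sqrt(se_from_ci(*ci_a, z) ** 2 + se_from_ci(*ci_b, z) ** 2)
    diff = est_a - est_b
    return diff, (diff - z * se, diff + z * se)


# Subgroup estimates from Table IX: students vs. professionals.
diff, (low, high) = subgroup_difference(
    1.24, (0.72, 1.76),    # students
    0.77, (-0.13, 1.68),   # professionals (lower bound assumed negative)
)
print(f"difference = {diff:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Under these assumptions the result agrees, up to rounding, with the interval reported for the difference in Table IX, and it makes explicit that the imprecision of the professionals' estimate dominates the width of the interval.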
Finally, we complement the results of AD with the findings of IPDS. To do this, we run an LMM with interaction terms [whitehead2002meta, brown2014applied]. Table X shows the results of the LMM that we performed.
Interaction  Estimate  95% CI  p-value
Type:Students  16.32  (-37.16, 69.55)  0.545
As Table X shows, the difference in performance between students and professionals with TDD appears to be large (16.32), at least compared with the difference in performance between TDD and ITL in the main analysis (28.83). In view of this, students appear to perform around 60% (i.e., 16.32/28.83) better than professionals using TDD. However, this should be further substantiated with more replications, as the 95% CI ranges from negative to positive results (-37.16, 69.55). Again, the group of replications is too small to detect experiment-level moderators.
Summary of example. Students appear to benefit more than professionals from TDD. But four replications with 6, 11, 7 and 33 subjects are not enough to detect the effect.
10.1.2 Guideline 4: Use AD and IPDS to Identify Experiment-Level Moderators
Use AD and IPDS in tandem to assess experiment-level moderators.
What benefit may this guideline have on joint analysis practices for groups of SE replications? Forty-two percent of the groups of replications (i.e., groups that adopted a textual approach to eliciting experiment-level moderators, see Section 5) could have achieved more transparent moderator effects. Using AD plus IPDS in tandem to identify experiment-level moderators should enhance the findings of groups of SE replications.
10.2 Use IPD to identify participant-level moderators
New knowledge is also gained by identifying participant-level moderators.
10.2.1 Benefits of using IPD to identify participant-level moderators
IPDS is better than AD at identifying participant-level moderators [fisher2011critical, lambert2002comparison]. This is because AD may be underpowered if the averaged participant characteristics do not vary much across the replications [lambert2002comparison, debray2015get] and is subject to ecological bias when identifying participant-level moderators (i.e., the average effect may not be representative of the effect on the population) [berlin2002individual, fisher2011critical, debray2015get]. This may result in misleading conclusions. Thus, as is already common practice in medicine [fisher2011critical, fisher2017meta], we recommend relying by default on IPDS models to identify participant-level moderators.
10.2.2 Application to the illustrative group of replications
Activity 4.2. Identify participant-level moderators. To identify participant-level moderators with IPDS, it is sufficient to fit LMMs with interaction terms [fisher2011critical, fisher2017meta]. As Fisher et al. [fisher2011critical, fisher2017meta] noted, special attention should be paid to separating the variance of moderator effects within and between experiments.
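One common way of separating within-experiment from between-experiment information, sketched below, is to split each participant-level covariate into the mean of its experiment and the participant's deviation from that mean. We assume here that the deviation is the term that would enter the Treatment interaction of the LMM; this centering scheme, the function name and the data are illustrative assumptions rather than a prescription taken from the cited works.

```python
from statistics import fmean


def split_within_between(records):
    """Split a covariate into between- and within-experiment components.

    records: list of (experiment, covariate value) pairs.
    Returns a list of (experiment, experiment mean, deviation from the mean).
    """
    by_experiment = {}
    for experiment, value in records:
        by_experiment.setdefault(experiment, []).append(value)
    means = {e: fmean(values) for e, values in by_experiment.items()}
    return [(e, means[e], value - means[e]) for e, value in records]


# Hypothetical years of programming experience in two replications.
records = [("E1", 2.0), ("E1", 4.0), ("E1", 6.0), ("E2", 10.0), ("E2", 14.0)]
components = split_within_between(records)
for experiment, between, within in components:
    print(experiment, between, within)
```

The within-experiment component compares participants of the same replication, so it is not affected by differences between replications that happen to coincide with differences in average experience.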
Example. We ran a series of LMMs with interaction terms to assess the effect of participant experience with programming, Java, unit testing or JUnit on results. Table XI shows the results of the LMMs. Figure 6 shows the regression plot for the moderator effects.
Interaction  Estimate  95% CI  p-value
Programming  15.76  (0.49, 31.04)  0.04
Java  3.85  (-8.13, 15.83)  0.52
Unit testing  11.79  (-5.95, 29.54)  0.18
JUnit  11.07  (-6.26, 28.41)  0.20
As Figure 6 shows, the more experienced participants are with programming, Java, unit testing or JUnit, the more TDD outperforms ITL. In other words, TDD appears to perform better for more experienced developers and vice versa. Additionally, as Table XI shows, the p-value for programming experience (0.04) was lower than 0.1, the significance threshold that we recommended for identifying moderators (see Section 10). Thus, in principle, programming experience may be moderating the effects of TDD on quality. However, this information should be regarded with caution because there were inflated Type I error rates (due to the execution of a total of five exploratory analyses) and a wide 95% CI (varying from 0.49, an almost negligible effect, to 31.04, a large effect). Thus, the results may be spurious. Finally, participant experience with programming may also be confounded with other variables (e.g., the age of the developers, their experience with Java, etc.).
Summary of example. Participant experience with programming appears to moderate TDD effects.
10.2.3 Guideline 5: Use IPDS to Identify Participant-Level Moderators
Use IPDS to identify participant-level moderators.
What benefit may this guideline have on the findings of joint analyses of groups of SE replications? Seventy-five percent of the groups of replications (i.e., groups that adopted a textual approach to eliciting participant-level moderators or did not elicit participant-level moderators, see Section 5) could have detected more transparent moderator effects. The use of IPDS to identify participant-level moderators should enhance the findings of groups of SE replications.
10.3 Acknowledge limitations of exploratory analyses
The limitations of exploratory analyses should be acknowledged.
10.3.1 Limitations of exploratory analyses
Exploratory analyses have three limitations:

They are unable to provide cause-effect relationships because replications are designed exclusively to study the effects of the treatments on the response variables. Thus, it is impossible to establish the cause-effect relationships of other variables (e.g., moderators) on the response variables. In SE terms, it is risky to claim that the programming language is the reason for the different results detected using different programming languages across two replications. Variables other than the programming language, such as different participant characteristics, may be the real cause of this difference in the results. If any cause-effect claims are to be made about moderators, experiments assessing such questions have to be undertaken beforehand. For instance, a new experiment where participants are randomly assigned to one programming language or the other could serve to identify whether the programming language is the cause of the effects on the results.

They increase the risk of committing statistical errors. Many statistical analyses are typically run to identify moderators (e.g., one per moderator [debray2015get]), which inflates the Type I error rates. In other words, spurious statistically significant results may emerge out of multiple testing merely by chance. Such inflated Type I error rates may need to be corrected with multiple comparison correction procedures (such as the Bonferroni correction [quinn2002experimental]). However, this is troublesome in groups of SE replications: on top of their already small sample sizes and the small number of replications (and, thus, poor moderator detectability), even more demanding statistical thresholds are set for identifying moderators. For instance, according to the Bonferroni correction, a statistical threshold of 0.05/3 ≈ 0.017 may be needed when three different moderators are assessed (one per analysis). Because of the limitations of groups of SE replications in this regard, we recommend either: (1) setting a statistical threshold of 0.1 to identify moderators, despite the heightened probability of committing statistical errors, or (2) focusing more on the magnitude and sign of moderator effects (and their corresponding 95% CIs) rather than on their statistical significance.

Moderator effects can be confounded if multiple simultaneous changes are made across replications. For example, if both the programming language and the unit testing tool change simultaneously across two replications, it may be misleading to claim that differences across replication results are solely due to the programming language. The difference in the results may be due to the programming language, the unit testing tool, or a mixture of both. Another typical case of confounding is to claim that differences across replication results are induced by the subject type when different types of subjects are evaluated across two replications (e.g., professionals vs. students) and the replications provide different results. In particular, other variables may also be behind the difference in results, such as age (e.g., professionals may be older than students), motivation (e.g., students whose grades are at stake may be more motivated than professionals), treatment conformance (e.g., professionals may deviate from the procedure more than students [dieste2017professionals]), or threats to validity that materialize in some replications and not in others (e.g., dropouts, missing data, fatigue).
Researchers running joint analyses of groups of replications should acknowledge the limitations of exploratory analyses in their papers.
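The effect of multiple testing on Type I error rates, and the cost of correcting for it, can be quantified with a short sketch. It assumes independent analyses, which is a simplification (moderators such as experience with Java and JUnit are likely correlated); the function names are ours.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive in n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


# Three moderators assessed at an overall level of 0.05.
print(f"Bonferroni threshold: {bonferroni_threshold(0.05, 3):.4f}")

# Five exploratory analyses, each run at the 0.1 threshold.
print(f"family-wise error rate: {familywise_error_rate(0.1, 5):.3f}")
```

Under the independence assumption, running five analyses at the 0.1 threshold yields a chance of roughly 41% of obtaining at least one spurious statistically significant moderator, which is why the magnitude, sign and 95% CIs of moderator effects deserve more attention than their p-values.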
10.3.2 Application to the illustrative group of replications
Activity 4.3. Acknowledge limitations of exploratory analyses. To acknowledge the limitations of exploratory analyses, it suffices to check through and adapt the list of limitations that we outlined in Section 10.3.1 to the group of replications—and moderators—that are to be investigated.
Example. We plan to investigate the effect of one experiment-level moderator (i.e., type of subject, students vs. professionals) and four participant-level moderators (i.e., experience with programming, Java, unit testing, and JUnit) on results. We acknowledge that none of the above moderators may be the real reason behind the detected heterogeneity of results, and other confounding variables may also be responsible for the detected difference in results. For instance, students appeared to be more motivated than professionals, students adhered more closely to the TDD process, professionals were older than students, etc. Also, we acknowledge that the chance of achieving spurious statistically significant results increases because we intend to run five data analyses (i.e., one per moderator, following Fisher’s approach [fisher2011critical]) [quinn2002experimental]. This may invalidate the conclusions reached. Thus, we will only use exploratory analyses as a way of motivating further research and never to draw definite conclusions [lau1998summing].
Summary of example. The group of replications is too small for making definite claims about the results regarding subject type. Additionally, results for programming experience may be spurious due to the wide 95% CI that materialized. Finally, programming experience may be confounded with other variables (e.g., age of the participants, experience with Java, etc.).
10.3.3 Guideline 6: Acknowledge Limitations of Exploratory Analyses
Separate exploratory analyses from the main analysis and acknowledge their limitations.
What benefit may this guideline have on the findings of joint analyses of groups of SE replications? Thirty-eight percent of the groups of replications that did not perform exploratory analyses (see Section 5), and 88% of the groups of replications that did not acknowledge the limitations of exploratory analyses, could have identified informative moderator effects. The performance of exploratory analyses and acknowledgment of their limitations should enhance the findings of groups of SE replications.
11 Threats to Validity
We focused on the aggregation of quantitative results. What about qualitative results (e.g., text transcripts, etc.)? Throughout this article we focused exclusively on the aggregation of quantitative results into joint conclusions. We acknowledge that this limits the applicability of our guidelines. However, we decided to focus on quantitative results because SE experiments are usually coupled with the acquisition and analysis of this type of results [wohlin2012experimentation, kitchenham2015evidence, stol2018abc] and most of the groups of replications uncovered by our SMS only aggregated quantitative results [adrisms]. For an overview of the methods that can be used for aggregating qualitative results (or qualitative and quantitative results) into joint conclusions, we refer interested readers to Cruzes and Dybå [cruzes2011research] and Kitchenham et al. [kitchenham2015evidence]. They discuss meta-ethnography, narrative synthesis, qualitative cross-case analysis, thematic analysis, meta-summary, vote counting, grounded theory, content analysis, case survey, qualitative comparison analysis, aggregated synthesis, realist synthesis, meta-synthesis and meta-study. The aggregation of qualitative and quantitative results is discussed in [popay2006guidance, dixonwoods2001qualitative, thomas2004integrating].
We used only one analysis procedure. Are there any limitations to the way in which we tailored it to SE? Like [whitehead2002meta, borenstein2011introduction], we acknowledge that it is unfeasible to provide definite guidance on aggregation techniques and statistical models for use across the board, independently of the characteristics of the data or the intentions of the analyst (e.g., to provide joint results or to identify moderators). However, we tried our best to provide tailored statistical advice for aggregating the results of groups of SE replications, considering their common characteristics and the limitations on joint data analysis. This advice is based on the guidelines typically followed in medicine and pharmacology to analyze MCTs [bero1995cochrane, anello2005multicentre, lewis1999statistical, stewart2015preferred] and on our understanding of the statistical methods that can be used in circumstances that are typical of groups of SE replications, at least according to well-known references in medicine, pharmacology, and the social sciences [debray2015get, whitehead2002meta, maas2005sufficient, mcneish2016effect, bell2015explaining]. We acknowledge that this is only a first approximation towards tailoring a definite analysis procedure and that further research is needed in order to provide more evidence on the suitability of the analysis procedure for analyzing groups of SE replications.
We gathered the references on data analysis opportunistically. Might not this introduce bias? Unfortunately, we could not systematically gather references on the topic of how to analyze MCTs (or groups of replications with characteristics typical in SE): the number of articles retrieved from online databases with the terms "aggregation" and "experiments" was unmanageable. Thus, we acknowledge that our guidelines may be open to bias. However, we made every effort to consult reliable resources on the topic of aggregation of experiment results from both mature experimental disciplines, such as medicine and pharmacology, and other areas, such as social research, education and econometrics. We also strove to embed guidelines providing the results not only of single aggregation techniques but also of different aggregation techniques applied in tandem (see the recommendation to use IPDS and AD in tandem). This should reduce the potential bias that may have been introduced due to the non-systematic selection of references on data analysis.
We selected random-effects models over fixed-effects models. Are there not any limitations to this advice? Contrary to medicine, where fixed-effects models are encouraged for use by default [whitehead2002meta, anello2005multicentre, phillips2003e9], we recommend the use of random-effects models instead [borenstein2011introduction, whitehead2002meta, greco2013meta]. This is a controversial recommendation: some authors from other disciplines suggest that a minimum of five [feaster2011modeling], ten [snijders2011multilevel, mcneish2016effect], fifteen or even more experiments [mcneish2016effect, maas2005sufficient, duncan1998context] are needed to obtain reliable variance parameter estimates in random-effects models. However, we recommend the use of random-effects models by default as SE is commonly concerned with providing joint results (i.e., differences between means) rather than making inferences on variance parameters. Additionally, random-effects models tend to provide more conservative results than fixed-effects models (i.e., 95% CIs tend to be wider with random-effects models [whitehead2002meta, petitti2000meta, chen2013applied]), and random-effects models produce identical results to fixed-effects models when there is no heterogeneity [borenstein2011introduction]. Finally, sensitivity analyses assessing the robustness of results to the specification of fixed-effects models (rather than random-effects models) can also be run in groups of SE replications [borenstein2011introduction]. We refer interested readers to Thabane et al. [thabane2013tutorial].
We did not check the statistical assumptions of the statistical tests used (t-tests, LMMs). Is this not a limitation? As usual after analyzing the data of individual experiments [wohlin2012experimentation], it is also necessary to check the statistical assumptions of the statistical tests used after aggregating the results [brown2014applied, hox2010multilevel]. For example, if LMMs are fitted to analyze the data, then the normality assumption needs to be checked [whitehead2002meta, hox2010multilevel]. We acknowledge that SE data may not be normal and concede that there are more advanced statistical methods for analyzing non-normal data (see Section 12). However, we resorted to t-tests and LMMs in this article because they are robust to departures from normality [fagerland2012t, mcculloch2011misspecifying], especially with larger sample sizes (as is the case when the raw data of all the replications are pooled together) [lumley2002importance]. In any case, as model diagnostics procedures are standard across the disciplines, we refer interested readers to specialized literature on the topic [brown2014applied, hox2010multilevel, whitehead2002meta, mcculloch2001generalized].
We use only one illustrative group of replications. Is the analysis procedure applicable in other cases? For reasons of space, we illustrated the application of the analysis procedure to only one group of replications. We admit that this places some limitations on the generalizability of the analysis procedure. However, we tried to select what is, according to the results of a previous SMS addressing this issue, a representative group of SE replications [adrisms]. Accordingly, we expect our guidelines to be applicable to the analysis of a large percentage of groups of SE replications. Additionally, we also provide further references in Section 12 indicating how to analyze groups of replications with different experimental designs and data characteristics.
12 Alternative Experimental Designs
As typical in groups of SE replications, this article analyzes a group of replications where all the replications have an identical experimental design: an AB withinsubjects design (i.e., a design where the participants apply first Treatment A and then Treatment B in a later session [wohlin2012experimentation]). It is straightforward, based on our procedure, to analyze groups of replications where all the experiments have an identical AB betweensubjects design (i.e., a design where the participants are randomly assigned to either Treatment A or B): (1) for IPDS, it is sufficient to remove the repeatedmeasures structure at participant level [whitehead2002meta]; (2) for AD, it is sufficient to adapt the effect size variance formulae to the betweensubjects design [borenstein2011introduction]. Besides, as groups of replications with AB betweensubjects designs are commonplace in medicine, there is no shortage of references indicating how to analyze such designs with both IPDS [debray2015get, whitehead2002meta, fisher2011critical] and AD [borenstein2011introduction].
Whenever groups of replications contain a mixture of AB between-subjects designs and AB within-subjects designs, researchers can still use LMMs with the IPDS approach to provide joint conclusions, because LMMs can account for replications with missing data, provided that they are missing at random (e.g., when it is possible to tell from the experimental design whether the subject data will or will not be missing in a specified experimental session [kwok2008analyzing, hoffman2007multilevel, maas2003multilevel]). AD can also be used for analyzing mixtures of AB between-subjects designs and AB within-subjects designs [morris2002combining]: it is sufficient to calculate a consistent pooled standard deviation for standardizing Cohen’s d (typically the average of the standard deviations of Treatments A and B) and then select the appropriate variance formulae depending upon the experimental design (i.e., a within-subjects or between-subjects design [morris2002combining]). However, as experiments with different experimental designs may be estimating different true effect sizes [morris2002combining], exploratory analyses investigating the difference of results across experimental designs [morris2002combining, thabane2013tutorial] (e.g., by means of a subgroup meta-analysis [borenstein2011introduction]) should be conducted.
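The AD computation for mixed designs can be illustrated with a minimal Python sketch. It assumes the large-sample variance approximations commonly given in meta-analysis textbooks, which may differ in detail from the formulae in [morris2002combining]. The function names, the input values and the within-participant correlation r = 0.5 are hypothetical and serve only as an illustration.

```python
import math


def pooled_sd(sd_a, sd_b, n_a, n_b):
    """Pooled SD of Treatments A and B, weighted by degrees of freedom."""
    return math.sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2))


def cohens_d_between(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Cohen's d (B minus A) and its variance for an AB between-subjects design."""
    d = (mean_b - mean_a) / pooled_sd(sd_a, sd_b, n_a, n_b)
    # Large-sample approximation for two independent groups (assumed formula)
    variance = (n_a + n_b) / (n_a * n_b) + d**2 / (2 * (n_a + n_b))
    return d, variance


def cohens_d_within(mean_a, mean_b, sd_a, sd_b, n, r):
    """Cohen's d (B minus A) and its variance for an AB within-subjects design.

    r is the correlation between each participant's scores under A and B.
    """
    # Same standardizer as in the between-subjects case, so that the
    # effect sizes of both designs are expressed in a consistent metric
    d = (mean_b - mean_a) / pooled_sd(sd_a, sd_b, n, n)
    # Large-sample approximation for paired scores (assumed formula)
    variance = (1 / n + d**2 / (2 * n)) * 2 * (1 - r)
    return d, variance


# Hypothetical summary statistics for two replications
d_bs, v_bs = cohens_d_between(10.0, 12.0, 4.0, 4.0, 20, 20)
d_ws, v_ws = cohens_d_within(10.0, 12.0, 4.0, 4.0, 20, r=0.5)
print(d_bs, v_bs)
print(d_ws, v_ws)
```

With identical means and standard deviations, both designs yield the same d, but the within-subjects variance shrinks as r grows, which is why the variance formula has to match the design of each replication.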
Throughout this article, we analyzed experiments whose response variable was measured on a continuous scale, which is typical in SE [adrisms]. Also, we relied on the statistical tests that we ran being robust to departures from normality [fagerland2012t, mcculloch2011misspecifying]. Still, researchers may question the reliability of their inferences if data largely depart from the normality assumption [kitchenham2016robust]. If this is the case, researchers may resort to data transformation (e.g., Box-Cox transformations) to make the normality assumption hold and then apply IPDS or AD to provide joint conclusions [quinn2002experimental]. Researchers may also resort to more advanced statistical techniques such as bootstrapping to provide inferences with both IPDS [ren2010nonparametric, field2007bootstrapping] and AD [nakagawa2007effect]. Researchers applying AD can also use nonparametric effect sizes such as Cliff’s delta to provide joint conclusions [kitchenham2016robust]. Lastly, if data do not meet the homogeneity of variances assumption, researchers can opt for generalized least squares (GLS) models under the IPDS umbrella (because they can accommodate different variance terms per treatment [zuur2009mixed]) or for Glass’s delta, rather than Cohen’s d, under the AD umbrella [pautz2018use].
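As a minimal sketch of the two alternative effect sizes mentioned above, the following Python code computes Cliff’s delta and Glass’s delta from raw scores. It assumes two independent groups of scores, so a within-subjects design would need a paired variant. The function names and data are hypothetical.

```python
import statistics


def cliffs_delta(treatment, control):
    """Cliff's delta: P(treatment > control) minus P(treatment < control)."""
    greater = sum(1 for t in treatment for c in control if t > c)
    less = sum(1 for t in treatment for c in control if t < c)
    return (greater - less) / (len(treatment) * len(control))


def glass_delta(treatment, control):
    """Glass's delta: mean difference standardized by the control group's SD only."""
    difference = statistics.mean(treatment) - statistics.mean(control)
    return difference / statistics.stdev(control)


# Hypothetical scores for one replication
treatment_scores = [3, 4, 5]
control_scores = [1, 2, 3]
print(cliffs_delta(treatment_scores, control_scores))
print(glass_delta(treatment_scores, control_scores))
```

Cliff’s delta depends only on the ordering of the scores, so it is unaffected by monotonic transformations of the response variable, whereas Glass’s delta avoids pooling two variances that may differ.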
Also, response variables may be measured on a noncontinuous scale (e.g., binary, count, etc. [quinn2002experimental]). In this case, generalized linear mixed models (GLMMs) may be more appropriate than LMMs for analyzing the data with IPDS [quinn2002experimental]. We refer interested readers to Zuur et al. [zuur2009mixed, zuur2013beginner] for an accessible introduction to the topic using the R programming language. Pautz et al. [pautz2018usenon] and Fritz et al. [fritz2012effect] provide illustrative examples of how to calculate effect sizes—to be later combined with AD—for noncontinuous response variables.
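For a binary response variable, one common choice of effect size is the log odds ratio. The Python sketch below is an illustration under our own assumptions and is not necessarily the approach described in the references above: it assumes two independent groups (i.e., a between-subjects design), uses the usual large-sample variance, and adds 0.5 to every cell when a zero count occurs. The function name and the counts are hypothetical.

```python
import math


def log_odds_ratio(success_b, failure_b, success_a, failure_a, correction=0.5):
    """Log odds ratio (Treatment B vs. Treatment A) and its large-sample variance."""
    cells = [success_b, failure_b, success_a, failure_a]
    # Continuity correction: add a constant to every cell if any count is zero
    if any(count == 0 for count in cells):
        cells = [count + correction for count in cells]
    s_b, f_b, s_a, f_a = cells
    log_or = math.log((s_b * f_a) / (f_b * s_a))
    variance = 1 / s_b + 1 / f_b + 1 / s_a + 1 / f_a
    return log_or, variance


# Hypothetical counts: 15 of 20 participants succeed with B, 10 of 20 with A
effect, effect_variance = log_odds_ratio(15, 5, 10, 10)
print(effect, effect_variance)
```

The resulting pair of effect size and variance can then be combined across replications with AD in the same way as a standardized mean difference.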
In the multilevel data structures typical of groups of SE replications, where subjects are nested within replications and can be measured several times (once or more per treatment) throughout the experiment, data are typically correlated within clusters (i.e., clusters of both replications and participants [whitehead2002meta]). This correlation needs to be taken into account when analyzing the data (i.e., by including the clustering units either as fixed factors or as random factors with the selection of appropriate variance-covariance matrices [whitehead2002meta]). Throughout this article, we relied on random factors (for both participants and replications) and on the variance-covariance matrix used by the lme R function when fitting LMMs: the unstructured variance-covariance matrix [finch2016multilevel]. With this variance-covariance matrix, both the treatment and control (e.g., ITL in the illustrative group of replications) effects are assumed to be correlated across the replications [finch2016multilevel]. If this does not hold, however, other variance-covariance matrices may be more suitable for analyzing the group of replications (e.g., one that assumes independent treatment and control effects). For an accessible introduction to the topic, we refer interested readers to Finch et al. [finch2016multilevel] and Zuur et al. [zuur2009mixed].
Finally, missing data can arise in SE experiments due to dropouts or protocol deviators. Under these circumstances, researchers may resort to LMMs under the IPDS umbrella to aggregate results (LMMs can be used to analyze groups of replications with missing data as long as data are missing at random [twisk2013multiple]). They may also rely on imputation methods (e.g., single imputation, multiple imputation, etc. [schafer2002missing]) to analyze the data with AD or with other IPDS models that do not permit missing data (e.g., repeated-measures ANOVA). Whichever procedure is finally selected for handling missing data, sensitivity analyses should be conducted to ensure that the results are robust to the choice of missing data procedure [thabane2013tutorial]. For further details, we refer interested readers to Little et al. [little2012prevention] and Schafer and Graham [schafer2002missing].
13 Related Work
To the best of our knowledge, no previous attempts have been made in SE to provide guidelines for analyzing groups of SE replications when researchers have access to the raw data and firsthand knowledge of the settings and participant characteristics. However, some previous articles have already discussed the suitability of various research synthesis methods for combining published results. For instance, in the late 1990s, Pickard et al. [pickard1998combining] outlined the advantages and disadvantages of meta-analysis of effect sizes [borenstein2011introduction], Fisher’s method (i.e., an aggregation of p-values technique [borenstein2011introduction]) and vote-counting (i.e., a form of narrative synthesis procedure [borenstein2011introduction]) for aggregating the results of a series of case studies. However, Pickard et al. [pickard1998combining] aggregated the results of case studies whose raw data were accessible, because they argued that the effect size that they used (i.e., the Pearson correlation coefficient [borenstein2011introduction]) needed to be computed from the raw data to guarantee the consistency of results across the studies [pickard1998combining]. Ultimately, Pickard et al. also acknowledged that meta-analysis could be performed as long as study reports provided appropriate summary statistics to back-calculate the necessary effect sizes [pickard1998combining].
Ever since, meta-analysis has been tightly coupled in SE with the concept of synthesizing the results of already published studies (i.e., typically with standardized effect sizes such as Cohen’s d and AD) [hayes1999research, miller2000applying, fernandez2007aggregation, shepperd2018role, kitchenham2015evidence]. To do this, researchers should back-calculate appropriate effect sizes from study reports and, if the studies are not very dissimilar, use either fixed-effects or random-effects models for combination [fernandez2007aggregation, hayes1999research, kitchenham2015evidence, sjoberg2007future]. However, disparate advice with regard to the use of meta-analysis can be found in the SE literature. For example, while Pickard et al. [pickard1998combining] acknowledge that meta-analysis is only appropriate when the studies are homogeneous enough, or when the heterogeneity across the studies can be clearly attributed to certain conditions [pickard1998combining], Miller et al. [miller2000applying] indicate that identical studies (e.g., replications using identical materials) may result in "strong correlations" affecting the reliability of the joint conclusions. This view has also been backed up by others in the SE community [kitchenham2008role]. At the same time, and given the commonly heterogeneous results reported in the literature and the myriad variables that typically change across SE studies, Miller et al. [miller2000applying] finally conclude that "…the heterogeneity of current empirical results is a major limitation to our ability to apply meta-analytic procedures…".
Due to the limitations of meta-analysis and the particularities of SE studies, other SE researchers have proposed the use of other aggregation techniques for synthesizing already published empirical study results [fernandez2007aggregation, kitchenham2015evidence, olorisade2013determining]. Briefly, such techniques commonly involve some sort of vote-counting technique (e.g., counting positive vs. negative results, small vs. large results, etc.), or the application of different aggregation techniques (e.g., meta-analysis, vote-counting, etc.) depending upon the characteristics of the studies being aggregated (number of available studies, number of changes made across the studies, etc. [fernandez2007aggregation]).
Similar concerns about the limitations of meta-analysis have also been raised in other disciplines over the years [lau1998summing, ioannidis2008research, gurevitch2018meta]. The overall consensus nowadays seems to be that meta-analysis of effect sizes (i.e., AD) should be preferred over narrative synthesis or vote-counting techniques, at least for aggregating quantitative results [cooper2009relative, borenstein2011introduction, biondi2016umbrella, gurevitch2018meta]. Also, the meta-analysis of raw data (i.e., IPDS) outperforms AD in some circumstances (e.g., in terms of statistical flexibility or for identifying participant-level moderators [bero1995cochrane, stewart2002ipd, fisher2017meta]). Still, the debate about the limitations of meta-analysis techniques and research synthesis is ongoing [gurevitch2018meta].
14 Conclusions
Researchers from different groups and institutions are collaborating on the construction of groups of replications in SE. Applying unsuitable aggregation techniques to analyze groups of SE replications may undermine their potential to provide in-depth insights from experiment results.
We learned about the recommendations and guidelines used to analyze and report groups of replications in mature experimental disciplines such as medicine and pharmacology [bero1995cochrane, anello2005multicentre, lewis1999statistical]. Unfortunately, such guidelines could not be directly imported for the analysis of groups of SE replications because of the noticeable differences between groups of replications in SE and medicine that we came across (i.e., in terms of the number of changes made across the replications, participant heterogeneity, statistical power, etc.).
We designed an analysis procedure with a set of embedded guidelines to analyze the stereotypical group of SE replications [adrisms]. To do this, we adopted the same basic structure typically followed in medicine and pharmacology for analyzing groups of replications. However, we adapted the steps to the characteristics of groups of SE replications and their common limitations with regard to joint data analysis. The analysis procedure that we propose outlines a minimum set of steps that may increase the informativeness of joint conclusions and moderator effects. It boils down to providing appropriate descriptive statistics and visualizations to ease the interpretation and incorporation of results into prospective studies, as well as taking advantage of the raw data to provide joint conclusions and identify moderators. AD and IPDS, both fitted as random-effects models, are crucial for this purpose. Table XII shows a summary of how to use the aggregation techniques proposed in our procedure.
| | AD | IPDS |
|---|---|---|
| Recommended for | Results aggregation; experiment-level moderators | Results aggregation; experiment-level moderators; participant-level moderators |
| When it can be used | RV metrics can be different; complete observations; effect sizes (or raw data to calculate them) are available | RV metrics are identical; allows missing data; raw data are available |
| How it should be used | Fit random-effects models | Use Linear Mixed Models (LMMs); fit random-effects models |
| How it should be used for moderator analyses | Use either subgroup meta-analysis for categorical moderators or meta-regression for continuous moderators; run one analysis per moderator | Fit LMMs with interaction terms; increase statistical significance threshold to 0.1; pay less attention to p-values and focus on effect sizes and 95% CIs |
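To make the recommendation of fitting random-effects models with AD more concrete, the following Python sketch pools per-replication effect sizes. It assumes the DerSimonian-Laird estimator of the between-replication variance, which is only one of several available estimators and is not necessarily the one used in the supplementary R code. The function name and the effect sizes are hypothetical.

```python
import math


def random_effects_meta(effects, variances):
    """Pool effect sizes with a DerSimonian-Laird random-effects model.

    Returns the pooled effect, its standard error, the between-replication
    variance (tau squared) and an approximate 95% confidence interval.
    """
    k = len(effects)
    weights = [1 / v for v in variances]
    total = sum(weights)
    # Fixed-effects estimate, needed to compute the heterogeneity statistic Q
    fixed = sum(w * y for w, y in zip(weights, effects)) / total
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    c = total - sum(w**2 for w in weights) / total
    # Truncate at zero: a negative variance estimate is not meaningful
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    re_weights = [1 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    se = math.sqrt(1 / sum(re_weights))
    return pooled, se, tau2, (pooled - 1.96 * se, pooled + 1.96 * se)


# Hypothetical Cohen's d values and variances from three replications
pooled, se, tau2, ci = random_effects_meta([0.2, 0.5, 0.8], [0.04, 0.04, 0.04])
print(pooled, se, tau2, ci)
```

When tau squared is estimated as zero, the result coincides with the fixed-effects estimate, so the random-effects model can be seen as the more general of the two.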
To wrap up, we encourage SE researchers analyzing groups of replications to justify their aggregation techniques and, more importantly, to transparently report the statistical models and the raw data that they used to provide joint conclusions and identify moderators. With the aim of easing the application of the analysis procedure and with a view to reproducibility, the supplementary material of this article includes the step-by-step commented R code and raw data, together with the associated R notebook, that led to the results reported throughout this article. In addition, we offer a more technical tutorial including R code snippets, dataset descriptions, and mathematical formulae to complement the understanding of the R code that we provide. All the supplementary material is also available at figshare (URL: https://doi.org/10.6084/m9.figshare.7583909.v7). We hope this encourages others to try the analysis procedure.
Acknowledgments
This work was partially funded by the Spanish Ministry of Science, Innovation and Universities research grant PGC2018-097265-B-I00.
Supplementary Material
Supplementary material 1: Raw data (XLSX 13 kb). Supplementary material 2: Characteristics (XLSX 9 kb). Supplementary material 3: R code (R 17 kb). Supplementary material 4: R notebook (Rmd 20 kb). Supplementary material 5: R Tutorial (PDF 316 kb).