A Quality Assessment Instrument for Systematic Literature Reviews in Software Engineering

09/21/2021 ∙ by Muhammad Usman, et al. ∙ BTH

Context: Systematic literature reviews (SLRs) have become standard practice as part of software engineering research, although their quality varies. To build on the reviews, both for future research and industry practice, they need to be of high quality. Objective: To assess the quality of SLRs in software engineering, we put forward an appraisal instrument for SLRs. The instrument is intended for use by appraisers of reviews, but authors may also use it as a checklist when designing and documenting their reviews. Method: A well-established appraisal instrument from research in healthcare was used as a starting point to develop a quality assessment instrument. It is adapted to software engineering using guidelines, checklists, and experiences from software engineering. As a validation step, the first version was reviewed by four external experts on SLRs in software engineering and updated based on their feedback. Results: The outcome of the research is an appraisal instrument for the quality assessment of SLRs in software engineering. The instrument intends to support the appraiser in assessing the quality of an SLR. The instrument includes 15 items with different options to capture the quality. Each item is assessed on a two- or three-grade scale, depending on the item. The instrument also supports consolidating the items into groups, which are then used to assess the overall quality of a systematic literature review. Conclusion: It is concluded that the presented instrument may be a helpful support for an appraiser in assessing the quality of SLRs in software engineering.


1 Introduction

To establish evidence-based practice in software engineering (SE), Kitchenham Kitchenham (2004) proposed the use of systematic literature reviews (SLRs) to identify, appraise and synthesise evidence reported in the scientific literature. Today the method is well-accepted in SE, as illustrated by the rapidly growing number of SLRs published since the introduction of the guidelines Ali and Usman (2018). Furthermore, several individual researchers and research groups (beyond the proposers of the guidelines) are actively conducting and publishing SLRs. Moreover, SLRs are published in almost all top-tier SE journals and conferences and cover topics that encompass all knowledge areas of SE. In summary, SLRs have become a standard practice in SE. A standard practice comes with requirements on, for example, the reliability of SLRs.

Thus, the reliability of conducting SLRs as a method needs to be maintained, ensuring trust in the results. Several researchers have proposed guidelines and checklists to design, conduct and report SLRs. However, relatively little work has been done on the critical appraisal of SLRs. Several researchers have replicated SLRs or applied appraisal instruments to assess the quality of SLRs. Replicating SLRs is a resource-intensive undertaking, i.e., it does not scale. Some replications exist: for example, some researchers have relied on intentional replication MacDonell et al. (2010), and other researchers have taken the opportunity to compare SLRs when two SLRs on the same topic were published Wohlin et al. (2013). In both cases, the objective was to evaluate the reliability of SLRs. An alternative to expensive replication-based evaluation is the use of critical appraisal instruments. Such instruments help assess the quality of an SLR by checking its conformance with best practices and established guidelines. Several SE researchers have used an interpretation of the criteria used by the Centre for Reviews and Dissemination (CRD) at the University of York to include an SLR in their Database of Abstracts of Reviews of Effects (DARE) Centre for Reviews and Dissemination, University of York (2019). However, these questions are insufficient to reveal significant limitations in SLRs, as noted in our previous work Ali and Usman (2018).

Adherence to design and reporting guidelines helps to improve the reliability and transparency of SLRs. However, these guidelines do not provide the user with the means to make a critical judgement of the risk of bias that certain decisions and actions in the design and execution of SLRs may introduce. We exemplify this distinction between the role of reporting checklists and critical appraisal tools with the following example Ali and Usman (2019): a reporting checklist will only ask if a specific aspect is described, e.g., “Are the review’s inclusion and exclusion criteria described?”, whereas a critical appraisal tool would also raise the question of appropriateness. In the case of selection criteria, an appraisal tool poses the question “Are the review’s inclusion and exclusion criteria appropriate?”. Thus, an appraisal tool also considers the quality, i.e., not only that a certain item is present.

In evidence-based medicine, several critical appraisal instruments have been developed and evaluated beyond the DARE criteria. In our previous work, we identified AMSTAR 2 (A MeaSurement Tool to Assess systematic Reviews) Shea et al. (2017) as a candidate tool for adoption in SE. We concluded that the tool would need adaptation for SE. In this paper, we report the process and outcome of adapting AMSTAR 2 for the quality appraisal of SLRs in SE. We call the adapted instrument Quality Assessment Instrument for Software Engineering systematic literature Reviews (abbreviated as QAISER).

Our approach when developing QAISER has several salient features focusing on increasing the reliability of the research outcome. We based our work on a well-accepted and validated instrument as a foundation (i.e., AMSTAR 2) Shea et al. (2017). To ensure an appropriate adaptation to SE, we collected and relied on a comprehensive set of documents with guidelines and best practices for conducting SLRs in SE. We followed a systematic and well-documented process to develop QAISER with several internal validation steps involving multiple researchers. Furthermore, we invited some leading experts in evidence-based SE research to conduct an external validation of QAISER. In each step of the process, QAISER was updated based on the feedback received and internal discussions. The outcome of the process, i.e., QAISER, is the main contribution of the paper.

The remainder of the paper is organised as follows. Section 2 presents an overview of the main critical appraisal instruments used in both SE and evidence-based medicine. Section 3 describes in detail the method used to develop QAISER. Section 4 describes how QAISER evolved into its latest version, which is the main outcome of the research presented. Section 5 presents the QAISER appraisal instrument for SLRs in SE in detail. In Section 6, we reflect on the reliability of QAISER. Section 7 describes the guidance document and shares our reflections about using QAISER on three example SLRs. The threats to validity are presented in Section 8. Section 9 discusses the implications of the results. Section 10 concludes the paper and presents a roadmap for further evaluation and our ambition to support broader adoption of QAISER. Finally, the QAISER instrument is provided in Appendix A (attached as supplemental material), and a guidance document supporting the instrument can be found in Appendix B (also attached as supplemental material).

2 Related work

A prerequisite for a quality appraisal is that we pose the right questions. In the first version of the guidelines for systematic literature reviews in SE, Kitchenham Kitchenham (2004) identified two sets of questions from Greenhalgh Greenhalgh (1997) and Khan et al. Khan et al. (2001) to review any existing SLRs on a topic of interest. In the 2007 update Kitchenham and Charters (2007), Kitchenham and Charters added the CRD Database of Abstracts of Reviews of Effects (DARE) set of four questions to the list Centre for Reviews and Dissemination, University of York (2019). Kitchenham and Charters Kitchenham and Charters (2007) also applied the criteria to SLRs published between 2004 and 2007.

The proposal from Greenhalgh is very general; Khan et al.’s proposal is the most comprehensive, while the DARE criteria are brief and “simple” Kitchenham and Charters (2007). Among these three sets of questions proposed in the guidelines, only the DARE criteria have been widely used in the SE literature.

Kitchenham et al. Kitchenham et al. (2010) provided guidance to answer four of the five questions in the DARE criteria. Cruzes and Dybå Cruzes and Dybå (2011) observed that one of the critical questions regarding synthesis had not been included in the SE guidelines for conducting SLRs and had not been used when evaluating the quality of SLRs in SE. It should be noted that the number of questions in DARE has varied over the years; it has included either four or five questions depending on the version of DARE.

Other researchers have developed their own interpretations of the DARE questions Nurdiani et al. (2016); Ali et al. (2014). A shared limitation of these proposals is the lack of traceability between the proposals and the evidence/best practices used to motivate them.

Other researchers have also been concerned with assessing the quality of SLRs in SE. Dybå and Dingsøyr Dybå and Dingsøyr (2008) reviewed several proposals from evidence-based medicine to assess the quality of SLRs. They concluded that the MOOSE statement Stroup et al. (2000) is a very relevant reporting checklist for SLRs in SE. The MOOSE checklist has six main reporting items: ‘background’, ‘search strategy’, ‘method’, ‘results’, ‘discussion’ and ‘conclusions’. Each item further lists actions and details that should be provided in an SLR.

In a previous study Ali and Usman (2019), we reviewed the proposals for quality assessment for SLRs both from SE and other fields. We concluded that in the SE literature, there is an awareness of reporting checklists like MOOSE, QUOROM, and PRISMA. However, SE researchers have not yet leveraged the progress in critical appraisal tools for systematic reviews.

One essential aspect related to quality assessment is the validity threats presented by authors of SLRs. Ampatzoglou et al. Ampatzoglou et al. (2019) reviewed 100 secondary studies in SE and identified the commonly reported threats to validity and the corresponding mitigation actions. They also proposed a checklist that authors can use to design an SLR with explicit consideration for common validity threats and develop an informed plan for mitigating them. The authors state that readers can also use the checklist to assess the validity of the results of an SLR. The checklist has 22 questions grouped into three categories: study selection validity, data validity, and research validity. Furthermore, for each of the 22 questions, there are 1 to 9 sub-questions.

The checklist by Ampatzoglou et al. Ampatzoglou et al. (2019) encapsulates the current state of research regarding mitigating validity threats. The checklist is also a useful tool to support the design and execution of an SLR. However, we argue that it is not a tool that enables the evaluation of completed SLRs. In this study, we have used their work to develop QAISER.

Given the lack of an appraisal tool adapted for SE, we wanted to leverage experiences from other research fields. Through an analysis of the leading appraisal tools, including ROBIS, AMSTAR, and AMSTAR 2, we identified AMSTAR 2 (A MeaSurement Tool to Assess systematic Reviews) Shea et al. (2017) as a candidate tool for adaptation to SE Ali and Usman (2019). AMSTAR was developed based on a review of available rating instruments, which were consolidated into 11 appraisal items. It has since been extensively used and validated. AMSTAR 2 is a revised version of the tool that takes into account systematically collected community feedback. The major updates in AMSTAR 2 are: (1) the consideration of SLRs that may include non-randomized studies and (2) an increased focus on the risk of bias evaluation.

AMSTAR 2 covers important quality aspects of an SLR that are not included in the DARE criteria, which are mostly used in SE Ali and Usman (2019). AMSTAR 2 consists of 16 appraisal items and their corresponding response options and scale. Figure 1 annotates an example of an item, response, and scale from AMSTAR 2.

Figure 1: Items, responses and scale in AMSTAR 2.

Based on an analysis of related work, it was decided to use AMSTAR 2 as a basis for proposing a quality assessment instrument tailored for SE.

3 Method

This section describes the four-step process we used to develop QAISER (see Figure 2 for an overview). In the first step, we identified aspects from the evidence-based software engineering (EBSE) literature relevant for inclusion in QAISER. In the second step, we adapted AMSTAR 2 for SE by customizing its items and responses. In the third step, we combined the outputs of the previous two steps by integrating the EBSE aspects into QAISER. Finally, in the fourth step, we validated QAISER by inviting external experts to evaluate its completeness, understandability, and relevance of its items and responses for SE.

Each step is further elaborated below and the details of each step are also illustrated in Figures 3 – 6.

Figure 2: Overview of the QAISER development process.

Step 1: Identifying relevant aspects from the EBSE literature

In this step, we aimed to complement AMSTAR 2 with the relevant work from the EBSE literature. We followed a systematic approach to identify and analyze the relevant EBSE work (see Figure 3).

Figure 3: Step 1: Identifying relevant aspects from EBSE literature.

We started with analyzing a closely related and recent tertiary study on validity threats in SLRs in software engineering by Ampatzoglou et al. Ampatzoglou et al. (2019). They have aggregated validity threats and corresponding mitigating actions in the form of a checklist as described above in Section 2. We analyzed their checklist to identify aspects that are covered or missing in AMSTAR 2 Shea et al. (2017).

Molléri et al. Molléri et al. (2019) recently proposed a Catalog for Empirical Research in Software Engineering (CERSE) based on a systematic mapping study of 341 methodological papers that were identified using a combination of manual and snowballing search strategies. CERSE includes available guidelines, assessment instruments, and knowledge organization systems for empirical research in software engineering. To identify additional relevant articles that are not covered by Ampatzoglou et al. Ampatzoglou et al. (2019) in their tertiary study, we selected 74 articles from CERSE that are related to systematic literature reviews (SLRs) and systematic mapping studies (SMSs). We obtained the source file containing the basic information (title of the paper, publication venue, etc.) for these 74 articles from the first author of CERSE Molléri et al. (2019). The first two authors independently reviewed these 74 articles to identify studies that propose or evaluate guidelines for conducting SLRs and SMSs in SE. Later, in a meeting, the first two authors developed a complete consensus on all 74 studies. The list of identified studies included, among others, the latest version of the guidelines by Kitchenham et al. Kitchenham et al. (2015), the guidelines for mapping studies by Petersen et al. Petersen et al. (2015) and the guidelines for snowballing by Wohlin Wohlin (2014a). After including these three guidelines in our list of additional EBSE sources, we removed studies that were already covered in these guidelines Kitchenham et al. (2015); Petersen et al. (2015); Wohlin (2014a).

Step 2: Using AMSTAR 2 as a source of inspiration for SE

The first two authors jointly analyzed AMSTAR 2 to identify items that are relevant for SE. As a validation, the third author independently reviewed the list of relevant and non-relevant items identified by the first two authors. Next, the first two authors adapted the response options for SE, for example, by replacing the medicine-specific options with the appropriate SE options. The adapted response options were also reviewed independently by the third author. After discussions, we achieved complete consensus between all three authors on all changes in items and response options.

Figure 4: Step 2: Using AMSTAR 2 as a source of inspiration for SE.

Step 3: Integrating EBSE aspects

Using the outputs of the previous steps and, in particular, the relevant EBSE literature identified in Step 1, the first two authors developed the first draft of QAISER. They also prepared a guidance document to support QAISER users in applying the instrument. The third author independently reviewed the instrument and the guidance document to validate their contents, i.e., to check that no relevant aspect was missed. The independent review helped improve the formulations and remove some inconsistencies in the instrument and the guidance document. However, it did not result in any significant change in the instrument.

Figure 5: Step 3: Integrating EBSE aspects.

Step 4: Validating QAISER

In this step, QAISER was reviewed by four leading experts in EBSE to validate the appropriateness of its items and to reflect on its completeness (i.e., to identify if some aspects are missing) and understandability. In addition to QAISER and the guidance document, we prepared the following two documents to conduct the validation step systematically (see Figure 6 for details about the validation step):

  • A task description document: It described the steps that the external experts were asked to perform while reviewing QAISER. The task description document provided space where experts could enter their feedback on each QAISER item.

  • A process description document: It briefly described the process we used to create QAISER.

Before the external validation, we performed a pilot validation with a senior colleague at our department who has experience of participating in multiple SLRs. The colleague reviewed all four documents mentioned above (i.e., task description, process description, QAISER, and the guidance document) and provided written feedback. We also conducted a follow-up interview (one hour, face-to-face) to discuss the feedback in detail and to ensure a shared understanding. We revised the task description and also the instrument based on the feedback collected during the pilot step. Most of the changes resulted in revised formulations. We shared the revised documents with our colleague and achieved consensus on the clarity of the task description and the completeness and appropriateness of QAISER.

Next, we used the same approach with the external experts as we followed during the pilot. After obtaining the written feedback and performing the interviews (approximately one hour each, conducted remotely) with all four external experts, we analyzed the comments to identify the changes that should be made in QAISER. A revised version of QAISER (the one presented in Appendix A in the supplemental material) and a summary of their feedback and our actions were then sent to the external experts.

Figure 6: Step 4: Validating QAISER.

4 Details of conducting the study

In this section, we present details of how we applied the process described in Section 3 and our justifications for the proposed changes to AMSTAR 2 while adapting it for SE.

4.1 Development of QAISER V0

In Step 1 of the process described in Section 3, we identified and selected four sources (see Ampatzoglou et al. (2019); Kitchenham et al. (2015); Petersen et al. (2015); Wohlin (2014a)), in addition to DARE Centre for Reviews and Dissemination, University of York (2019), from the EBSE literature to identify the relevant aspects for QAISER. Later, based on the suggestions of the external experts, we also included two more sources for identifying relevant aspects for QAISER. The two additional sources were a framework for an automated-search strategy to improve the reliability of searches in SLRs Ali and Usman (2019), and a tertiary study describing lessons learned about reporting SLRs Budgen et al. (2018).

We now present the adaptation of AMSTAR 2 for SE based on the procedure detailed in Steps 2 and 3 of our method (see Section 3). Overall, at this stage in the process, we made two major changes to AMSTAR 2. The first change relates to the removal of existing AMSTAR 2 items: one item was excluded, and two items were replaced with a general item that is more appropriate for SE. The second change concerns the addition of an item.

In terms of removed items, three AMSTAR 2 items were not included in QAISER as these were not deemed relevant to SLRs in SE. The details for Items 1, 11 and 12 are described later in this section. In summary, AMSTAR 2 Item 1 is about using PICO (Population, Intervention, Comparator group, Outcome) components in research questions and selection criteria, while Items 11 and 12 are about meta-analysis, which is not commonly conducted in SE SLRs. We replaced these two items with a more general item about synthesis (see QAISER Item 11 in Appendix A). The new item checks if the included studies are synthesized or not. Synthesis of the included studies is one of the essential steps in an SLR Ampatzoglou et al. (2019); Budgen et al. (2018); Centre for Reviews and Dissemination, University of York (2019); Kitchenham et al. (2015).

The addition of one item is due to the following. Item 5 in AMSTAR 2 checks if the study selection is performed independently by at least two authors of the review. Item 6 checks the same aspect about the data extraction process. However, no item in AMSTAR 2 checks if the quality assessment is performed independently by at least two persons. We introduced an additional item to cover this aspect, i.e., to see if the quality assessment is performed independently by at least two authors of the review (see QAISER Item 10 in Appendix A).

We now describe in detail why and what changes were made to each item in AMSTAR 2.

Item 1. “Did the research questions and inclusion criteria for the review include the components of PICO?”
In SE, researchers are not widely using PICO when developing research questions and inclusion criteria. It is also not part of the revised guidelines Kitchenham et al. (2015).
Changes: This item is not relevant for SE and was excluded from QAISER.

Item 2. “Did the report of the review contain an explicit statement that the review methods were established prior to the conduct of the review and did the report justify any significant deviations from the protocol?”
We identified no need to make any change at the item level. However, the following issues in the response options were noted:

  1. The response options for ’Partial Yes’ lack several aspects that are part of the protocol template included in the revised guidelines Kitchenham et al. (2015). The missing aspects include description of the need for the review, data extraction process, synthesis process, threats to validity of the review, deviations from the protocol and the corresponding justifications for such deviations, and details of conducting the review.

  2. Under ’Partial Yes’, the authors are only required to state that they had a written protocol. The authors should also make the protocol publicly accessible and describe where and how it can be accessed Kitchenham et al. (2015).

  3. One of the response options uses the term “risk of bias assessment”. In SE, the more commonly used term is quality assessment.

Changes: Based on the analysis, the response options were modified as follows to adapt them for SE:

  1. The missing response options under ’Partial Yes’ were added.

  2. In the revised item, the authors are also required to make the protocol accessible and state how and where it can be accessed.

  3. The risk of bias related response option was rephrased as quality assessment.

Item 3. “Did the review authors explain their selection of the study designs for inclusion in the review?”
Most reviews in SE include different types of empirical studies. Thus, it is not relevant to ask for a justification for including all types of study designs. Furthermore, the study design is only one of the criteria for including or excluding studies from an SLR. Therefore, the item should address the larger aspect of the appropriateness of the inclusion and exclusion criteria. Reporting of the inclusion and exclusion criteria is also part of the criteria used by the Centre for Reviews and Dissemination at the University of York to include an SLR in their Database of Abstracts of Reviews of Effects (DARE) Centre for Reviews and Dissemination, University of York (2019). Also, reporting of the inclusion and exclusion criteria and the relevant justifications is part of the guidelines Kitchenham et al. (2015) and other EBSE literature as well Ampatzoglou et al. (2019).
Changes: The item is revised as follows for SE: Are the review’s inclusion and exclusion criteria appropriate?
To get a ’Yes’ score for the revised item, the review should have reported the inclusion and exclusion criteria and provided justifications for any restrictions used in the criteria.

Item 4. “Did the review authors use a comprehensive literature search strategy?”
We identified the following issues in the response options:

  1. The response options treat database search as the main search method while snowballing is partially addressed as an additional search method. In the revised guidelines for performing SLRs in SE Kitchenham et al. (2015), database and snowballing searches are included as alternative search strategies. Both strategies have been used in SE SLRs and have their own guidelines and best practices (cf. Wohlin (2014a); Ali and Usman (2018); Kitchenham et al. (2015)). In the current form, only the database search strategy could be assessed as comprehensive.

  2. The response option related to the publication restrictions is more relevant to the inclusion and exclusion criteria.

  3. Furthermore, two other response options are not used in SE SLRs: the first one is about searching in the study registries, while the second one is about conducting the search within 24 months of completion of the review.

Changes: We introduced the following three changes:

  1. Two groups of response options were created: the first for when a database search is used as the main search method, and the second for when a snowballing search is used as the main search method (see QAISER Item 4 in Appendix A for details about the two groups of response options).

  2. The response option related to the publication restrictions is moved to Item 3 (see Appendix A).

  3. The two response options (searching in registries and search within last 24 months) were not included in QAISER.

Item 5. “Did the review authors perform study selection in duplicate?”
We noted that:

  1. The phrase “in duplicate” is not a commonly used term in SE and is therefore not self-explanatory. Furthermore, the item does not specify if the study selection is performed on the full text or on the titles and abstracts.

  2. The first response option, in which all studies are reviewed independently by at least two authors of the review, does not require the agreement level to be reported. Reporting of the agreement level would increase the transparency of the study selection process.

  3. In the second response option, it is permitted that only a sample of the studies are independently reviewed by at least two authors of the review. The reliability of the study selection process is compromised if only a small sample of studies is reviewed by more than one author of the review. In particular, the excluded studies pose a threat to validity if a single person excludes them.

Changes: Three changes were introduced to address these observations:

  1. The item was rephrased to clarify the focus on the independent study selection and that the initial study selection is based on titles and abstracts. The revised formulation is: “Did the authors of the review independently perform study selection based on titles and abstracts?”

  2. At the end of the first response option, the following text is added to make it necessary to report the agreement level as well: “… and reported the agreement level”.

  3. At the end of the second response option, the following text is added to make it compulsory to have the excluded studies reviewed by at least two authors: “however, all excluded studies must be reviewed by at least two authors of the review”.

Item 6. “Did the review authors perform data extraction in duplicate?”
As in the previous item, the phrase “in duplicate” is not self-explanatory.
Changes: The item was rephrased in QAISER as follows: “Did at least two authors of the review independently perform data extraction?”

Item 7. “Did the review authors provide a list of excluded studies and justify the exclusions?”
The item is about studies that were excluded after reading the full text. However, the item text does not make clear that it concerns studies that were read in full text, and not those that were excluded based on the screening of titles and abstracts.
Changes: The item was rephrased to indicate that it is about those studies that were read in full text. In the revised formulation, the following phrase is added at the end of the item text: “…for the papers read in full text?”

Item 8. “Did the review authors describe the included studies in adequate detail?”
We did not note any issues in the item. However, the response options about intervention and outcomes may not be relevant to all SLRs in SE. In SE, not all SLRs would be about interventions and outcomes. The included studies in an SLR may not have investigated any interventions.
Changes: In the response options about interventions and outcomes, the phrase “when applicable” is added to explain that the review needs to describe only the relevant information about included studies.

Item 9. “Did the review authors use a satisfactory technique for assessing the risk of bias (RoB) in individual studies that were included in the review?”
We noted the following:

  1. In SE SLRs, the concept of quality assessment is used, instead of RoB, to refer to the quality assessment of the individual studies. A variety of quality assessment instruments have been developed and used to assess the quality of the different types of empirical studies in software engineering Kitchenham et al. (2015). The focus in SE is on using relevant quality assessment instruments.

  2. The current response options are not relevant to SE. Furthermore, since we suggest changing the focus of the item to the quality assessment instrument, the response options should also be revised accordingly to check the completeness and relevance of the questions in the quality assessment instrument.

Changes: We introduced the following changes:

  1. We revised the item to emphasize whether or not the review authors have provided an explanation for their selection of the quality assessment instrument. The item is revised as follows: “Did the review authors explain their selection of quality assessment instrument?”

  2. With regards to the response options under the revised item, for ’Yes’, the review authors should have selected an appropriate quality assessment instrument for different types of studies included in the review. Furthermore, the instrument needs to have questions about study goals, research questions, appropriateness of the study design, data collection, and analysis methods. The instrument should also have question(s) about the study findings and the supporting evidence, and the extent to which the findings answer the research questions. We refer to the instrument in Appendix A for the specific response options for this item in QAISER.

Item 10. “Did the review authors report on the sources of funding for the studies included in the review?”
This item focuses only on the sources of funding for individual studies. Funding is one of the issues that could result in a conflict of interest. In some cases, the authors of the individual studies might have some other conflict of interest in favor of or against the topic or intervention they are investigating in their studies.
Changes: The item is revised to include any other conflict of interest besides funding sources. Conflict of interest is inserted in the item text as follows: “Did the review authors report on the sources of funding and any other conflict of interest for the studies included in the review?”

Item 11. “If meta-analysis was performed did the review authors use appropriate methods for statistical combination of results?”
Meta-analysis studies are very rare in software engineering due to the lack of randomized controlled trials. Therefore, this item is not relevant to the majority of the SE SLRs.
Changes: This item is removed from the adaptation of AMSTAR 2 for SE. We have instead included a more general item about synthesis (Item 11 in QAISER, see Appendix A).

Item 12. “If meta-analysis was performed, did the review authors assess the potential impact of RoB in individual studies on the results of the meta-analysis or other evidence synthesis?”
As discussed with Item 11 above, meta-analysis is not common in SE SLRs. This item is removed from the adaptation of AMSTAR 2 for SE. However, it is important to note that considering the impact of the quality of individual studies while interpreting the results is still covered in the next item.

Item 13. “Did the review authors account for RoB in individual studies when interpreting/ discussing the results of the review?”
We noted the following:

  1. Instead of RoB, the SE community uses the notion of quality assessment more commonly.

  2. The first response option deals with the inclusion of high-quality randomized controlled trials (RCTs). Since in SE, RCTs are not common, the focus should be on high-quality studies.

  3. The second response option includes the requirement of discussing the impact of RoB on results. For SE, the focus has been on categorizing the analysis and interpretation of results based on study quality Kitchenham et al. (2015).

Changes: The following changes were introduced:

  1. In the item description, the RoB is replaced with quality of individual studies.

  2. In the first response option, the phrase “high quality RCTs” is replaced with “high quality studies”.

  3. The second response option is revised to focus on the categorization of the analysis and interpretation of results based on study quality.

Item 14. “Did the review authors provide a satisfactory explanation for, and discussion of, any heterogeneity observed in the results of the review?” (Footnote: Heterogeneity occurs when the results are not consistent across studies. For example, different studies provide conflicting evidence for or against a software engineering intervention. It is important to investigate the causes of such inconsistent results before drawing any conclusions in such cases.)
We identified no need for adaptation to SE in this item.

Item 15. “If they performed quantitative synthesis did the review authors carry out an adequate investigation of publication bias (small study bias) and discuss its likely impact on the results of the review?”

  1. The item is limited to quantitative synthesis. In SE, qualitative synthesis is used more frequently in SLRs. Discussing publication bias and its impact on the results is important, regardless of the type of synthesis performed, quantitative or qualitative.

  2. The response option includes a requirement to carry out graphical or statistical tests as well. The main aspect to cover in this item should be to check if the authors of the review have discussed publication bias and its potential impact on review results.

Changes: We introduced the following changes:

  1. The item is made more general by removing the word quantitative while also adapting its formulation for SE.

  2. The response option is also revised accordingly, i.e., removing the reference to the graphical or statistical tests. The revised response option aims to check if the publication bias and its impact on the results are discussed or not.

Item 16. “Did the review authors report any potential sources of conflict of interest, including any funding they received for conducting the review?”
We identified no need for adaptation to SE in this item.

We call the resulting instrument, which systematically adapts AMSTAR 2 for SE and supplements it with SE guidelines and evidence, QAISER V0. This version was used in Step 4 (see Section 3 for details of the process) for further validation.

4.2 Changes in QAISER during the validation step

This section presents the changes made in QAISER V0 based on the feedback collected during the pilot and external validation steps.

Besides several editorial changes, the pilot validation resulted in the following two main changes in QAISER V0:

  1. Addition of a new item on the need for undertaking the review (see QAISER Item 1 in Appendix A): In QAISER V0, establishing the need for undertaking a review was listed as one of the response options to score ’Partial Yes’ under Item 1. During the discussions in the pilot validation, we agreed with the senior colleague to give more importance to the step of establishing the need for the review. The number of SLRs performed in SE is increasing every year. At times, there are multiple SLRs on the same topic. Thus, there is a need to establish whether it is relevant to undertake a new SLR on a topic Mendes et al. (2020); Kitchenham et al. (2015); Petersen et al. (2015). The authors of the review should justify the need for undertaking the review. To score ’Yes’ on this new item in QAISER, the review should have 1) discussed the related existing reviews (if any) and established the need for another review by highlighting the gap, or 2) established the need to aggregate the evidence on a topic, if there exist no reviews on the topic.

  2. Addition of a new response option under the synthesis-related item in QAISER V0 (Item 11): Agreeing with the suggestion of the senior colleague, we added another response option under Item 11 in QAISER to check how effectively the authors of the review have linked the answers and interpretations with the data extracted from the primary studies. The new response option reads: “Provided a clear trace linking the answers of review questions and interpretations to the data from the included primary studies”.

The revised QAISER, after the pilot validation step, was shared with the external experts for further validation. The external experts provided several improvement suggestions. We provide a summary of the main suggestions related to items and response options in the following:

  • Introduce an item about recommendations: SLRs are supposed to provide evidence-based input to practitioners and researchers to aid them in making informed decisions. QAISER did not have any item that specifically covered this aspect. The external experts suggested including an item that checks if the review provides appropriate recommendations and conclusions based on the review results. Agreeing with the external experts’ suggestion, we added a new item about recommendations and conclusions in QAISER (see QAISER Item 14 in Appendix A).

  • Remove the item about sources of funding (see AMSTAR 2 Item 10 described in Section 4.1): The item deals with the reporting of the sources of funding for the included studies. The external experts suggested removing it, as they did not find it relevant in the SE context. We removed this item from QAISER.

  • Introduce ’Partial Yes’ scale: Some items (Items 1, 5, 6, and 10) had a binary Yes/No scale. The external experts suggested introducing a third scale value of ’Partial Yes’ to make them more flexible. We introduced a ’Partial Yes’ option under these items and included the minimum acceptable requirements as response options (see QAISER Items 1, 5, 6, and 10 in Appendix A).

  • Quality focus: Assessing SLRs is not only about the presence or absence of an aspect; it is largely a subjective judgment concerning decisions and measures taken by the authors. To incorporate this suggestion, we introduced qualifiers such as adequately, reliable, and appropriate in several items to better capture this subjective nature.

  • Modifications to the protocol-related item (see AMSTAR Item 2 described in Section 4.1): The external experts suggested simplifying the response options for the ’Partial Yes’ scale. We moved justification of any deviations from the protocol from ’Partial Yes’ to the ’Yes’ scale. Furthermore, threats to validity and details of conducting the review were removed from the ’Partial Yes’ scale. We also removed a response option about heterogeneity from the ’Yes’ scale. It was not deemed a necessary part of a protocol by the experts (see the revised description of QAISER Item 2 in Appendix A).

  • Modifications to the heterogeneity-related item (see AMSTAR Item 14 described in Section 4.1): The external experts did not find this item to be essential for the systematic reviews in software engineering. The item is more relevant for meta-analysis studies, which are not common in software engineering. We replaced the heterogeneity concept with the characteristics of the primary studies. Some differences in the results of the primary studies may be due to the variations in the studies’ characteristics, e.g. if the participants in different studies are students or practitioners. Therefore, in the case when there are differences in the results of the primary studies, the authors of the review should perform an analysis to see if the differences are due to the variations in the primary studies’ characteristics.

5 Results

In this section, we present the main outcomes of our study, i.e., QAISER after the validation step. We also describe how QAISER can be used to appraise an SLR.

QAISER aims to support appraisers of SLRs in SE by raising important questions about the reliability and the relevance of an SLR. Furthermore, by providing evidence-based and accepted best practices in software engineering research (i.e., established expectations in the SE field of a high quality SLR), it supports the judgement of the conformance and the likely impact of non-conformance on the reliability and relevance of an SLR.

The quality aspects of concern and related criteria in QAISER are based on available evidence and recommendations in the SE literature. Therefore, the availability of evidence and the specificity of guidelines is also reflected in the criteria used in QAISER. Thus, the responses in the instrument range from specific/concrete actions to broader/general suggestions/guidelines. QAISER supports appraisers in making a judgement about the overall reliability and relevance of an SLR.

QAISER has three levels of judgement: item level, group level and SLR level. The three levels are described in the following subsections. It should be noted that AMSTAR 2 does not include these levels. The levels are introduced to support the appraiser in moving towards an overall assessment of an SLR. Table 1 presents the groups and the items of QAISER, while the complete instrument is presented in Appendix A (see supplemental material).

Group | Item description and the relevant sources/references

1. Motivation
  Item 1: Did the authors of the review adequately justify the need for undertaking the review? Kitchenham et al. (2015); Petersen et al. (2015); Ampatzoglou et al. (2019)
2. Plan
  Item 2: Did the authors of the review establish a protocol prior to the conduct of the review? Shea et al. (2017); Ampatzoglou et al. (2019); Kitchenham et al. (2015)
3. Identification and selection
  Item 4: Did the authors of the review use a comprehensive literature search strategy? Shea et al. (2017); Kitchenham et al. (2015); Ampatzoglou et al. (2019); Centre for Reviews and Dissemination, University of York (2019)
  Item 3: Are the review’s inclusion and exclusion criteria appropriate? Centre for Reviews and Dissemination, University of York (2019); Kitchenham et al. (2015); Ampatzoglou et al. (2019)
  Item 5: Did the authors of the review use a reliable study selection process? Shea et al. (2017); Kitchenham et al. (2015); Petersen et al. (2015)
  Item 7: Did the authors of the review provide a list of excluded studies, along with the justifications for exclusion, that were read in full text? Shea et al. (2017); Kitchenham et al. (2015)
4. Data collection and appraisal
  Item 6: Did the authors of the review use a reliable data extraction process? Shea et al. (2017); Kitchenham et al. (2015); Ampatzoglou et al. (2019); Petersen et al. (2015)
  Item 8: Did the authors of the review provide sufficient primary studies’ characteristics to interpret the results? Shea et al. (2017); Kitchenham et al. (2015); Ampatzoglou et al. (2019); Centre for Reviews and Dissemination, University of York (2019)
  Item 9: Did the authors of the review use an appropriate instrument for assessing the quality of primary studies that were included in the review? Shea et al. (2017); Kitchenham et al. (2015)
  Item 10: Did the authors of the review use a reliable quality assessment process? Kitchenham et al. (2015)
5. Synthesis
  Item 11: Were the primary studies appropriately synthesized? Centre for Reviews and Dissemination, University of York (2019); Kitchenham et al. (2015); Ampatzoglou et al. (2019); Budgen et al. (2018)
  Item 12: Did the authors of the review account for quality of individual studies when interpreting/discussing the results of the review? Shea et al. (2017); Kitchenham et al. (2015); Ampatzoglou et al. (2019)
  Item 13: Did the authors of the review account for primary studies’ characteristics when interpreting/discussing the results of the review? Shea et al. (2017); Kitchenham et al. (2015)
6. Recommendations and conclusions
  Item 14: Did the authors of the review provide appropriate recommendations and conclusions from the review? Budgen et al. (2018)
7. Conflict of interest
  Item 15: Did the authors of the review report their own potential sources of conflict of interest, including any funding they received for conducting the review? Shea et al. (2017); Kitchenham et al. (2015)
Table 1: QAISER items and groups.
Group | Item rating (Yes / Partial Yes / No) | Impact | Comments

1. Motivation
  Item 1 (need):
2. Plan
  Item 2 (protocol):
3. Identification and selection
  Item 4 (search):
  Item 3 (selection criteria):
  Item 5 (selection process):
  Item 7 (excluded studies):
4. Data collection and appraisal
  Item 6 (data extraction):
  Item 8 (study characteristics):
  Item 9 (quality criteria):
  Item 10 (quality assessment process):
5. Synthesis
  Item 11 (synthesis):
  Item 12 (considered study quality):
  Item 13 (considered study characteristics):
6. Recommendations and conclusions
  Item 14 (recommendation):
7. Conflict of interest
  Item 15 (their own):
Table 2: QAISER: group level assessment.
Confidence | Basis: weaknesses in groups 2. Plan, 3. Identification and selection, 4. Data collection and appraisal, 5. Synthesis, 6. Recommendations and conclusions, and 7. Conflict of interest

“critically low” – major weaknesses in groups 3, 4 and 5
“low” – major weaknesses in at most two of groups 3, 4 and 5, along with major weaknesses in groups 2 and 6
“moderate” – no major weakness in groups 3, 4, 5 and 7, but a major weakness in group 2 or 6
“high” – only minor weaknesses in at most two of groups 3, 4 and 5, and only a few minor weaknesses in groups 2, 6 and 7
Table 3: Judging the confidence in the results of an SLR.
Relevance | Basis: weaknesses in groups 1. Motivation and 6. Recommendations and conclusions

“critically low” – major weaknesses in both groups 1 and 6
“low” – major weakness in either group 1 or 6
“moderate” – minor weaknesses in both groups 1 and 6
“high” – only a minor weakness in group 6
Table 4: Judging the relevance of an SLR.

5.1 QAISER: item level assessment

The first level comprises 15 items formulated as questions. These questions are ordered to reflect the sequence of phases in the design, conduct, and reporting of a typical SLR. The criteria to meet the questions on the item level are stated in the form of acceptable responses for each of the questions. All items are evaluated on a scale with two values (Yes/No) or three values (Yes/Partial Yes/No), i.e., an assessment of the extent to which an SLR under review fulfils the stated criteria.

Each item in QAISER is formulated with the objective that it is self-contained and self-explanatory. However, there is an accompanying guidance document (Appendix B in the supplemental material) with a more detailed description of the items and their responses. We recommend reading the guidance document before applying QAISER, at least the first time the instrument is used.
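To make the item-level structure concrete, the sketch below models QAISER items and the Yes/Partial Yes/No rating scale in Python. It is an illustration only: the class, field, and variable names are ours, the item wordings are abbreviated to the short labels used in Table 2, and Appendix A remains the authoritative definition of the items, their scales, and their response options.

    from dataclasses import dataclass
    from enum import Enum


    class Rating(Enum):
        """Item-level ratings used in QAISER (some items use only Yes/No)."""
        YES = "Yes"
        PARTIAL_YES = "Partial Yes"
        NO = "No"


    @dataclass(frozen=True)
    class Item:
        """One QAISER appraisal item (question wording abbreviated)."""
        number: int          # item number as in Table 1
        group: str           # group name as in Table 1
        short_name: str      # short label as in Table 2


    # A few items for illustration; Table 1 lists all 15.
    EXAMPLE_ITEMS = [
        Item(1, "Motivation", "need"),
        Item(4, "Identification and selection", "search"),
        Item(11, "Synthesis", "synthesis"),
    ]

    # An appraiser's assessment is then a mapping from item number to rating.
    example_assessment = {1: Rating.YES, 4: Rating.PARTIAL_YES, 11: Rating.NO}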

5.2 QAISER: group level assessment

The external experts also provided a suggestion about clarifying the flow and sequence of the items in QAISER. To make the flow of the items more explicit and understandable, and to aggregate individual items into logical clusters, we organized the 15 QAISER items into seven groups corresponding to the process and outcome of an SLR (see the first column in Table 1): (1) motivation to conduct a review, (2) plan and its validation, (3) identification and selection, (4) data collection and quality appraisal, (5) synthesis, (6) recommendations and conclusions, and (7) conflict of interest.

At the group level, the assessment results on the item level are used as indicators for major and minor weaknesses based on their impact on the reliability and relevance of an SLR. Having completed the assessment of individual QAISER items, an appraiser should reflect on the impact of the weaknesses on the reliability and relevance of the SLR at the group level. Groups 1, 2, 6, and 7 consist of single items only, and are therefore relatively simple to reflect upon. A ”No” rating on the corresponding items of these four groups indicates a major weakness at the group level. Groups 3, 4, and 5 consist of multiple items and are more complex to reflect upon. The appraisers should make an overall assessment after considering the ratings of all items in the groups. As a rule of thumb, we recommend that all items receiving a “No” should be considered as hinting at a major weakness in the group being assessed.
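As an illustration of the group-level reflection described above, the following sketch aggregates the ratings of a group's items into a weakness indication. The rule that a “No” hints at a major weakness is the rule of thumb stated in the text; treating a “Partial Yes” as a minor weakness is our own assumption, and all names are illustrative rather than part of QAISER.

    from enum import Enum


    class Rating(Enum):
        YES = "Yes"
        PARTIAL_YES = "Partial Yes"
        NO = "No"


    class Weakness(Enum):
        NONE = "no weakness"
        MINOR = "minor weakness"
        MAJOR = "major weakness"


    def group_weakness(item_ratings: list[Rating]) -> Weakness:
        """Indicative group-level reflection over the ratings of a group's items.

        Rule of thumb from the text: any 'No' hints at a major weakness.
        Assumption (ours): a 'Partial Yes' is read as a minor weakness.
        The appraiser still makes the overall judgement, considering the
        impact of the weaknesses on the reliability and relevance of the SLR.
        """
        if Rating.NO in item_ratings:
            return Weakness.MAJOR
        if Rating.PARTIAL_YES in item_ratings:
            return Weakness.MINOR
        return Weakness.NONE


    # Example: group 3 (identification and selection) with Items 4, 3, 5 and 7.
    print(group_weakness([Rating.YES, Rating.YES, Rating.PARTIAL_YES, Rating.YES]))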

5.3 QAISER: SLR level assessment

By progressively building on the first two levels, an appraiser judges the overall reliability and relevance of an SLR at the SLR level. This is done by considering the impact of weaknesses in the related groups, i.e., relevance (mainly two groups: motivation, and recommendations and conclusions) and reliability (mainly the following five groups: plan, identification and selection, data collection and appraisal, synthesis, and conflict of interest).

Once the reflection on the group level is complete, an appraiser should use it to provide the assessment about the two aspects at the SLR level as follows:

  1. Reliability of an SLR: The reliability of an SLR is assessed by rating the overall confidence in the results of an SLR as: high, moderate, low or critically low. Apart from group 1, all other groups are relevant while considering the confidence in the results of an SLR. As a rule of thumb, we recommend that the confidence in the SLRs with major weaknesses in groups 3, 4, and 5 should be considered “critically low”.

    Table 3 provides guidance for interpreting weaknesses observed at the group level to select a confidence rating at the SLR level.

  2. Relevance of an SLR: The relevance of an SLR is also rated as high, moderate, low or critically low. Groups 1 and 6 are considered when making a judgement about the relevance of an SLR. As a rule of thumb, we recommend that the relevance of an SLR be judged to be “critically low” if there are major weaknesses in both groups 1 and 6. Table 4 provides guidance in selecting a relevance rating based on the weaknesses in groups 1 and 6.

The appraisers can and should decide which groups of items are more/less critical for the specific review being assessed. The guidance provided in Tables 3 and 4 is one such recommendation for using group-level assessment in Table 2 for assessing the reliability and relevance of an SLR.
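To show how the guidance in Tables 3 and 4 could be read as decision rules, the sketch below encodes them over group-level weakness summaries. This is a simplified, non-authoritative encoding: the thresholds follow the tables, but as noted above the tables are recommendations, appraisers may weigh groups differently for a specific review, and combinations the tables do not cover are left to the appraiser. All names are ours.

    from enum import Enum


    class Weakness(Enum):
        NONE = 0
        MINOR = 1
        MAJOR = 2


    # Group numbers follow Table 1: 1 Motivation, 2 Plan,
    # 3 Identification and selection, 4 Data collection and appraisal,
    # 5 Synthesis, 6 Recommendations and conclusions, 7 Conflict of interest.


    def confidence(groups: dict[int, Weakness]) -> str:
        """Confidence in the results of an SLR, loosely following Table 3."""
        core = [groups[g] for g in (3, 4, 5)]
        major_core = sum(w is Weakness.MAJOR for w in core)
        minor_core = sum(w is Weakness.MINOR for w in core)
        if major_core == 3:
            return "critically low"      # major weaknesses in groups 3, 4 and 5
        if (1 <= major_core <= 2 and groups[2] is Weakness.MAJOR
                and groups[6] is Weakness.MAJOR):
            return "low"                 # <=2 major core weaknesses plus groups 2 and 6
        if major_core == 0 and groups[7] is not Weakness.MAJOR and (
                groups[2] is Weakness.MAJOR or groups[6] is Weakness.MAJOR):
            return "moderate"            # no major core weakness, but one in group 2 or 6
        if major_core == 0 and minor_core <= 2 and not any(
                groups[g] is Weakness.MAJOR for g in (2, 6, 7)):
            return "high"                # only a few minor weaknesses overall
        return "appraiser judgement needed"  # cases the table leaves open


    def relevance(groups: dict[int, Weakness]) -> str:
        """Relevance of an SLR based on groups 1 and 6, loosely following Table 4."""
        g1, g6 = groups[1], groups[6]
        if g1 is Weakness.MAJOR and g6 is Weakness.MAJOR:
            return "critically low"
        if Weakness.MAJOR in (g1, g6):
            return "low"
        if g1 is Weakness.MINOR and g6 is Weakness.MINOR:
            return "moderate"
        return "high"                    # at most a minor weakness in group 6


    # Example: one hypothetical SLR with major weaknesses in groups 1, 3, 4 and 5.
    example = {1: Weakness.MAJOR, 2: Weakness.MINOR, 3: Weakness.MAJOR,
               4: Weakness.MAJOR, 5: Weakness.MAJOR, 6: Weakness.NONE,
               7: Weakness.NONE}
    print(confidence(example), "/", relevance(example))  # critically low / low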

6 Reliability of QAISER

In this section, we highlight three aspects that contribute to the reliability of QAISER as a potentially effective instrument for assessing the quality of SLRs in SE.

  1. The relevance of AMSTAR and AMSTAR 2 validations: The original AMSTAR Shea et al. (2007b) consisted of 11 appraisal items. Based on community feedback, AMSTAR 2 was proposed, consisting of 16 items with an increased focus on the risk of bias evaluation and the possibility to assess SLRs that may include non-randomized studies. Both AMSTAR and AMSTAR 2 have been used and validated extensively (cf. Gates et al. (2018); Pieper et al. (2015); Shea et al. (2007a)). These validation efforts provide credibility to QAISER as well, as most of its items (12 out of 15) are adapted from AMSTAR 2.

  2. Comparison with DARE: The DARE criteria Centre for Reviews and Dissemination, University of York (2019) are the most frequently used criteria in SE to assess the quality of SLRs. Several essential aspects related to the quality of SLRs are not covered in DARE, e.g., justifying the need to conduct a review, establishing a protocol prior to performing the review, the study selection process, the data extraction process, and the quality assessment process. Furthermore, three of the DARE criteria (including the important criterion about synthesis) are limited to checking the presence/absence of different aspects, rather than their appropriateness, e.g., whether the inclusion and exclusion criteria are reported or not. QAISER not only covers aspects that are missing in DARE, but it also focuses on the quality aspects of different criteria, for example, checking the appropriateness of inclusion and exclusion criteria rather than only focusing on the mere reporting of such criteria in the review report.

  3. External validation: We followed a systematic process (see Section 3 for details) to adapt AMSTAR 2 for SE and to introduce additional aspects based on the recommendations in the systematically identified EBSE literature. Four leading experts then reviewed the proposed instrument to check the appropriateness, completeness, and understandability of its items (refer to Section 3 for details about the validation step). The experts recommended some changes, which we incorporated in the revised version of QAISER. The experts did not suggest any further changes in the revised version.

7 Support for Applying QAISER

In line with AMSTAR 2, we also developed a guidance document for supporting appraisers in applying QAISER. The guidance document describes the following aspects for each QAISER item:

  1. What is the item about? We provide a brief description of the item.

  2. How to assess the item? We explain what an appraiser needs to check for assigning ’Partial yes’, ’Yes’, and ’No’ ratings.

  3. Where to find the relevant information to assess the item? We provide hints concerning which sections of the review papers are most likely to have the information needed to assess the item.

To further support the application of QAISER, we developed a spreadsheet that operationalizes the QAISER instrument. The first author used the spreadsheet to apply the QAISER instrument on three SLRs that the second author selected. The selected SLRs had all previously been assessed with the highest rating using the DARE criteria. The purpose of selecting only high-ranking SLRs on the DARE criteria was to illustrate the usefulness of QAISER in supporting appraisers in performing a more fine-grained and thorough critical appraisal compared to DARE. The guidance document, QAISER instrument, and the spreadsheet corresponding to the three example applications are all available online (https://drive.google.com/drive/folders/1p7OUEfqQTF4dY3e_OX_OHiyi_tC4E_cU?usp=sharing). Researchers could look at the three examples as additional support complementing the guidance document.
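As a rough indication of what such an operationalization could look like, the snippet below writes a blank assessment sheet with the columns of Table 2 (group, item, rating, impact, comments). The file name and CSV layout are our own choices for illustration; the actual spreadsheet shared online may be organised differently.

    import csv

    # One row per QAISER item, grouped and labelled as in Tables 1 and 2.
    ROWS = [
        ("1. Motivation", "Item 1 (need)"),
        ("2. Plan", "Item 2 (protocol)"),
        ("3. Identification and selection", "Item 4 (search)"),
        ("3. Identification and selection", "Item 3 (selection criteria)"),
        ("3. Identification and selection", "Item 5 (selection process)"),
        ("3. Identification and selection", "Item 7 (excluded studies)"),
        ("4. Data collection and appraisal", "Item 6 (data extraction)"),
        ("4. Data collection and appraisal", "Item 8 (study characteristics)"),
        ("4. Data collection and appraisal", "Item 9 (quality criteria)"),
        ("4. Data collection and appraisal", "Item 10 (quality assessment process)"),
        ("5. Synthesis", "Item 11 (synthesis)"),
        ("5. Synthesis", "Item 12 (considered study quality)"),
        ("5. Synthesis", "Item 13 (considered study characteristics)"),
        ("6. Recommendations and conclusions", "Item 14 (recommendation)"),
        ("7. Conflict of interest", "Item 15 (their own)"),
    ]

    with open("qaiser_assessment_template.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Group", "Item", "Rating (Yes/Partial Yes/No)",
                         "Impact", "Comments"])
        for group, item in ROWS:
            writer.writerow([group, item, "", "", ""])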

QAISER supported us in identifying additional weaknesses in all three SLRs that scored high on the DARE criteria. The three SLRs have different weaknesses and, taken together, these cover the following aspects: study selection process, data extraction process, quality assessment instrument and process, and conflicts of interest. Furthermore, for those aspects that are already covered in DARE (e.g., inclusion and exclusion criteria and search strategy), we were able to perform a more fine-grained appraisal. This resulted in identifying some additional weaknesses, e.g., concerning the appropriateness of the synthesis and the comprehensiveness of the search strategy. For details, we refer interested readers to the online spreadsheet. A unique characteristic of QAISER, which is not available in DARE, is how it allows appraisers to progressively arrive at the SLR-level assessment from the group- and item-level assessments. In the selected SLRs, this mechanism helped us first combine the item-level ratings to identify the weaknesses at the group level and then use the identified group-level weaknesses to arrive at the overall judgment at the SLR level. Finally, we also found the option to add comments extremely helpful in reflecting on our ratings and corresponding justifications.

8 Threats to Validity

In this research, we aimed to propose an instrument for appraising systematic literature reviews in software engineering. In the design and conduct of this research, the following two objectives guided us:

  1. To develop an instrument that is comprehensive, practical and appropriate for software engineering. Thus, the instrument shall cover all essential elements concerning the quality of an SLR, assist the appraiser when judging the quality of an SLR, and take into account the SE body of knowledge.

  2. To reduce the researcher’s bias in the development of the instrument.

The two main threats to validity identified concern researcher bias and the applicability of QAISER. The researchers come from the same affiliation, which creates a risk of a shared, potentially one-sided view on research. Furthermore, when creating an instrument for use by the research community, there is a risk that the instrument is hard to understand, which would limit its applicability.

To achieve the two objectives above and mitigate the threats to validity, we undertook the following actions:

  • Use of an established tool as a foundation: We used AMSTAR 2 as the starting point for our work, as it is a widely used and validated tool Ali and Usman (2019).

  • A systematic and rigorous process: As described in Section 3, we followed a systematic approach in the development of QAISER. All data collection, analysis and interpretation involved at least two researchers. A third researcher independently reviewed the outcomes of several individual phases in the study. We maintained traceability of the adaptations to the existing tool by documenting the reasons, the sources consulted, and the changes made.

  • Validation with leading experts in the field: The experts consulted to validate QAISER include some of the main contributors to the methodological guidelines for designing, conducting and reporting secondary studies in SE. They have also authored several SLRs and have conducted several studies reporting critical evaluations of existing SLRs in SE.

  • Comprehensive coverage of SE body of knowledge: We used a systematic approach to identify and select a representative list of sources to capture the current best practices and evidence for inclusion in QAISER.

With these actions, we have tried to mitigate the major threats to the validity of our research. However, a missing aspect in our research is the evaluation of the instrument in use (focusing on aspects like usability and reliability of QAISER). By making QAISER and guidance for its usage publicly available, we hope that we and others in the field will address this limitation in the future.

9 Discussion

Given the lack of an appraisal instrument for assessing the quality of SLRs in SE, we developed QAISER. As presented in the introduction (see Section 1), researchers in SE have used the criteria in DARE Centre for Reviews and Dissemination, University of York (2019) for assessing the quality of SLRs, although these come with limitations Ali and Usman (2018). Furthermore, simply using an appraisal instrument from another discipline, such as AMSTAR 2, also comes with issues, as illustrated by the development of QAISER. AMSTAR 2 needed to be adapted to SE and is hence not an option by itself; the differences between the disciplines need to be captured in the appraisal instrument.

QAISER takes its starting point from a well-established appraisal instrument from another field, i.e., AMSTAR 2 from the field of evidence-based healthcare. Furthermore, QAISER incorporates best practices from the conduct of SLRs in SE. Thus, QAISER is well-grounded in the literature and contributes to taking the quality assessment of SLRs in SE one step forward.

The objective of QAISER is to support appraisers of SLRs in SE. QAISER is not intended as an instrument for scoring the quality of SLRs. Rather, QAISER is intended to support appraisers by covering the most salient quality aspects of an SLR. The expertise of the individual appraiser is crucial, and it cannot be replaced by an appraisal instrument such as QAISER.

Although the main objective is to support appraisers in assessing quality, we believe that authors of SLRs may also use QAISER to help them improve the quality of their SLR before submitting the research for assessment. Ideally, each submitted SLR is already of high quality at the submission stage. It should be noted that the quality of an SLR is strongly influenced by the quality of the included primary studies. The need to assess the quality of primary studies is highlighted by, for example, Dybå and Dingsøyr Dybå and Dingsøyr (2008) and Yang et al. Yang et al. (2020). With the same objective, Wohlin highlights the need to write for synthesis when publishing primary studies Wohlin (2014b).

We recommend that all users of QAISER consult not only the appraisal instrument itself but also the accompanying guidance document. The latter is particularly important when using the instrument for the first few times. We have also made available online a spreadsheet operationalizing QAISER and three example assessments to further support appraisers in using QAISER.

The items and their response options in QAISER are intended to help highlight areas with weaknesses (or room for improvement). Given that assessment is prone to bias, we have deliberately chosen two or three levels for assessing each item. More levels may increase the risk of appraiser bias, although a more fine-grained scale could also be beneficial. However, since QAISER is geared towards supporting appraisers of SLRs, we leave it to each appraiser to tune the feedback in writing, using the comments option provided with each item, rather than relying on a more fine-grained scale.

When using QAISER for a mapping study, some items or questions may be less applicable than for an SLR, for example, the item concerning synthesis. We considered adding a "not applicable" option for mapping studies but chose not to make the appraisal instrument more complex. Thus, we leave it to each appraiser to decide whether something is not applicable for a mapping study. Our preference is to leave this freedom to the appraiser, given that SLRs and mapping studies come in different shapes and forms. Assessing SLRs and mapping studies is a subjective endeavour, and the objective of any appraisal instrument should be to support the expert appraiser.

10 Conclusion and Future Work

QAISER, as an appraisal instrument for SLRs in SE, is built on a well-established appraisal instrument from another discipline (AMSTAR 2), and on a set of guidelines, checklists, and experiences from SE. Furthermore, four external experts on SLRs in SE have reviewed an earlier version of QAISER, and QAISER has been revised based on their feedback. QAISER is thus well-founded and ready for further validation through usage.

QAISER includes 16 items, each with two or three response options that support the appraiser in arriving at an assessment for that item. QAISER provides support for consolidating the items at a group level, which is not done in AMSTAR 2. In QAISER, the items are consolidated into seven groups to give the appraiser a good overview of the strengths and potential weaknesses of an SLR. Moreover, QAISER supports consolidating from the group level to the SLR level. The assessment of each group is systematically used to form an opinion about the overall quality of an SLR in terms of both reliability and relevance. AMSTAR 2 only provides an overall assessment of the confidence in the results. Given the importance of both the reliability and the relevance of the results for software engineering, we provide support for both aspects.

In the future, we plan to evaluate the reliability and usability of QAISER by asking independent researchers to use it to assess the quality of selected SLRs. Based on such feedback, we plan to enhance QAISER further to support the software engineering community in assessing SLRs.

Acknowledgements

We would like to express our sincere thanks to the external experts: Prof. Daniela S. Cruzes, Prof. Barbara Kitchenham, Prof. Stephen G. MacDonell and Prof. Kai Petersen for their constructive feedback on QAISER.

We would also like to thank Prof. Jürgen Börstler for his kind participation in the pilot of the study. His detailed feedback helped us to improve the planning and execution of the evaluations with external experts. We also extend our gratitude to Dr. Jefferson S. Molléri for providing the listing of articles from the work with CERSE.

This work has been supported by ELLIIT, a Strategic Area within IT and Mobile Communications, funded by the Swedish Government. The work has also been supported by research grants for the VITS project (reference number 20180127) and the OSIR project (reference number 20190081) from the Knowledge Foundation in Sweden.

References

  • N. B. Ali, K. Petersen, and C. Wohlin (2014) A systematic literature review on the industrial use of software process simulation. J. Syst. Softw. 97, pp. 65–85.
  • N. B. Ali and K. Petersen (2014) Evaluating strategies for study selection in systematic literature studies. In Proceedings of the ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM, pp. 45:1–45:4.
  • N. B. Ali and M. Usman (2018) Reliability of search in systematic reviews: towards a quality assessment framework for the automated-search strategy. Inf. Softw. Technol. 99, pp. 133–147.
  • N. B. Ali and M. Usman (2019) A critical appraisal tool for systematic literature reviews in software engineering. Inf. Softw. Technol. 112, pp. 48–50.
  • A. Ampatzoglou, S. Bibi, P. Avgeriou, M. Verbeek, and A. Chatzigeorgiou (2019) Identifying, categorizing and mitigating threats to validity in software engineering secondary studies. Inf. Softw. Technol. 106, pp. 201–230.
  • D. Budgen, P. Brereton, S. Drummond, and N. Williams (2018) Reporting systematic reviews: some lessons from a tertiary study. Inf. Softw. Technol. 95, pp. 62–74.
  • Centre for Reviews and Dissemination, University of York (2019) Database of abstracts of reviews of effects (DARE). Accessed: 29 Nov 2019.
  • N. Condori-Fernandez, R. J. Wieringa, M. Daneva, B. Mutschler, and O. Pastor (2012) An experimental evaluation of a unified checklist for designing and reporting empirical research in software engineering. Technical report, Centre for Telematics and Information Technology (CTIT).
  • D. S. Cruzes and T. Dybå (2011) Research synthesis in software engineering: a tertiary study. Inf. Softw. Technol. 53 (5), pp. 440–455.
  • T. Dybå and T. Dingsøyr (2008) Strength of evidence in systematic reviews in software engineering. In Proceedings of the Second International Symposium on Empirical Software Engineering and Measurement, ESEM, H. D. Rombach, S. G. Elbaum, and J. Münch (Eds.), pp. 178–187.
  • A. Gates, M. Gates, G. Duarte, M. Cary, M. Becker, B. Prediger, B. Vandermeer, R. M. Fernandes, D. Pieper, and L. Hartling (2018) Evaluation of the reliability, usability, and applicability of AMSTAR, AMSTAR 2, and ROBIS: protocol for a descriptive analytic study. Systematic Reviews 7 (1), pp. 85.
  • T. Greenhalgh (1997) How to read a paper: papers that summarise other papers (systematic reviews and meta-analyses). BMJ 315 (7109), pp. 672–675.
  • M. Höst and P. Runeson (2007) Checklists for software engineering case study research. In Proceedings of the First International Symposium on Empirical Software Engineering and Measurement (ESEM), pp. 479–481.
  • K. S. Khan, G. Ter Riet, J. Glanville, A. J. Sowden, J. Kleijnen, et al. (2001) Undertaking systematic reviews of research on effectiveness: CRD's guidance for carrying out or commissioning reviews. Technical Report 4 (2nd edition), NHS Centre for Reviews and Dissemination, University of York, UK.
  • B. A. Kitchenham, D. Budgen, and P. Brereton (2015) Evidence-based software engineering and systematic reviews. Vol. 4, CRC Press.
  • B. Kitchenham and P. Brereton (2013) A systematic review of systematic review process research in software engineering. Inf. Softw. Technol. 55 (12), pp. 2049–2075.
  • B. Kitchenham and S. Charters (2007) Guidelines for performing systematic literature reviews in software engineering. Technical report, School of Computer Science and Mathematics, Keele University, Keele, UK.
  • B. Kitchenham, R. Pretorius, D. Budgen, O. Pearl Brereton, M. Turner, M. Niazi, and S. Linkman (2010) Systematic literature reviews in software engineering – A tertiary study. Inf. Softw. Technol. 52 (8), pp. 792–805.
  • B. Kitchenham (2004) Procedures for performing systematic reviews. Technical report, Keele University, Keele, UK.
  • S. G. MacDonell, M. J. Shepperd, B. A. Kitchenham, and E. Mendes (2010) How reliable are systematic reviews in empirical software engineering? IEEE Trans. Software Eng. 36 (5), pp. 676–687.
  • E. Mendes, C. Wohlin, K. R. Felizardo, and M. Kalinowski (2020) When to update systematic literature reviews in software engineering. J. Syst. Softw. 167, pp. 110607.
  • J. S. Molléri, K. Petersen, and E. Mendes (2019) CERSE-Catalog for empirical research in software engineering: a systematic mapping study. Inf. Softw. Technol. 105, pp. 117–149.
  • I. Nurdiani, J. Börstler, and S. A. Fricker (2016) The impacts of agile and lean practices on project constraints: a tertiary study. J. Syst. Softw. 119, pp. 162–183.
  • K. Petersen, S. Vakkalanka, and L. Kuzniarz (2015) Guidelines for conducting systematic mapping studies in software engineering: an update. Inf. Softw. Technol. 64, pp. 1–18.
  • D. Pieper, R. B. Buechter, L. Li, B. Prediger, and M. Eikermann (2015) Systematic review found AMSTAR, but not R(evised)-AMSTAR, to have good measurement properties. J. Clin. Epidemiol. 68 (5), pp. 574–583.
  • B. J. Shea, L. M. Bouter, J. Peterson, M. Boers, N. Andersson, Z. Ortiz, T. Ramsay, A. Bai, V. K. Shukla, and J. M. Grimshaw (2007a) External validation of a measurement tool to assess systematic reviews (AMSTAR). PLoS One 2 (12), pp. e1350.
  • B. J. Shea, J. M. Grimshaw, G. A. Wells, M. Boers, N. Andersson, C. Hamel, A. C. Porter, P. Tugwell, D. Moher, and L. M. Bouter (2007b) Development of AMSTAR: a measurement tool to assess the methodological quality of systematic reviews. BMC Med. Res. Methodol. 7 (1), pp. 10.
  • B. J. Shea, B. C. Reeves, G. Wells, M. Thuku, C. Hamel, J. Moran, D. Moher, P. Tugwell, V. Welch, E. Kristjansson, and D. A. Henry (2017) AMSTAR 2: a critical appraisal tool for systematic reviews that include randomised or non-randomised studies of healthcare interventions, or both. BMJ 358.
  • D. F. Stroup, J. A. Berlin, S. C. Morton, I. Olkin, G. D. Williamson, D. Rennie, D. Moher, B. J. Becker, T. A. Sipe, S. B. Thacker, et al. (2000) Meta-analysis of observational studies in epidemiology: a proposal for reporting. JAMA 283 (15), pp. 2008–2012.
  • R. J. Wieringa (2012) Towards a unified checklist for empirical research in software engineering: first proposal. In Proceedings of the 16th International Conference on Evaluation & Assessment in Software Engineering, EASE, pp. 161–165.
  • C. Wohlin, P. Runeson, P. A. da Mota Silveira Neto, E. Engström, I. do Carmo Machado, and E. S. de Almeida (2013) On the reliability of mapping studies in software engineering. J. Syst. Softw. 86 (10), pp. 2594–2610.
  • C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and A. Wesslén (2012) Experimentation in software engineering. Springer Science & Business Media.
  • C. Wohlin (2014a) Guidelines for snowballing in systematic literature studies and a replication in software engineering. In Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering, EASE, pp. 38:1–38:10.
  • C. Wohlin (2014b) Writing for synthesis of evidence in empirical software engineering. In Proceedings of the 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM '14, New York, NY, USA, pp. 46:1–46:4.
  • L. Yang, H. Zhang, H. Shen, X. Huang, X. Zhou, G. Rong, and D. Shao (2020) Quality assessment in systematic literature reviews: a software engineering perspective. Inf. Softw. Technol., pp. 106397.

Appendix A. QAISER instrument

1. Did the authors of the review adequately justify the need for undertaking the review? Kitchenham et al. (2015); Petersen et al. (2015); Ampatzoglou et al. (2019)
For Partial Yes: For Yes:
The authors of the review should have: As for partial yes, plus the authors of the review should also have ALL of the following:
Identified existing related reviews on the topic, or explained that no related review exists.
Discussed related existing reviews on the topic, if any.
Established a scientific or practical need for their review Kitchenham et al. (2015).
Yes
Partial Yes
No
Comments:
2. Did the authors of the review establish a protocol prior to the conduct of the review? Shea et al. (2017); Ampatzoglou et al. (2019); Kitchenham et al. (2015)
For Partial Yes: For Yes:
The authors of the review state that they had a written protocol that is publicly available that should specify ALL of the following: As for partial yes, plus ALL of the following:
Appropriate review questions Shea et al. (2017); Kitchenham et al. (2015); Petersen et al. (2015)
Search process Shea et al. (2017); Kitchenham et al. (2015)
Study selection process Shea et al. (2017); Kitchenham et al. (2015); Ampatzoglou et al. (2019)
Data extraction process Kitchenham et al. (2015)
Study quality assessment process Shea et al. (2017); Kitchenham et al. (2015)
An outline of the data synthesis plan Shea et al. (2017); Kitchenham et al. (2015)
The justifications of any deviations from the protocol should be documented Shea et al. (2017); Ampatzoglou et al. (2019); Kitchenham et al. (2015).
The protocol should have been internally validated by piloting selection criteria, search strings, data extraction and synthesis processes Kitchenham et al. (2015).
The protocol should have been validated by an external reviewer Kitchenham et al. (2015).
Yes
Partial Yes
No
Comments:
3. Are the review’s inclusion and exclusion criteria appropriate?
For Yes: the review should have ALL of the following:
Reported the inclusion and exclusion criteria Centre for Reviews and Dissemination, University of York (2019); Kitchenham et al. (2015); Ampatzoglou et al. (2019).
The criteria are aligned with the review questions.
Provided appropriate justifications for any restrictions used in the inclusion and exclusion criteria (e.g., topic-related scoping restrictions, time-frame, language, study type, and peer-reviewed works only) Kitchenham et al. (2015); Shea et al. (2017).
Yes
No
Comments:
4. Did the authors of the review use a comprehensive literature search strategy? Shea et al. (2017); Kitchenham et al. (2015); Ampatzoglou et al. (2019); Centre for Reviews and Dissemination, University of York (2019)
When database search is used as the main method
For Partial Yes: For Yes:
The review should have All of the following: As for partial yes, plus the review should also have used:
An appropriate process for constructing the search strings including piloting Kitchenham et al. (2015); Ampatzoglou et al. (2019)
Search process validation based on an appropriate level of recall and precision using a known-set of papers Kitchenham et al. (2015); Ampatzoglou et al. (2019); Petersen et al. (2015)
At least one relevant indexing database (e.g., Scopus) in combination with relevant publisher databases (e.g., IEEE and ACM) Kitchenham et al. (2015)
Appropriately documented the search process (e.g., known-set, search strings, and search results) Kitchenham et al. (2015); Ampatzoglou et al. (2019); Ali and Usman (2018)
At least one additional search method (e.g., snowballing, manual search, or use DBLP or Google scholar of key researchers) Shea et al. (2017); Kitchenham et al. (2015); Ampatzoglou et al. (2019) Yes
Partial Yes
No
When snowballing is used as the main method
For Partial Yes: For Yes:
The review should have ALL of the following: As for partial yes, plus the review should also have used either ONE of the following:
Appropriately justified the use of snowballing as the main method Kitchenham et al. (2015)
Selected an appropriate start/seed set Wohlin (2014a)
Performed an acceptable number of backward and forward snowballing iterations Wohlin (2014a)
At least one additional search method (e.g., manual search, or use DBLP or Google scholar of key researchers) Wohlin (2014a); Kitchenham et al. (2015)
Snowballing iterations until no new papers were found Wohlin (2014a)
Yes
Partial Yes
No
Comments:
5. Did the authors of the review use a reliable study selection process? Shea et al. (2017); Kitchenham et al. (2015); Petersen et al. (2015)
For Partial Yes: For Yes, either ONE of the following:
At least two authors of the review selected a representative sample of eligible studies and achieved good agreement, with the remainder selected by one review author Shea et al. (2017); Kitchenham et al. (2015); Ampatzoglou et al. (2019).
At least two authors of the review independently agreed on selection of eligible studies, reached consensus on which studies to include, and reported the agreement level Shea et al. (2017); Kitchenham et al. (2015); Petersen et al. (2015).
OR if only a sample of studies were selected by two authors, all excluded studies were reviewed by at least two authors of the review.
Yes
Partial Yes
No
Comments:
6. Did the authors of the review use a reliable data extraction process? Shea et al. (2017); Kitchenham et al. (2015); Ampatzoglou et al. (2019); Petersen et al. (2015)
For Partial Yes: For Yes:
At least two authors of the review extracted data from a sample of included studies and achieved good agreement, with the remainder extracted by one review author Shea et al. (2017); Kitchenham et al. (2015); Petersen et al. (2015)
At least two authors of the review achieved consensus on which data to extract from the included studies Shea et al. (2017); Kitchenham et al. (2015); Ampatzoglou et al. (2019); Petersen et al. (2015) Yes
Partial Yes
No
Comments:
7. Did the authors of the review provide a list of excluded studies, along with the justifications for exclusion, that were read in full text? Shea et al. (2017); Kitchenham et al. (2015)
For Partial Yes, the review should have: For Yes, the review should also have:
Provided a list of all potentially relevant studies that were read in full text, but excluded from the review Shea et al. (2017); Kitchenham et al. (2015) Justified the exclusion from the review of each potentially relevant study that was read in full text Shea et al. (2017) Yes
Partial Yes
No
Comments:
8. Did the authors of the review provide sufficient primary studies’ characteristics to interpret the results? Shea et al. (2017); Kitchenham et al. (2015); Ampatzoglou et al. (2019); Centre for Reviews and Dissemination, University of York (2019)
For Yes, the review should have described ALL of the following:
Populations Shea et al. (2017)
Interventions, when applicable Shea et al. (2017)
Outcomes, when applicable Shea et al. (2017)
Study types Shea et al. (2017)
Study contexts Shea et al. (2017)
Yes
No
Comments:
9. Did the authors of the review use an appropriate instrument for assessing the quality of primary studies that were included in the review? Shea et al. (2017); Kitchenham et al. (2015)
For Yes, the review should have used appropriate instruments for different types of studies included in the review. An appropriate instrument would have questions related to ALL of the following Kitchenham et al. (2015):
"The goals, research questions, hypotheses and outcome measures" Kitchenham et al. (2015)
"The study design and the extent to which it is appropriate to the study type" Kitchenham et al. (2015)
"Study data collection and analysis and the extent to which they are appropriate given the study design" Kitchenham et al. (2015)
"Study findings, the strength of evidence supporting those findings, the extent to which the findings answer the research questions, and their value to practitioners and researchers" Kitchenham et al. (2015)
Yes
No
Comments:
10. Did the authors of the review use a reliable quality assessment process? Kitchenham et al. (2015)
For Partial Yes: For Yes:
At least two authors of the review performed quality assessment of a representative sample of eligible studies and achieved good agreement, with the remainder performed by one review author Kitchenham et al. (2015)
At least two authors of the review independently performed quality assessment of eligible studies, reached consensus, and reported the agreement level Kitchenham et al. (2015) Yes
Partial Yes
No
Comments:
11. Were the primary studies appropriately synthesized? Centre for Reviews and Dissemination, University of York (2019); Kitchenham et al. (2015); Ampatzoglou et al. (2019); Budgen et al. (2018)
For Yes, the review should have ALL of the following:
Selected an appropriate synthesis method given the review questions and extracted data Kitchenham et al. (2015); Ampatzoglou et al. (2019); Budgen et al. (2018)
Applied the selected synthesis method appropriately
Provided a clear trace linking the answers of review questions and interpretations to the data from the primary studies
Yes
No
Comments:
12. Did the authors of the review account for quality of individual studies when interpreting/discussing the results of the review? Shea et al. (2017); Kitchenham et al. (2015); Ampatzoglou et al. (2019)
For Yes, either ONE of the following:
Included only high-quality studies Shea et al. (2017); Kitchenham et al. (2015)
OR, the authors have categorized the analysis and interpretation of results based on study quality Shea et al. (2017); Kitchenham et al. (2015)
Yes
No
Comments:
13. Did the authors of the review account for primary studies’ characteristics when interpreting/discussing the results of the review?
For Yes, either ONE of the following:
There were no significant similarities or differences to warrant a separate analysis
OR, the authors have appropriately accounted for study characteristics and discussed their impact on the review results Shea et al. (2017); Kitchenham et al. (2015)
Yes
No
14. Did the authors of the review provide appropriate recommendations and conclusions from the review? Budgen et al. (2018)
For Partial Yes, the review should have: For Yes, the recommendations and conclusions should also be
Provided satisfactory recommendations and conclusions based on the review results
Clearly traceable back to the review results
Clearly targeting specific stakeholders
Well aligned with the upfront motivation for undertaking the review, or any deviations are well explained
Providing new valuable insights to the community
Yes
Partial Yes
No
Comments:
15. Did the authors of the review report their own potential sources of conflict of interest, including any funding they received for conducting the review? Shea et al. (2017); Kitchenham et al. (2015)
For Yes, either ONE of the following:
The authors reported no competing interests
OR, the authors described their funding sources Shea et al. (2017); Kitchenham et al. (2015) and how they managed potential conflicts of interest Shea et al. (2017)
Yes
No
Comments:
Table 5: QAISER Instrument.

Appendix B. QAISER Guidance Document

In this document, we provide further guidance to support a consistent interpretation of items in QAISER.

Item 1: Did the authors of the review adequately justify the need for undertaking the review?
A large number of SLRs are reported in software engineering every year. A review should be initiated on the basis of a practical or scientific need. The authors of the review should also extensively search for any existing reviews or mapping studies on the topic. The authors should only continue planning the review if there are no existing ones that are up to date on the specific area Kitchenham et al. (2015). Mendes et al. Mendes et al. (2020) provide support to decide if a review should be updated.

To score ’Partial Yes’, appraisers should check if the authors of the review have made a sufficiently extensive effort to identify related reviews on the topic.

To score ’Yes’, appraisers should ensure that the authors of the review have established the need for undertaking the review. If there are existing reviews on the same topic, the authors need to establish the need by highlighting the gap in the existing reviews and explaining how their review is going to fill the gap. In case there are no existing reviews on the topic, the authors should explain why it is essential to aggregate the evidence on the topic.

The information about the need for the review is typically described in the background or related work sections of the report.

Item 2: Did the authors of the review establish a protocol prior to the conduct of the review? To reduce the risk of bias, it is important that the authors of the review have developed and validated a written protocol before commencing the review.

To score ’Partial Yes’, appraisers should first check that the protocol is accessible and that the review report describes where and how it can be accessed. Furthermore, the protocol should have documented appropriate review questions, the processes for search, study selection, data extraction and quality assessment, and at least an outline of the data synthesis plan.

To rate ’Yes’, the protocol should have been validated both internally and by an independent reviewer, and the authors of the review should have clearly documented and justified any deviations from the protocol and discussed their impact on the study. If the appraisers notice that the review report contains unexplained deviations from the protocol, they should downgrade the rating.

The above information about the protocol is typically described in the methodology section of the review report.

Item 3: Are the review’s inclusion and exclusion criteria appropriate? A review should use documented selection criteria Ampatzoglou et al. (2019); Shea et al. (2017); Kitchenham et al. (2015); Petersen et al. (2015); Centre for Reviews and Dissemination, University of York (2019).

To score ’Yes’, appraisers should ensure that the authors of the review have justified any restrictions, e.g., on research designs, the time frame of publication, and the type of publications imposed in the selection process. Furthermore, the justification should also address the likely impact of the restrictions on the studied population and the generalization of the findings.

The selection criteria and the justifications for any restrictions are expected to be found in the methodology or limitations/threats to the validity section of the review report. Furthermore, some of the exclusion criteria may have been implemented in the search process.

Item 4: Did the authors of the review use a comprehensive literature search strategy? A comprehensive search strategy is important to maximize the coverage of the relevant literature. The authors of the review should provide a justification for using a particular search method (e.g., database or indexing service search or snowballing) as a primary method for searching the relevant literature.

To rate ’Partial Yes’, appraisers should check the following in case of a database or indexing service search as the primary search method:

  • The authors of the review have validated their search process by comparing their search results with a known-set of papers. The known-set of papers are the relevant papers that are already known to the authors of the review based on, for example, manual search or their knowledge of the review topic. The validation is performed by computing recall and precision using the search results and the known set of relevant papers (for details, refer to Kitchenham et al. (2015)).

  • The authors of the review have used an appropriate process for identifying the search terms, synonyms and constructing the search strings.

  • The authors of the review have used a combination of publisher databases and indexing services. IEEE and ACM are the most relevant publisher databases in software engineering, as they publish the most important and relevant conferences and journals in software engineering Kitchenham et al. (2015). As a minimum, the authors should have used IEEE and ACM among the publisher databases and one indexing service (e.g. Scopus).

  • The authors of the review have documented the search process. Appropriate documentation of the search process is important to ensure repeatability and transparency. The authors of the review should document: general and database-specific search strings, total and database-specific search results, search filters (e.g., years) used, the date when the search strings were applied, the known-set of papers used for validation, and the validation measures (i.e., recall and precision); for details, see Ali and Usman (2018).

To rate ’Yes’, the authors of the review should have also used at least one additional search method (e.g., snowballing).
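To make the recall and precision computation described in the validation bullet above concrete, the following minimal Python sketch compares hypothetical search results with a known set of relevant papers (the paper identifiers are placeholders). Note that this is only an illustration: in practice, precision is typically judged against all papers found to be relevant, not only the known set.

# Known-set validation sketch: recall and precision of a database search,
# computed against papers already known to be relevant (hypothetical IDs).
known_set = {"paper-A", "paper-B", "paper-C", "paper-D"}
search_results = {"paper-A", "paper-B", "paper-X", "paper-Y", "paper-Z"}

retrieved_known = known_set & search_results
recall = len(retrieved_known) / len(known_set)          # share of known papers retrieved
precision = len(retrieved_known) / len(search_results)  # share of retrieved papers in the known set

print(f"recall = {recall:.2f}, precision = {precision:.2f}")  # recall = 0.50, precision = 0.40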

To rate ’Partial Yes’, appraisers should check the following in case of a snowballing search as the primary search method:

  • The authors of the review have used an appropriate process for identifying the seed set for starting the snowballing procedure. The way of identifying the seed set is well documented and well motivated.

  • The authors have iterated the snowballing procedure until no more papers are found.

  • The authors of the review have validated their search process by comparing their search results with a known-set of papers. The known-set of papers are the relevant papers that are already known to the authors of the review based on, for example, manual search or their knowledge of the review topic. The validation is performed by computing recall and precision using the search results and the known set of relevant papers (for details, refer to Kitchenham et al. (2015)).

  • The authors of the review have documented the search process. Appropriate documentation of the search process is important to ensure repeatability and transparency. The authors of the review should document: identification of the seed set, the different iterations conducted, known-set of papers used for validation, and validation measures (i.e., recall and precision).

To rate ’Yes’, the authors of the review should have also used at least one additional search method (e.g., a manual search of key journals or conference proceedings, or searching DBLP or Google Scholar profiles of key researchers), or continued the snowballing iterations until no new papers were found.
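To make the iteration logic above concrete, the sketch below outlines a snowballing loop under stated assumptions: backward_refs, forward_cites, and is_relevant are hypothetical helper functions standing in for a paper's reference list, the papers citing it, and the review's selection criteria, respectively.

def snowball(seed_set, backward_refs, forward_cites, is_relevant):
    # Iterative backward and forward snowballing starting from a seed set;
    # the loop stops when an iteration yields no new relevant papers.
    included = set(seed_set)
    frontier = set(seed_set)
    iterations = 0
    while frontier:
        iterations += 1
        candidates = set()
        for paper in frontier:
            candidates |= set(backward_refs(paper))   # backward snowballing
            candidates |= set(forward_cites(paper))   # forward snowballing
        new_papers = {p for p in candidates if p not in included and is_relevant(p)}
        included |= new_papers
        frontier = new_papers
    return included, iterations

Documenting the seed set, the candidates of each iteration, and the validation measures, as described in the bullets above, would complement such a loop.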

Item 5. Did the authors of the review use a reliable study selection process? To reduce bias and the possibility of making mistakes, the crucial step of inclusion and exclusion of the papers should involve at least two reviewers Ali and Petersen (2014); Shea et al. (2017).

To rate ’Partial Yes’, appraisers should check if at least two authors of the review selected a representative sample of eligible studies and achieved good agreement, with the remainder selected by one review author. A single reviewer should only proceed with the selection after a Kappa score indicating strong agreement between multiple authors of the review has been reached.

To rate ’Yes’, appraisers should check that one of the following two processes was followed during study selection: 1) at least two authors of the review independently performed study selection on all eligible studies, reached consensus, and reported the agreement level; or 2) at least two authors of the review selected a sample of eligible studies and achieved a good agreement level, with the remainder selected by one review author, in which case all excluded studies must have been reviewed by at least two authors of the review. A single reviewer should only proceed with the selection after a kappa score indicating strong agreement between multiple authors of the review has been reached, and even then the excluded studies should be reviewed by other authors of the review as well. Appraisers should also check that the report describes the rules for inclusion and exclusion, how these rules were applied, and how any differences between reviewers were resolved. Furthermore, the report should state the number of papers remaining at each stage Budgen et al. (2018).
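The following minimal Python sketch illustrates the kind of kappa-based agreement check mentioned above, using Cohen's kappa on two reviewers' include/exclude decisions for a hypothetical pilot sample; the threshold for what counts as strong agreement is left to the appraiser.

def cohens_kappa(decisions_a, decisions_b):
    # Cohen's kappa for two raters: (observed agreement - chance agreement)
    # divided by (1 - chance agreement).
    assert len(decisions_a) == len(decisions_b) and decisions_a
    n = len(decisions_a)
    labels = set(decisions_a) | set(decisions_b)
    observed = sum(a == b for a, b in zip(decisions_a, decisions_b)) / n
    expected = sum((decisions_a.count(lbl) / n) * (decisions_b.count(lbl) / n)
                   for lbl in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical include/exclude decisions by two reviewers on a pilot sample.
reviewer_1 = ["include", "exclude", "include", "include", "exclude", "exclude"]
reviewer_2 = ["include", "exclude", "exclude", "include", "exclude", "exclude"]
print(round(cohens_kappa(reviewer_1, reviewer_2), 2))  # 0.67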

The information about the study selection process is expected to be described in the methodology and results sections of the review.

Item 6. Did the authors of the review use a reliable data extraction process? To ensure repeatability of the study and to avoid bias, it is important that the data extraction is not solely performed by a single researcher.

To rate ’Partial Yes’, appraisers should check if the authors have performed pilot extraction of data on a sample of the included studies to develop a shared understanding of the data extraction form.

To rate ’Yes’, appraisers should ensure that data is extracted by at least two authors of the review either from all included studies OR from at least a sample of included studies. It is important to check that the review report provides a description of the mechanism used to achieve consensus and shared understanding on which data to collect.

The information of the data extraction process is generally described in the methodology section of the review report.

Item 7: Did the authors of the review provide a list of excluded studies, along with the justifications for exclusion, that were read in full text? This item refers to studies that were deemed relevant by the authors of the review based on reading the title and abstract. However, after full-text reading, the authors concluded that the papers are not relevant to the current review. It is expected that the authors of the review document such papers along with the reason for their exclusion. This helps increase confidence in the results, allows reflecting on the selection criteria used in the study, allows replications, and enables further research (for example, by leveraging the filtered list of papers for a different analysis).

To rate ’Partial Yes’, appraisers should see that the authors of the review have provided a list of such potentially relevant papers. In order to rate ’Yes’, justifications for excluding the potentially relevant papers should also be provided.

This documentation (i.e., a list of potentially relevant papers that were excluded after full-text reading and justifications for excluding them) can be made available in an appendix or as supplementary material for review online (along with other supporting material like the review’s protocol).

Item 8: Did the authors of the review provide sufficient primary studies’ characteristics to interpret the results? The relevance and reliability of a systematic review depend, among other things, on a number of factors related to the included studies, such as their type (e.g., case study, survey, or experiment), context (real-life or laboratory setting), participants (practitioners or students), and publication venue (e.g., a reputable conference or journal). The review report should describe adequate details about the characteristics of the included studies to inform the review readers about the kind of evidence that is used to draw conclusions.

To rate ’Yes’, appraisers should ensure that the authors of the review have provided enough details about the population, interventions (when relevant), outcomes (when relevant), research designs and settings of the included studies.

These details may not be described in one place in a review report, and therefore they can be challenging to find. Normally, part of this information is described at the start of the results section of a review report.

Item 9: Did the authors of the review use an appropriate instrument for assessing the quality of primary studies that were included in the review? Due to several reasons, including the variety of research designs used in primary studies, reporting quality, and the use of inconsistent terminology, quality assessment is a challenging task in software engineering systematic literature reviews Kitchenham et al. (2015); Kitchenham and Brereton (2013). Several research-design-specific checklists (e.g., for experimentation Wohlin et al. (2012) and case study research Höst and Runeson (2007)) and generic instruments (e.g., Wieringa (2012); Condori-Fernandez et al. (2012)) have been proposed in the literature. However, as concluded by Kitchenham et al. Kitchenham et al. (2015), it is not feasible to use the same instrument to assess the quality of different types of studies.

To rate ’Yes’, appraisers should ensure that the choice of the instruments used (whether an existing one or one formulated by the authors of the review) has been justified given the goals of the SLR and the nature of the included studies. Furthermore, the instrument used is expected to evaluate at least the research design, data collection and analysis, and the strength of evidence given the stated goals of the primary study.

The information on the quality assessment of the primary studies is expected to be described in the methodology and results sections of the review report.

Item 10: Did the authors of the review use a reliable quality assessment process? As with Items 5 and 6, it is important that the quality assessment is not performed solely by a single author of the review.

To rate ’Partial Yes’, appraisers should check if at least two authors have performed pilot quality assessment of a sample of the included studies to evaluate the objectivity of the quality assessment criteria and to develop a shared understanding of it.

To rate ’Yes’, appraisers should see that at least two authors of the review independently performed the quality assessment of either all included studies or a sample of included studies (with the remaining performed by one review author) and achieved good consensus. The review report should also describe how differences were resolved in case of different quality scores.

The information about the quality assessment process is typically expected in the methodology section of the review report.

Item 11: Were the primary studies appropriately synthesized? Synthesis is one of the most important and also challenging parts of a systematic literature review. Without synthesis, the review would be of limited use.

In order to score ’Yes’, appraisers need to see if the authors of the review have used and justified an appropriate method for synthesis. It may be the case that the authors of the review do not use the correct or appropriate name for the used synthesis method Cruzes and Dybå (2011). In that case, appraisers would have to carefully read the review report in order to make a decision on this item. The appraisers should further check if the selected synthesis method was appropriately applied and that there is a clear chain of evidence from the answers to the research questions to the data from the primary studies.

The information about the synthesis method and its output may be documented in a separate section. In some cases it may be described in the discussion section after the results section. It could also be the case that the justification for selecting a specific synthesis method is described in the research methodology section, while the outputs of the synthesis step are described in a separate section.

Item 12: Did the authors of the review account for quality of individual studies when interpreting/discussing the results of the review? A review should take the quality of the individual studies into account when interpreting the results. This will increase the confidence in the findings and conclusions of the review.

To rate ’Yes’, appraisers should see that either the review has excluded studies that do not meet the quality criteria defined in the study, or the analysis and conclusions are separated based on the quality of the included studies. The information on using the quality of studies while interpreting the results is expected to be described in the discussion or analysis sections of the review report where results are further discussed/analyzed to draw conclusions.

Item 13: Did the authors of the review account for primary studies’ characteristics when interpreting/discussing the results of the review? There are many factors that can cause heterogeneity in the results of the included studies. It is important to analyse the causes of any heterogeneity in the results while interpreting the results and drawing conclusions. For example, variations in contextual factors (e.g., students versus practitioners as subjects) may lead to differences in the results of different studies. Furthermore, quality scores or some specific quality criteria might also help explain the heterogeneity observed in the results Kitchenham et al. (2015). This item is concerned with the use of the study characteristics covered in item 8.

In order to rate ’Yes’, appraisers should see that the authors of the review have investigated the impact of study characteristics on the findings of the review.

This discussion is likely to be found after the results section of the review report.

Item 14: Did the authors of the review provide appropriate recommendations and conclusions from the review? The usefulness of results of the review for the target stakeholders is critical to assess the relevance of the review. This item is a reflection on the aims as motivation for the review assessed in the first item of the instrument (i.e., item 1).

For ‘Partial Yes’ the review should have satisfactory recommendations and conclusions based on the review results.

For ‘Yes’ the recommendations and conclusions from the review shall be traceable to the review results, clearly target specific stakeholders, be well aligned with the motivation, and provide new insights to the community.

Item 15: Did the authors of the review report their own potential sources of conflict of interest, including any funding they received for conducting the review? To ensure the reliability of a review, it is important that the authors of the review report their sources of funding and any other conflicts of interest. Disclosing the sources of funding is straightforward. However, identifying other types of conflicts of interest is less so.

For example, if the authors of the review have published on the topic of the review or have a vested interest in the outcome of the review, there is a potential for bias when selecting, analyzing and interpreting their own work and studies with competing alternatives.

It is encouraged that the authors of a review are experts in the topic area, and it is therefore common that they have published extensively in that area. Thus, it is important that the authors of the review report their effort in identifying any conflicts of interest relevant to the review. A mitigation strategy in this case is to establish a protocol and have it reviewed by independent researchers not participating in the literature review.

To rate ’Yes’, the appraisers should ensure that the authors of the review have reported on the presence or absence of any conflicts of interest. In case there was some conflict of interest, the authors of the review should have described and justified the steps taken to mitigate the threat of bias in the results of the review.