A Systematic Literature Review of Automated Techniques for Functional GUI Testing of Mobile Applications

Context. Multiple automated techniques have been proposed and developed for GUI testing of mobile applications, aiming to improve effectiveness, efficiency, and practicality. These three characteristics are fundamental to any testing technique and need to be continuously improved to deliver useful solutions for researchers, practitioners, and the community as a whole. Objective. In this systematic review, we attempt to provide a broad picture of existing mobile testing tools by collating and analysing their conceptual and performance characteristics, including an estimation of effectiveness, efficiency, and practicality. Method. To achieve our objective, we specify 3 primary and 14 secondary review questions, and conduct an analysis of 25 primary studies. We first analyse each primary study individually, and then analyse the primary studies as a whole. We developed a review protocol which defines all the details of our systematic review. Results. Regarding effectiveness, we conclude that testing techniques which implement model checking, symbolic execution, constraint solving, or search-based test generation tend to be more effective than those implementing random test generation. Regarding efficiency, we conclude that testing techniques which implement code search-based testing approaches tend to be more efficient than those implementing GUI model-based approaches. Regarding practicality, we conclude that the more effective a testing technique is, the less efficient it tends to be. Conclusion. For effectiveness, we observe that the existing automated testing techniques are not effective enough; currently they achieve roughly half of the desired level of effectiveness. For efficiency, we observe that current automated testing techniques are not efficient enough.

1 Introduction

A systematic literature review (in short, “systematic review”) is a useful means of collating and summarizing the results of all existing research works relevant to a particular research question, topic, or phenomenon of interest fink2013conducting5 ; petticrew2008systematic ; kitchenham2004procedures7 ; Kitchenham8 . A systematic literature review is a form of secondary study which uses a well-defined methodology to identify, analyse, and interpret the results of the related research works in an unbiased manner, and it should also be repeatable, at least to a certain degree. The research works used in systematic reviews are called primary studies. In fact, the most reliable evidence of knowledge comes from primary studies that are aggregated on a particular topic in one single place, i.e., in systematic reviews. Therefore, systematic literature reviews are the recommended form of aggregation for empirical studies fink2013conducting5 ; petticrew2008systematic ; khan2011systematic4 .

The original purpose of conducting systematic literature reviews was to support evidence-based medicine. Software engineering research has relatively few empirical results compared to medical research, and its research methods are not as rigorous as those used in the medical domain. As such, Kitchenham adapted the medical guidelines for systematic reviews to software engineering kitchenham2004procedures7 . Later on, the initial software engineering guidelines were updated with insights from sociology research Kitchenham8 .

A systematic literature review is a form of data collation and analysis which requires considerably more effort than a conventional literature review. First, a systematic review should define a review protocol which specifies the review questions to be addressed and the research methods to be used to perform the review (defining a review protocol is necessary to reduce the possibility of researcher bias). After specifying the review questions, a systematic review uses the review protocol to describe its review process, including the research methods. In the review protocol, a systematic review should specify the search strategy, the study selection process with inclusion and exclusion criteria for primary studies, quality assessment criteria, data collection and analysis, and the dissemination strategy.

For the search strategy, a systematic review should specify the search terms and the resources to be searched for primary studies. Resources may include digital libraries, specific journals, conference proceedings, grey literature, the Internet, and others. For study selection, a systematic review should specify inclusion and exclusion criteria which determine which studies are included in, or excluded from, the review. For quality assessment, a systematic review should specify quality questions to assess the data quality of the primary studies. For data collection and analysis, a systematic review should specify how the information required from each primary study will be obtained, and how the data will be presented to help answer the review questions. For the dissemination strategy, a systematic review should specify how the results will be circulated to potentially interested parties.

Conventional systematic literature reviews aggregate results related to a specific review question, e.g., “Is testing technique A more effective at fault detection than B?” However, there are two other types of systematic studies which complement systematic literature reviews: systematic mapping studies and tertiary studies. Systematic mapping studies have broader review questions, e.g., “What is the current status of the topic of interest X?” KITCHENHAM2010792 . A systematic mapping study allows the primary studies in a specific topic area to be classified at a high level of granularity, which helps to identify areas where more primary studies should be conducted. A systematic tertiary study can be performed in a domain where a number of systematic reviews already exist. It is a review of secondary studies related to the same research question, i.e., a systematic review of systematic reviews, conducted to answer even wider review questions.

There are 3 main stages in a systematic review: planning, conducting, and reporting. During the planning stage, the systematic review should identify the need for a review, specify the review questions, and develop the review protocol. During the conducting stage, the systematic review should specify the search strategy, selection criteria, quality assessment criteria, and data extraction and synthesis. During the reporting stage, the systematic review should specify the dissemination strategy and the format of the final report.

In this systematic review, we focus on automated functional GUI testing of mobile applications (simply “apps”). We specify 3 review questions regarding the effectiveness, efficiency, and practicality of the testing techniques. To the best of our knowledge, we are the first to specify such review questions for the topic of automated GUI testing of mobile apps. In fact, mobile research is a relatively new area, so there may not be many primary studies which can be aggregated by systematic reviews in an unbiased manner. During our search for related systematic studies, we found 1 systematic literature review slr1 (which also includes a mapping study as a part), 4 systematic mapping studies slr1 ; map1 ; map2 ; map3 , and 1 survey surv1 , all of which were conducted for mobile software testing.

In this systematic review, for the primary studies search, we use 11 digital libraries, 8 academic search engines, 66 individual journals, and 34 conferences. We search for primary studies published between 2010 and 2018, inclusive. Using our search strategy, we found 3,639 primary studies in total. After applying our selection strategy, we obtained 47 primary studies relevant to automated testing of mobile apps. Next, using our quality assessment criteria, we selected 25 primary studies as the final set for analysis, and excluded 22 primary studies.

This systematic review has been prepared using the suggested guidelines for performing systematic literature reviews in software engineering slrguide . Following the suggested structure and contents of final reports for systematic reviews, we organize our systematic review as follows. In Section 2, we describe the review protocol used to conduct this systematic review. In Section 3, we give a background of the topic of interest, including a summary of previous systematic studies, the motivation of this systematic review, and the review questions. In Section 4, we conduct our systematic review: we specify the search strategy and data sources, the study selection process with inclusion and exclusion criteria, the study quality assessment, and the data extraction and synthesis. In Section 5, we discuss the results of the primary studies, their benefits, adverse effects, and gaps. We also discuss how the results might vary if the techniques were applied at larger scales. In Section 6, we discuss the validity of the results, considering bias in our systematic review. In Section 7, we summarize the results and give recommendations to researchers for possible improvements of existing techniques and future research directions. For practitioners, we also highlight practical implications based on the results of this systematic review.

2 Review protocol

Our review protocol has been developed following the guidelines for performing systematic literature reviews in software engineering slrguide .

Background

We first give a background in Section 3, where we introduce common concepts of event-driven software and graphical user interfaces, and the importance of GUI testing. We also provide a summary of related previous systematic studies (Section 3.1), explain the motivation of this systematic review, and specify the review questions to be answered by this review (Section 3.2).

Data sources

We identify the data sources to be searched in Section 4.1, where we introduce 11 major digital libraries (Section 4.1.1), 8 major academic search engines (Section 4.1.2), 66 individual journals (Section 4.1.3, Table 1), and 34 conferences (Section 4.1.4, Table 2).

Search strategy

We describe the search strategy in Section 4.2, where we identify the search string used to search for primary studies. We search all our data sources for primary studies published between 2010 and 2018 inclusive and related to graphical user interface testing of mobile apps. In total, our search strategy found 3,639 primary studies (Table 3).

Study selection strategy

We describe the study selection process in Section 4.3, where we determine which studies are included in, and excluded from, our systematic review. For study selection, we assigned 3 authors. Any disagreements were resolved during a group meeting with the supervisor.

To apply the general criteria for inclusion and exclusion of primary studies (Section 4.3.1, Table 4), we equally distributed the found primary studies among the 3 authors. After this filtering step, we obtained 47 primary studies in total. Next, we performed quality assessments of the 47 selected primary studies. For each data quality question, we assign a weighting coefficient from [0…1] in steps of 0.1, depending on the extent to which the study answers the quality question; ‘0’ indicates poor data quality, i.e., the study does not answer this particular data quality question at all, while ‘1’ indicates excellent data quality for this question.

To apply the quality criteria for inclusion and exclusion of primary studies (Section 4.3.2, Table 5), we assigned all 47 primary studies to all 3 authors. Using our quality questions, each author assigned his/her own weighting coefficients to each quality question for each primary study. Next, the assigned values were averaged to obtain the final weighting coefficient for each quality question. By summing the obtained average values, we computed a total weight for each primary study. Upon discussion in our research group, we set a minimum threshold for the total weight (quality index) required to include a primary study. If the total weight of a primary study is 2.50 (i.e., 50% of the possible maximum of 5.00) or more, we include the study in this systematic review; otherwise, we consider the study to be of poor quality and exclude it from consideration. After this filtering step, in total (Table 6), we included 25 (Table 7) and excluded 22 (Table 8) primary studies.

Data extraction strategy

We describe the data extraction process in Section 4.4, where we define how the information required from each primary study will be obtained. We developed 3 data forms to collect and tabulate the data in a way that helps us answer the secondary and primary review questions of this systematic review. Example data forms are shown in Table 9, Table 10, and Table 11.

For data extraction, we assigned 2 authors. Each author extracted data independently from all 25 primary studies. Next, the filled data forms were collected from the authors, and the provided data was compared. Any conceptual mismatches were resolved during a discussion with the other 2 authors of this systematic review.

Data synthesis

We describe the extracted data in Section 4.5, where we tabulate the data in a way that helps answer our review questions (Table 12, Table 13, and Table 14). We undertake a descriptive synthesis and do not perform any meta-analysis of the primary studies. We collate and summarize the extracted data using the “line of argument synthesis” approach noblit1988meta , where we first analyse each primary study individually, and then analyse the primary studies as a whole.

Dissemination strategy

We provide the results in Section 5, where we discuss the findings of the primary studies. We report our results in 2 formats: (1) as a journal paper, and (2) as a chapter of a PhD thesis.

3 Background

A Graphical User Interface (GUI) is acknowledged as a crucial component of event-driven software (e.g., mobile apps) 5401169 ; chaudhary2016metrics ; 6571622 . In event-driven software such as real-world apps, the app GUI usually contains hundreds or even thousands of elements mckain2008graphical . According to 637386 ; memon2002gui ; Myers1995 , a large part of the app code is dedicated to the user interface, so its testing is an essential part of the software development life cycle (SDLC), i.e., the process of dividing development work into several phases to improve design, product, and project management, and may significantly improve the overall quality of the software Harrold2000 ; mcconnell1996 . During the testing phase, the GUI can be tested by executing each event individually and observing its behaviour 908959 . However, this is not a trivial task, since the behaviour of an event handler may depend on the GUI's internal state, the state of other entities (objects, event handlers), and the external environment. Furthermore, the outcome of an event handler execution may vary depending on the concrete sequence of preceding events. As a result, each GUI event needs to be tested in the context of different states by generating and executing various sequences of GUI events acteve36 ; Yuan2007 .
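To make the combinatorial nature of this problem concrete, the short sketch below is purely illustrative (the event names are invented and not taken from any of the primary studies): it enumerates all event sequences up to a given length and shows how quickly the number of sequences that would need to be executed grows.

# Illustrative sketch only: enumerate GUI event sequences up to length k.
# The event names are invented; a real app exposes many more events per screen.
from itertools import product

EVENTS = ["tap_login", "tap_menu", "type_text", "swipe_left", "press_back"]

def event_sequences(events, max_len):
    # All ordered sequences of length 1..max_len (order matters, repeats allowed),
    # i.e. len(events)^1 + len(events)^2 + ... + len(events)^max_len sequences.
    for k in range(1, max_len + 1):
        yield from product(events, repeat=k)

for k in (1, 2, 3, 4):
    print(k, sum(1 for _ in event_sequences(EVENTS, k)))
# With only 5 events, sequences up to length 4 already number 5+25+125+625 = 780.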

The purpose of a GUI is to simplify user interaction with the app. The GUI takes user actions (e.g., touches, selections, typing, etc.) as input and changes the state of its GUI elements by translating the user actions into platform-specific event handlers that execute the corresponding app functionality. With such an “event-handler architecture”, event handlers can be created and maintained fairly independently, so that complex software can be built from these loosely coupled pieces of code while offering many degrees of usage freedom via its GUI (e.g., users may choose to perform a particular task in different ways in terms of possible user actions, their number, and their execution order).

Modern mobile apps have a highly interactive nature and a complex GUI structure. As such, automated GUI testing of mobile apps is a daunting task for developers and testers. In fact, GUI testing is often done manually, where all possible combinations of the GUI elements for a given app screen are manually tested for functional correctness and aesthetic quality. Manual GUI testing is no doubt effective; however, it is inefficient, i.e., time-consuming, error-prone, and usually incomplete, especially for large software with complex GUIs. So, to facilitate manual testing, various automated testing techniques have been introduced, such as model-based testing utting ; utting2010practical ; sof207 ; broy2005model ; 1553582 ; Dalal1999 , concolic testing Sen2007 ; acteve19 ; paquesymbolic , search-based testing baldoni2016survey ; 5954405 , evolutionary testing WEGENER2001841 ; evodroid37 , and combinatorial testing autodroidcomb . Depending on which testing technique is implemented, GUI testing can be performed via dynamic or static program analysis methods, or their combination.

3.1 Previous systematic studies

For related systematic studies conducted for mobile software testing, we found 1 systematic literature review slr1 , 4 systematic mapping studies slr1 ; map1 ; map2 ; map3 , and 1 survey surv1 . We describe them in order, starting from the systematic literature review, followed by the mapping studies, and finally the survey.

Systematic mapping and literature review

This systematic study slr1 was conducted by 3 authors in 2015. The work is entitled “Automated Testing of Mobile Applications: A Systematic Map and Review” and comprises 15 pages. Note that this systematic study combines a mapping study and a literature review. The authors conducted the study to identify and collate evidence about the current state of research in automated GUI testing of mobile apps. They identified and characterized automated testing approaches and techniques, and investigated major challenges via analysis and synthesis of the selected primary studies.

To drive the systematic study, the authors specified 3 review questions, and 3 mapping questions.

  1. review: What are the challenges of automated testing of mobile applications?

  2. review: What are the different approaches for automated testing of mobile applications?

  3. review: What is the most used experimental method for evaluating automated testing of mobile applications?

  4. mapping: Which are the main journals and conferences for automated testing of mobile applications?

  5. mapping: Which are the main authors for automated testing techniques research?

  6. mapping: How is the frequency of papers distributed according to their testing approach?

In total, 83 primary studies were selected for analysis. The authors tabulated and synthesized the results in a way that helps provide recommendations to practitioners about automated testing of mobile apps. The popularity of the main approaches was identified: model-based testing (30%), capture/replay (15.5%), model-learning testing (10%), systematic testing (7.5%), fuzz testing (7.5%), random testing (5%), and script-based testing (2.5%). The authors conclude that the number of proposed approaches and techniques for automated testing of mobile apps has increased. They also highlight that in 40% of the selected primary studies, the automated testing techniques use GUI-based models of the target apps.

Systematic mapping study

This systematic mapping study map3 was conducted by 3 authors in 2015–2016. The work is entitled “A systematic mapping study of mobile application testing techniques” and comprises 23 pages. The authors conducted the systematic mapping to categorize and structure the research evidence available so far in the area of mobile app testing, including the approaches and techniques used and the challenges they address.

To drive the systematic mapping study, the authors specified 1 primary, and 3 secondary research questions:

  1. What are the studies that empirically investigate mobile and smart phone application testing techniques and challenges?

    1. What research approaches do these studies apply and what contribution facets do they provide?

    2. What kind of applications (industrial or simple) do these studies use in order to evaluate their solutions?

    3. Which journals and conferences included papers on mobile application testing?

In total, 79 primary studies were selected and classified. The authors identified several research gaps and key testing issues which could be interesting to practitioners. They report that only a few studies investigate real-world mobile environments or focus on eliciting testing requirements in the requirements engineering phase. The authors also highlight that there is no clear guidance for practitioners on how to choose an automated testing tool or testing technique from the variety of available ones. The authors suggest that there is a need for a clear road-map to guide practitioners and, for researchers, a need for more studies which address the issues of conformance to life cycle models, mobile services, and mobile testing metrics.

Systematic mapping study

This systematic mapping study map2 was conducted by 3 authors in 2015. The work is entitled “Mobile Application Verification: A Systematic Mapping Study” and comprises 17 pages. The authors conducted the systematic mapping focusing on the software verification aspect of mobile applications. They found definitive metrics and research evidence about mobile application testing which could help researchers identify possible gaps and new research directions.

To drive the systematic mapping study, the authors specified 5 research questions:

  1. What are the most frequently used test types for mobile applications? (Compatibility, Concurrent, Conformance, Performance, Security, Usability)?

  2. Which research issues in mobile application testing are addressed and how many papers cover the different research issues? (Test Execution Automation, Test Case Generation, Test Environment Management, Testing on Cloud, Model Based Testing)?

  3. At what test level have researchers studied most frequently? (Unit, Component, Integration, System, Acceptance)?

  4. What is the paper-publication frequency?

  5. Which journals include papers on mobile application testing?

In total, 123 primary studies were selected and classified. The authors summarized the studies conducted for mobile app testing and performed a gap analysis to provide a map of the state of the art in automated testing of mobile apps. They conclude that mobile software testing research is open to new contributions. In particular, research on performance testing may provide further opportunities, since there is a lack of studies in this area. The results also indicate emerging research needs in mobile app testing on the cloud, to deal with test execution automation or test environment management for system-level functional testing.

Systematic mapping study

This systematic mapping study map1 was conducted by 2 authors in 2016. The work is entitled “Quality Assurance of Mobile Applications: A Systematic Mapping Study” and comprises 13 pages. The authors conducted the systematic mapping to identify approaches which address the issue of quality assurance for mobile applications. They describe the approaches based on their test level focus and addressed qualities, and they discuss the research challenges addressed.

To drive the systematic mapping study, the authors specified 7 research questions:

  1. Which testing types of quality assurance approaches exist?

  2. Which testing levels are addressed?

  3. Which testing phases are addressed?

  4. Which qualities are addressed?

  5. Which kinds of automation are implemented?

  6. How are the approaches evaluated?

  7. Which challenges exist?

In total, 230 primary studies were selected and classified. The authors found that mainly system testing is considered, while both functional and non-functional properties are addressed during quality assurance, with a slightly stronger focus on the former. They also highlight that automation of the testing process plays an important role for mobile-specific quality assurance, especially at the GUI level; however, the maturity of such tools is low. For researchers, the results can help identify further research directions and motivate more accurate evaluation of the proposed approaches, especially for industrial cases.

Survey

This survey surv1 was conducted by 3 authors in 2009–2011. The work is entitled “Obstacles and opportunities in deploying model-based GUI testing of mobile software: a survey” and comprises 29 pages. The authors conducted this survey to identify possible obstacles and opportunities towards a wider deployment of the model-based testing approach in industry. The survey results indicate that, even though it is not yet widely used in industry, model-based testing attracts great interest among mobile software testing professionals. However, there is a need for further research to understand how to efficiently manage model construction during testing, since larger models are often impractical. In addition, uniform metrics of test effectiveness should be developed to enable comparison between different testing approaches. They also highlight that more research attention should be dedicated to developing testing techniques which can quickly localize bugs.

3.2 Motivation and review questions

Motivation

Since the beginning of the mobile era, various techniques have been proposed and developed for mobile app GUI testing. Some of them are fully automated, while others still rely on user input to a certain extent, i.e., semi-automated and manual techniques. Nevertheless, many of the developed testing techniques have resulted in a testing tool (an executable artefact). In fact, all these tools aim to increase test coverage, optimize model construction, and eventually deliver a solution which can be used in practice. The focus on increasing coverage is explained by the fact that the higher the test coverage, the more app functionality is tested, which enables the testing tool to potentially discover more app bugs and failures. Also, the more optimal the constructed model, the more efficient the testing tool will be. This aspect is just as critical as test coverage, because modern apps usually have complex code and GUI structures which innately yield large or extremely large models. However, such large models are unlikely to be fully (i.e., systematically) explored, resulting in lower test coverage; otherwise, the exploration time would grow exponentially and could even be unbounded, which is impractical.

Test coverage is de facto a useful means of showing the effectiveness of a testing tool, while optimal model construction shows its efficiency. In turn, practicality tightly depends on effectiveness and efficiency; however, it could also be judged from an individual estimation of effectiveness or efficiency. Effectiveness, efficiency, and practicality are 3 fundamental characteristics which testing tools are built upon and which need to be continuously improved in order to deliver useful solutions for researchers, practitioners, and the community as a whole. To the best of our knowledge, there is no systematic review conducted in the field which attempts to provide a broad picture of the existing mobile testing tools by collating and analysing their conceptual and performance characteristics, including an estimation of effectiveness, efficiency, and practicality. Therefore, in this systematic review, we decided to take up this challenge and attempt to evaluate various testing tools and their characteristics for automated functional GUI testing of mobile apps.

Review questions

Specifying the review questions (RQs) is the most important part of any systematic review, as they drive the entire systematic review methodology. In fact, the critical issue in any systematic review is to ask the right question(s). As such, in this section, we identify primary and secondary RQs which are meaningful and important to researchers, and which could also be valuable to practitioners.

To drive our systematic review, we specify 3 primary, and 14 secondary RQs. The primary RQs are as follows:

  1. RQ#1 How effective are the proposed GUI testing techniques for mobile apps?

  2. RQ#2 How efficient are the proposed GUI testing techniques for mobile apps?

  3. RQ#3 How practical are the proposed GUI testing techniques for mobile apps?

To facilitate RQ#1, we specify 6 secondary RQs:

  1. RQ#1.1 What model paradigm is used?

  2. RQ#1.2 What test generation approach is used?

  3. RQ#1.3 What test generation criteria are used?

  4. RQ#1.4 What textual input generation mechanism is used?

  5. RQ#1.5 What code coverage metric is used?

  6. RQ#1.6 What code coverage results are achieved on average?

To facilitate RQ#2, we specify 6 secondary RQs:

  1. RQ#2.1 What testing approach is used?

  2. RQ#2.2 What testing technique is used?

  3. RQ#2.3 What search algorithm is used?

  4. RQ#2.4 What termination condition is used?

  5. RQ#2.5 What app benchmark size is used?

  6. RQ#2.6 What execution time is taken per app?

To facilitate RQ#3, we specify 2 secondary RQs:

  1. RQ#3.1 How does effectiveness impact practicality?

  2. RQ#3.2 How does efficiency impact practicality?

4 Review Methods

In this section, we conduct the systematic literature review of the selected primary studies. We identify the data sources and specify the search strategy, perform study selection and quality assessment, and carry out data extraction and synthesis in accordance with our developed review protocol (see Section 2).

4.1 Data sources

The aim of a systematic review is to find as many primary studies relating to the research question as possible using an unbiased search strategy. Therefore, a rigorous search process is a crucial and distinguishing factor for systematic reviews as opposed to traditional ones. The first step in searching for primary studies can be undertaken using digital libraries; however, in practice, this is not sufficient for a complete systematic review. As such, other relevant sources must also be searched, e.g., reference lists from relevant primary studies and review articles, research and industrial (company) journals, grey literature (i.e., technical reports, white papers, unpublished work, and work in progress), conference proceedings, and the Internet in general. Also, using various sources for the primary studies search helps to mitigate a problem known as publication bias, which leads to systematic bias in systematic reviews. Publication bias is the problem that positive results are more likely to be published than negative ones slrguide . To address the issue of publication bias, we perform an exhaustive search for primary studies.

As suggested by Brereton et al. BRERETON2007571 , there is no single source which can find all the primary studies. Thus, we identify multiple data sources, including various digital libraries, academic search engines, individual journals, and conferences. Below, we introduce the selected relevant software engineering digital libraries, academic search engines, journals, and conferences. In this systematic review, the listed data sources are used for the search of primary studies.

4.1.1 Digital libraries

For digital libraries, we identify 11 major data sources which cover the software engineering domain and are relevant to software engineers. We define a digital library as a single source where electronic articles can be searched and downloaded as full text. The selected digital libraries are listed below.

  1. Research at Google (research.google.com/pubs/papers.html)

  2. IBM Research (domino.research.ibm.com/library/cyberdig.nsf)

  3. IEEE Xplore (ieeexplore.ieee.org/Xplore/home.jsp)

  4. ACM Digital Library (ACM DL) (dl.acm.org)

  5. Wiley Online Library (onlinelibrary.wiley.com)

  6. SpringerLink (link.springer.com)

  7. ScienceDirect (www.sciencedirect.com)

  8. JSTOR (www.jstor.org)

  9. ResearchGate (www.researchgate.net)

  10. arXiv (arxiv.org)

  11. Academia.edu (www.academia.edu)

4.1.2 Academic search engines

For the academic search engines (or simply “search engines”), we identify 8 major search sources. We define a search engine as a single interface where electronic articles can only be searched, with a link provided to an external source from which the found article (full text) can be downloaded. The selected search engines are listed below.

  1. Google Scholar (scholar.google.com)

  2. Microsoft Academic (MA) (academic.microsoft.com)

  3. Ei Compendex (www.engineeringvillage.com)

  4. Scopus (www.elsevier.com/scopus)

  5. Web of Science (www.webofknowledge.com)

  6. Inspec (theiet.org/inspec)

  7. CiteSeerX (citeseer.ist.psu.edu)

  8. dblp (Digital Bibliography & Library Project) (dblp.org)

Apart from the digital libraries and search engines, in our systematic review we also search for primary studies in individual journals and conference proceedings; we justify our selection criteria as follows. We select all potential journals and conferences whose aims and scope are in the software engineering domain, including software quality, testing, validation, verification, and reliability.

4.1.3 Journals

To select journals, we perform a manual search of the Master Journal List database from Clarivate Analytics (mjl.clarivate.com), which includes all journal titles covered by Web of Science. In particular, we use specific search terms such as “computer”, “information”, “technology”, “software”, “quality”, “testing”, “verification”, “validation”, and “reliability” to find journals which include these words in their titles. Note that the Clarivate Analytics engine searches for exact string matches of the search terms in the journal titles, so the search terms should not be combined. After performing the search using each of the individual search terms, we obtained 848 matching journals. Next, we manually verified the relevance of each of the journals by going through the lists of their titles and, if necessary, doing a quick review of their aims and scope. If a title was too generic and the aims and scope were not stated clearly, we searched for articles in the journal using keywords specific to our domain. If the search produced results, we reviewed the abstracts and conclusions of several found articles of the target journal to confirm its relevance; otherwise, we concluded that the journal was irrelevant. After removing irrelevant journals and duplicates, we obtained 66 relevant journals. In Table 1, we show the search results (search performed in January 2018) for the journals using the above-specified search criteria.

No. Search term in journal titles Total found journals Total relevant journals
1 computer 123 25
2 information 212 29
3 technology 411 7
4 software 27 15
5 quality 46 1
6 testing 15 1
7 verification 1 1
8 validation 1 0
9 reliability 12 1
Total journals 848 80
Total journals excluding duplicates 789 66
Table 1: Search results for journals in Master Journal List from Clarivate Analytics.

4.1.4 Conferences

To select conferences, we perform a manual search of the CORE2018 database from the CORE Conference Portal (portal.core.edu.au/conf-ranks), which provides information about a collection of Computer Science conferences. In particular, we use specific search terms such as “computer”, “information”, “technology”, “software”, “quality”, “test”, “verification”, “validation”, and “reliability” to find conferences which include these words in their titles. Note that the CORE Conference Portal engine searches for exact or partial string matches of the search terms in the conference titles. Next, we select those conferences which are recognized as flagship, excellent, or good software engineering conferences. We assume that research works published at such venues are likely to be of high quality in terms of the conducted research and the reported results, in comparison with other software engineering conferences. We identify flagship, excellent, and good software engineering conferences using the CORE Rankings Portal (core.edu.au/conference-portal):

  • A* – a flagship conference, a leading venue in a discipline area.

  • A – an excellent conference, a highly respected venue in a discipline area.

  • B – a good conference, a well regarded venue in a discipline area.

After performing the search using each of the individual search terms and filtering the matching conferences by their rank, we obtained 231 matching conferences with ranks A*, A, and B. Next, we manually verified the relevance of each of the conferences by going through the lists of their titles and, if necessary, reviewing their primary topics of interest as listed on their home pages. If a title was too generic and the topics did not indicate relevance to our systematic review, we searched for articles in the conference proceedings using keywords specific to our domain. If the search produced results, we reviewed the abstracts and conclusions of several found articles from the latest proceedings of the target conference to confirm its relevance; otherwise, we concluded that the conference was irrelevant. After removing irrelevant conferences and duplicates, we obtained 34 relevant conferences. In Table 2, we show the search results (search performed in January 2018) for the conferences using the above-specified search criteria.

No. Search term in conference titles Total found conferences Total relevant conferences
1 computer 77 9
2 information 59 5
3 technology 26 4
4 software 48 17
5 quality 4 2
6 test 8 3
7 verification 5 4
8 validation 1 1
9 reliability 3 3
Total conferences 231 48
Total conferences excluding duplicates 203 34
Table 2: Search results for A*, A, and B-rank conferences in CORE2018 database from CORE Conference Portal.

4.2 Search strategy

The process of performing a systematic literature review must be transparent and replicable. Therefore, the search process must be documented in sufficient detail so that readers are able to assess its thoroughness slrguide . We design our search strategy to identify existing systematic reviews, mapping studies, and potentially relevant primary studies. In particular, we prepared a search query using Boolean ANDs, ORs, and exact phrase matching, where the search query words may occur anywhere in the article. Also, for all the data sources, we search (if the option is available in the search source) for articles dated between 2010 and 2018 inclusive. We justify the lower boundary (i.e., 2010) by the fact that the first releases of the touch-screen-based operating systems (e.g., Android and iOS) became available to the wide public in the late 2000s. To construct our search query, we use field-specific keywords closely related to the topic of interest, which directly target the research area covered in this systematic review. Therefore, we construct a search query which is neither too generic nor too narrow. As such, we believe that our search strategy finds most of the potentially relevant primary studies, while filtering out irrelevant ones.

Due to the relatively large number of data sources used in this systematic review, we give an example of how we constructed our search query, using the Google Scholar search engine as an instance. For the other data sources, we apply the same principles of the search strategy using the “Advanced search” or “Expert search” option (if available in the search source), with only possible variations in the syntax and/or search terms of the search query (depending on the search source), while preserving the same semantics of the query to ensure the same quality of the search. For example, in Google Scholar, we use the “Advanced search” option, where we construct our search query as follows: mobile graphical OR user OR interface OR box OR functional OR android OR ios OR execution OR systematic OR random OR symbolic OR concolic OR model OR online OR reliability OR verification ”gui testing”. Using this search query, the Google Scholar search engine finds articles (1) with all of the words mobile, (2) with the exact phrase ”gui testing”, and (3) with at least one of the words graphical OR user OR interface OR box OR functional OR android OR ios OR execution OR systematic OR random OR symbolic OR concolic OR model OR online OR reliability OR verification, which may occur anywhere in articles dated between 2010 and 2018 inclusive.

Here we justify our choice of this search strategy. First, we require the word mobile to be in the article, since the focus of this systematic review is on mobile testing techniques. Second, we require the exact phrase ”gui testing” to be in the article. We piloted our search strategy and identified that this particular phrase is very likely to be used by authors whose articles are related to the testing of GUIs. Third, we require other words, at least one of which must be in the article. Any of these words can be used in articles which are relevant to the testing of GUIs. It is important to note that all these words are connected by Boolean “OR”, making our search strategy more greedy, while they are also connected with mobile and ”gui testing” by Boolean “AND”, thus targeting the most relevant studies.
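To make the structure of the query explicit, the sketch below is an illustration only (it is not part of our review tooling) and simply assembles the same Boolean query string programmatically; the term lists mirror the ones given above.

# Minimal sketch (illustrative only, not part of the review tooling) that
# assembles the Google Scholar "Advanced search" query described above:
# all of "mobile", the exact phrase "gui testing", and at least one OR-term.
REQUIRED_ALL = ["mobile"]
EXACT_PHRASE = '"gui testing"'
ANY_OF = [
    "graphical", "user", "interface", "box", "functional", "android", "ios",
    "execution", "systematic", "random", "symbolic", "concolic", "model",
    "online", "reliability", "verification",
]

def build_query(required_all, any_of, exact_phrase):
    # Google Scholar's advanced form joins its fields with an implicit AND;
    # the "at least one of the words" field joins its terms with OR.
    return " ".join(required_all) + " " + " OR ".join(any_of) + " " + exact_phrase

print(build_query(REQUIRED_ALL, ANY_OF, EXACT_PHRASE))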

In Table 3, we list the search results (search performed in January 2018) for the potentially relevant articles yielded by the specified search strategy.

Search source Total sources searched Total found articles
Digital libraries 11 958
Search engines 8 2,124
Journals 66 208
Conferences 34 349
Total 119 3,639
Table 3: Search results for articles using generated search strategy.

4.3 Study selection

In systematic reviews, once the potentially relevant primary studies have been identified, they need to be further assessed for their actual relevance to the topic of interest. For that purpose, study selection criteria need to be developed to identify those primary studies which provide direct evidence of relevance to the research question(s) raised by the systematic review. We show the overall process of study selection visually in Figure 1.

Figure 1: Study selection: overall process of study selection including study search, screening, selection strategy, and quality assessment.

4.3.1 Selection strategy

To select relevant primary studies, we first excluded, from the found set of 3,639 articles, all articles which have incomplete reference record information, non-English articles, and duplicates (i.e., articles with the same titles). After filtering, we obtained 1,182 unique (by title) articles with complete reference record information, all of which are written in English. Next, we identify relevant primary studies by reviewing their abstracts and the semantics of their titles and keywords. If the abstract, title, and keywords do not provide sufficient confidence, we also review the conclusions of the target article to confirm its direct relevance to our topic of interest. From the set of 1,182 unique (by title) articles, we also excluded those which publish the same data: when there are two or more publications of the same data, we include the most complete one and exclude the others. After applying our general inclusion and exclusion criteria, we obtained 47 unique (by title and data) relevant articles. It is important to note that we excluded Patents, Books, Lecture Notes, and Keynotes from this systematic review, since they innately stay outside the interest of systematic reviews due to their specific communication style, which is not in line with a scientific research design. We also excluded Theses, since their data has already been published either at conferences or in journals.
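The first filtering step described above is mechanical (dropping non-English articles, incomplete records, and duplicate titles), so it can be illustrated with a short sketch. The record fields (title, language, has_complete_reference) are hypothetical names chosen for illustration; they do not reflect the actual format of our bibliographic exports.

# Minimal sketch (hypothetical record format) of the first, mechanical
# filtering step: keep English articles with complete reference records,
# and drop duplicate titles (case-insensitive).
from dataclasses import dataclass

@dataclass
class Record:
    title: str
    language: str
    has_complete_reference: bool

def first_pass_filter(records):
    seen_titles = set()
    kept = []
    for r in records:
        if r.language != "English" or not r.has_complete_reference:
            continue  # exclude non-English or incomplete records
        key = r.title.strip().lower()
        if key in seen_titles:
            continue  # exclude duplicate titles
        seen_titles.add(key)
        kept.append(r)
    return kept

# Example: 3 records, one duplicate title and one non-English article.
records = [
    Record("GUI Testing of Mobile Apps", "English", True),
    Record("gui testing of mobile apps", "English", True),   # duplicate by title
    Record("Pruebas de GUI", "Spanish", True),                # excluded by language
]
print(len(first_pass_filter(records)))  # -> 1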

In Table 4, we provide a full list of general inclusion and exclusion criteria for primary studies selection.

No. Inclusion Exclusion
1 Written in English Other languages
2 Technical* paper Other types of communications or works
3 Conducted for mobile apps Other applications
4 Focus on automated** functional GUI testing Other types of GUI testing
5 Provide complete*** relevant bibliography Incomplete bibliography
  • * this includes full complete technical research papers.

  • ** this includes studies with fully-automatic or semi-automatic testing techniques.

  • *** this includes relevant references from 2007 onwards, inclusive.

Table 4: General criteria for inclusion and exclusion of primary studies.

4.3.2 Quality assessment

In addition to the general inclusion and exclusion criteria, it is considered critical to assess the “quality” of the selected primary studies. The quality criteria provide more detailed grounds for inclusion and exclusion of the primary studies. However, there is no agreed definition of study “quality”, which makes the quality assessment initially difficult. Nevertheless, the CRD Guidelines khan2001undertaking and the Cochrane Reviewers’ Handbook cochrane2003 both suggest that quality relates to the extent to which the primary study minimises bias and maximises internal and external validity. As such, we prepare quality criteria aimed at assessing the extent to which primary studies have addressed their bias and validity. It is important to note that, when forming the quality criteria, we keep in mind that primary studies often report their results poorly, so it may not always be possible to assess their quality, and we should not simply assume that because something was not reported, it was not actually done. As suggested by Petticrew and Roberts petticrew2008systematic , the quality criteria need to address not only the reporting quality, but also the methodological quality of the conducted research. We rigorously developed data quality questions to select the most credible, well-designed, complete, and coherent research, ensuring that this systematic review analyses only primary studies of reasonable quality.

Using our quality criteria, we assess the quality of the 47 relevant articles to confirm their inclusion in, or exclusion from, this systematic review. Quality assessment helps to further evaluate the selected primary studies and to conclude to what extent they help answer the RQs of this systematic review. Note that our quality criteria are used only to assist primary study selection by providing additional details for the inclusion and exclusion criteria; we do not use the quality criteria to assist data analysis and synthesis.

In Table 5, we provide a full list of quality criteria for primary studies selection.

No. Quality question
1 Is the research problem clearly stated?
2 Does it discover novel aspects not present in other studies?
3 Is the research design properly documented?
4 Is the primary study outcome properly described and discussed?
5 Have threats to study validity been discussed?*
  • * at least internal and external threats.

Table 5: Quality criteria for inclusion and exclusion of primary studies.

To more accurately assess the data quality of the primary studies, we constructed a measurement scale for each quality question. For each data quality criterion, we assign a weighting coefficient from [0…1] in steps of 0.1, depending on the extent to which the study answers the quality question; ‘0’ indicates no data quality, i.e., the study does not answer this particular data quality question at all, while ‘1’ indicates excellent data quality for this question. For each data quality criterion, the weighting coefficients were independently assigned by 3 authors of this systematic review. Next, the assigned values were averaged to obtain the final weighting coefficient for each quality criterion. By summing the obtained average values, we computed a total weight for each primary study. Upon discussion in our research group, we set a minimum threshold for the total weight (quality index) required to include a primary study. If the total weight of a primary study is 2.50 (i.e., 50% of the possible maximum of 5.00) or more, we include the study in this systematic review; otherwise, we consider the study to be of poor quality and exclude it from consideration. Any disagreements were resolved during a group meeting with the supervisor.
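Since the quality index is computed purely arithmetically from the per-question scores, the sketch below illustrates the calculation. The scores used here are invented for illustration; only the procedure (averaging over the 3 assessors per question, summing over the 5 questions, and comparing against the 2.50 threshold) follows the description above.

# Minimal sketch of the quality-index computation described above.
# scores[q] holds the weights (0.0..1.0, step 0.1) that the 3 assessors assigned
# to quality question q for one primary study; the values below are invented.
THRESHOLD = 2.50  # 50% of the maximum 5.00 (5 questions x 1.0)

def quality_index(scores):
    # Average the 3 assessors' weights per question, then sum over the questions.
    return sum(sum(per_question) / len(per_question) for per_question in scores)

scores = [
    [0.8, 0.9, 0.7],  # Q1: research problem clearly stated?
    [0.6, 0.5, 0.6],  # Q2: novel aspects?
    [0.7, 0.8, 0.8],  # Q3: research design documented?
    [0.9, 0.9, 0.8],  # Q4: outcome described and discussed?
    [0.3, 0.4, 0.2],  # Q5: threats to validity discussed?
]

total = quality_index(scores)
print(f"quality index = {total:.2f}, included = {total >= THRESHOLD}")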

Using our general and quality criteria for inclusion and exclusion of primary studies, we included 25 primary studies for analysis in total, while 22 primary studies were excluded. In Table 6, we show the final numbers of included and excluded articles obtained from the different publishing venues, including Journals, Conferences, Workshops, Symposiums, Technical Reports, Magazines, White Papers, the Internet, and Pre-prints. In Table 7, we provide a full list of the primary studies included in this systematic review for analysis. As shown in Table 7, we describe each study by listing its attributes. In the descriptions, we group and sort the primary studies by year of publication in descending order, so that the distribution of articles by publication year can be easily seen. Within a group, we sort the articles by their weight and, for the same weight, by venue rank (for conferences) or impact factor (for journals), in descending order. In Table 8, we provide a full list of the excluded articles with a rationale for exclusion. We group the articles by their type and, within each group, sort them by venue rank (for conferences) or impact factor (for journals) and, for the same rank or impact factor, by publication year, in descending order. The articles have been excluded for various reasons, mainly because they have a different focus from this systematic review. However, being relevant to the topic of functional GUI testing in mobile, they may still be of interest to the reader. Therefore, we maintain a list of the excluded relevant primary studies so that they can be found in the “References” section of this systematic review.

Article type Total included articles Total excluded articles
Journal 3 5
Conference* 20 17
Other** 2 0
Total 25 22
  • * this includes Conferences, Workshops, and Symposiums.

  • ** this includes Technical Reports, Magazines, White Papers, Internet, and Pre-prints.

Table 6: Final number of articles included in, and excluded from this systematic review.

* these values indicate a total weight (quality index) of the article; it was obtained by summing the averaged weighting coefficients for each data quality criterion (see Table 5).

** in this column, we use Unknown as we could not identify publishing venue for the article.

*** for conferences, the rank values (A*, A, B, and C) are given in accordance with CORE2018 “portal.core.edu.au/conf-ranks”; for journals, the impact factor (IF) values are given as of February, 2018.

Article Reference Weight* Article Type** Year Publishing Venue Rank/IF***
autodroidcomb 5.00 Journal 2018 Information and Software Technology 2.694
androframe 5.00 Unknown 2018 Internet
mobolic 5.00 Journal 2017 Software: Practice and Experience 1.609
stoat 5.00 Conference 2017 Joint Meeting on Foundations of Software Engineering (ESEC/FSE) A*
aimdroid 5.00 Conference 2017 International Conference on Software Maintenance and Evolution (ICSME) A
ehbdroid 4.60 Conference 2017 International Conference on Automated Software Engineering (ASE) A
patdroid 4.00 Conference 2017 Joint Meeting on Foundations of Software Engineering (ESEC/FSE) A*
xdroid 3.33 Conference 2017 Computer Software and Applications Conference (COMPSAC) B
land 3.30 Conference 2017 International Conference on Software Quality, Reliability and Security (QRS) B
droidwalker 2.67 Unknown 2017 arXiv preprint
sapienz 5.00 Symposium 2016 International Symposium on Software Testing and Analysis (ISSTA) A
guicc 4.00 Conference 2016 International Conference on Automated Software Engineering (ASE) A
droiddev 3.50 Conference 2016 Asia–Pacific Software Engineering Conference (APSEC) B
mcrawlt 3.27 Conference 2016 Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD) C
gat 2.77 Conference 2016 Asia–Pacific Software Engineering Conference (APSEC) B
mobiguitar 3.57 Journal 2015 IEEE Software 2.190
sigdroid 3.00 Symposium 2015 International Symposium on Software Reliability Engineering (ISSRE) A
evodroid 4.00 Symposium 2014 International Symposium on Foundations of Software Engineering (FSE) A*
adautomation 3.73 Conference 2014 International Conference on Software Security and Reliability (SERE) B
swifthand 4.27 Conference 2013 International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA) A*
collider 4.17 Symposium 2013 International Symposium on Software Testing and Analysis (ISSTA) A
a3e 4.00 Conference 2013 International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA) A*
orbit 4.00 Conference 2013 Fundamental Approaches to Software Engineering (FASE) B
dynodroid 3.80 Conference 2013 Joint Meeting on Foundations of Software Engineering (ESEC/FSE) A*
acteve 4.87 Symposium 2012 International Symposium on the Foundations of Software Engineering (FSE) A*
Table 7: List of primary studies included in this systematic review for analysis.

* in this column, we use Unknown as we could not identify publishing venue for the article.

** for conferences, the rank values (A*, A, B, and C) are given in accordance with CORE2018 “portal.core.edu.au/conf-ranks”; for journals, the impact factor (IF) values are given as of February, 2018.

Article Reference Article Type* Year Publishing Venue Rank/IF** Rationale for Exclusion
excludeMUTATION Journal 2014 Information and Software Technology 2.694 mutation testing, non-functional
mobolic15 Journal 2017 Journal of Systems and Software 2.444 non-technical research, comparison framework
excludeMANUAL Journal 2014 Journal of Systems and Software 2.444 manual testing
excludeVOG Journal 2017 IEEE Software 2.190 predominant manual, capture-replay approach
excludeIMPACT Journal 2017 Software Quality Journal 1.816 pattern-based testing, non-functional
word2vec Conference 2017 International Conference on Software Engineering (ICSE) A* textual input generation, non-functional
excludeMOBIPLAY Conference 2016 International Conference on Software Engineering (ICSE) A* predominant manual, capture-replay approach
excludeRERAN Conference 2013 International Conference on Software Engineering (ICSE) A* predominant manual, capture-replay approach
sketch Conference 2017 International Conference on Automated Software Engineering (ASE) A new idea, short communication
excludeAPPCHECK Conference 2017 International Conference on Web Services (ICWS) A predominant manual, capture-replay approach
excludeTESTMINER Conference 2017 International Conference on Automated Software Engineering (ASE) A new idea, short communication
excludeATOM Conference 2017 International Conference on Software Testing, Verification and Validation (ICST) A regression testing, non-functional
androframe7 Conference 2016 International Conference on Software Testing, Verification and Validation (ICST) A non-functional
excludeMONKEYLAB Conference 2015 Working Conference on Mining Software Repositories (MSR) A predominant manual, capture-replay approach
quantum Conference 2014 International Conference on Software Testing, Verification and Validation (ICST) A oracle generation, non-functional
excludeTEMA Conference 2011 International Conference on Software Testing, Verification and Validation (ICST) A non-technical research, industrial case study
amoga Conference 2016 International Conference on Advances in Mobile Computing and Multimedia (MoMM) B unsatisfactory article quality
cadage Conference 2015 Computer Software and Applications Conference (COMPSAC) B unsatisfactory article quality
excludeUGA Conference 2014 Asia-Pacific Software Engineering Conference (APSEC) B predominant manual, capture-replay approach
puma Conference 2014 International Conference on Mobile Systems, Applications, and Services (MobiSys) B predominant manual, user-programmable framework
excludeDROIDBOT Conference 2017 International Conference on Software Engineering Companion (ICSE-C) compatibility testing, non-functional
excludeDROIDMATE Conference 2016 International Conference on Mobile Software Engineering and Systems (MOBILESoft) non-functional
Table 8: List of primary studies excluded from this systematic review.

4.4 Data extraction

The objective of this stage is to collect all the information needed to address the RQs. Tabulating the extracted data is a useful instrument of aggregation, so we designed data extraction forms (tables) to accurately record the information extracted from the primary studies. Prior to finalising the data extraction forms, we piloted them on a sample of primary studies. The pilot helped us assess the completeness, clarity and structure of the data forms slrguide . For the data extraction, we assigned 2 authors. Each author extracted data independently from all 25 selected primary studies. Next, the filled data forms were collected from the authors, and the provided data was compared. Any conceptual mismatches were resolved through discussion with the other 2 authors of this systematic review. We show the data extraction process and consensus formation in Figure 2.

Figure 2: Data extraction: overall process of data extraction, and consensus formation.

We structured our tables to highlight the similarities and differences between the outcomes of the primary studies in one place. In particular, for every RQ of our review, we prepared a separate table which includes the data relevant to that RQ. Below, we show examples of the data extraction forms, which include the headers of the respective tables, Table 9, Table 10, and Table 11. We explain the functional meaning of each column in the tables, based on the definitions provided in utting . First, we describe the columns which are common to all the tables, and then the columns which are specific to each individual table.

Here, we describe common columns for all the tables.

  • “Article Reference”. It shows the reference number assigned to the primary study in this systematic review, so the primary study can be found in the “References” section of this systematic review.

  • “Artefact”. It shows the acronym of the technique, tool, approach, or method discussed in the primary study.

Here, we describe columns which are specific to Table 9.

  • “Model Paradigm”. It describes the nature of the built model. It shows which modelling notations are used to describe the behaviour of the target apps for test generation purposes.

  • “Test Generation Approach”. It shows how tests are derived from a built model. An artefact may also combine several approaches to facilitate the non-trivial task of automated test generation from a model.

  • “Test Generation Criteria”. It defines test criteria which are used to control the generation of tests. These criteria indirectly define properties of the generated test suites, including their fault detection capability, cardinality, and structural complexity.

  • “Textual Input Generation”. It shows which kinds of textual user-inputs can be generated by the artefact. The input generation process itself can be automated or manual, while the input kinds can be random, concrete, predefined, contextual, or others. By default, all the generated inputs are “Automated”. However, if an artefact requires human intervention during the testing process, we indicate it as “Manual”.

  • “Code Coverage Metric”. It shows which code coverage metric is used in the primary study. In this systematic review, we extract data only for the “basic-block” and “line (statement)” metrics. If a primary study does not provide such metrics, we indicate “NA”. We use the “basic-block” and “line (statement)” metrics because they are de facto fundamental means of effectiveness assessment. Other code coverage metrics such as “class”, “method”, and “branch”, or model coverage metrics such as “activity”, “transitions”, “states”, “events”, or “sequences of events”, are excluded since they cannot give an adequate assessment of the artefact effectiveness.

  • “Code Coverage (average value), %”. It shows the average code coverage over the benchmark apps used in a primary study. We computed the average values for the apps reported in the primary studies with the “basic-block” and “line (statement)” code coverage metrics. If a primary study does not provide values for such metrics, we indicate “NA”.

  • “Effectiveness (relative estimation)”. It shows a relative estimation of the effectiveness of the artefact. The relative estimation of effectiveness is based on the average code coverage values extracted from the data provided in the primary studies. We provide more details about the effectiveness estimation later in this section.

Here, we describe columns which are specific to Table 10.

  • “Testing Approach”. It shows which model is used for the automated testing. We identify 2 main models: one is derived from the “GUI” (user interface flow), and the other is derived from the “Code” (source or binary) of the target app.

  • “Testing Technique”. It shows the manner in which an artefact explores the built model. We identify 2 main techniques: “Systematic”, where the artefact implements guided exploration, and “Random”, where the artefact implements a random-based exploration strategy of the built model.

  • “Search Algorithm”. It shows which algorithm is used, or built upon, to guide the exploration process. We identify 2 main algorithms: “Guided”, which uses various model heuristics to guide the search, and “Random”, which is based on uniform random and probabilistic event generation.

  • “Termination Condition”. It shows a condition which is to be satisfied to terminate the testing process. The termination condition can be determined either by the model properties, or manually by the user.

  • “Benchmark Size”. It shows a total number of apps for which the respective data was extracted from a primary study.

  • “Execution Time (per app), mins”. It shows the minimum, maximum, exact, or average time taken per app from the benchmark used in a primary study. If a primary study does not provide execution time values, we indicate “NA”, or we indicate “N/A” if the execution time is not applicable to the artefact.

  • “Efficiency (relative estimation)”. It shows a relative estimation of the efficiency of the artefact. The relative estimation of efficiency is based on the execution time values extracted from the data provided in the primary studies. We provide more details about the efficiency estimation later in this section.

In Table 11, we collate all the numerical data relevant to practicality. This table facilitates RQ#3 by bringing together the data derived from Table 9 and Table 10. We therefore do not repeat the descriptions of the matching columns from Table 9 and Table 10, and only describe the column which is unique to Table 11.

  • “Practicality (relative estimation)”. It shows a relative estimation of the practicality of the artefact. The relative estimation of practicality is based on the averaged effectiveness and efficiency extracted from Table 9 and Table 10, respectively. We provide more details about the practicality estimation later in this section.

Article Artefact Model Test Generation Test Generation Textual Input Code Coverage Code Coverage Effectiveness
Reference Paradigm Approach Criteria Generation Metric (average value), % (relative estimation)
Table 9: Data extraction form for RQ#1 and its secondary questions: Effectiveness estimation.
Article Artefact Testing Testing Search Termination Benchmark Execution Time Efficiency
Reference Approach Technique Algorithm Condition Size (per app), mins (relative estimation)
Table 10: Data extraction form for RQ#2 and its secondary questions: Efficiency estimation.
Article Artefact Code Coverage* Execution Time** Effectiveness* Efficiency** Practicality
reference (average value), % (per app), mins (relative estimation) (relative estimation) (relative estimation)
  • * values for these columns will be taken from the respective columns in Table 9.

  • ** values for these columns will be taken from the respective columns in Table 10.

Table 11: Data extraction form for RQ#3 and its secondary questions: Practicality estimation.

In this systematic review, to answer our RQs, we provide a relative estimation of the effectiveness, efficiency, and practicality of the automated mobile testing techniques. In fact, there is no clear way to identify absolute values for effectiveness, efficiency, and practicality. As such, the estimation we give is relative because it is based solely on the data extracted from the primary studies. We estimate effectiveness and efficiency by comparing the relevant extracted (or deduced) data for each artefact against a specified range of values (intervals); we explain below how the intervals were decided. We also estimate the practicality of each artefact by averaging the relative values of its effectiveness and efficiency.

Authors of different primary studies evaluate their techniques using their own benchmark apps and experimental environments. For example, different primary studies use apps with different code size (lines of code), GUIs, and code complexity. They also use different execution environments (e.g., a physical mobile device or a mobile emulator) and computational resources (e.g., desktop, server, or cloud), and certain techniques may require human participation. All these variations make it difficult to give a uniform estimation of effectiveness, efficiency, and practicality. To address this, we make an assumption based on the observation that the authors of the primary studies choose benchmark apps and set up experimental environments best suited to demonstrating the effectiveness, efficiency, and practicality of their proposed testing techniques. As such, every testing technique is expected to show its best performance results. On this basis, we believe our assumption is reasonable, and thus our relative estimation should reflect the true performance of the testing techniques. However, it is important to note that the extracted numerical data used for the effectiveness, efficiency, and practicality estimation are only representative of the techniques’ performance. As such, they should not be considered absolute performance values which persist across different apps and experimental environments.

Effectiveness

Effectiveness is one of the critical characteristics of model-based testing techniques. It is usually determined by the inferred model, whose quality should be consistent across different apps under test. Based on this, we assume that our relative estimation of effectiveness is consistent as well, and should not vary depending on the apps. For example, with high-quality models, effectiveness should not be affected by the apps’ code size (lines of code), GUI, or code complexity. In practice, however, specific apps may lower effectiveness due to possible incompleteness of the inferred model.

Based on our practical experience and observations, we conditionally determine the following code coverage intervals to estimate the effectiveness of the testing techniques.

  • ●●●●● —  very high (>95%)

  • ●●●● —  high (86%–95%)

  • ●●● —  medium (71%–85%)

  • ●● —  low (51%–70%)

  • ● —  very low (≤50%)

It is important to note that these intervals are only valid for code coverage measured in lines (statements) or basic-blocks. In fact, lines (statements) and basic-blocks are closely related code coverage metrics: one line of code may correspond to several basic-blocks, and conversely, one basic-block may span multiple lines. Based on our observations, the absolute difference in code coverage between lines and basic-blocks is minimal (it varies around 5%), so the defined intervals can be used for both metrics. However, no other code coverage metric should be mapped onto these intervals.
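
To make the mapping concrete, the following sketch (in Python) shows how an average coverage value could be placed on the scale above. It is an illustration only: the function name is ours, and the handling of values falling exactly on an interval boundary is our own assumption rather than part of the review protocol.

```python
def effectiveness_rating(coverage_percent):
    """Map an average line (statement) or basic-block coverage value (0-100)
    onto the five-level relative effectiveness scale defined above."""
    if coverage_percent > 95:
        return "●●●●●"  # very high
    if coverage_percent >= 86:
        return "●●●●"   # high
    if coverage_percent >= 71:
        return "●●●"    # medium
    if coverage_percent >= 51:
        return "●●"     # low
    return "●"          # very low

# Example: Mobolic reports 92% average basic-block coverage, i.e., "high".
print(effectiveness_rating(92))  # ●●●●
```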

Efficiency

Efficiency is another critical characteristic of model-based testing techniques. In practice, it is hard to estimate efficiency because it depends on various factors which are usually not stable. For example, similar to effectiveness, efficiency may vary depending on the apps’ code size, GUI and code complexity, and the experimental environment. In addition, efficiency may depend on human factors such as the user’s programming skills and knowledge, the computational complexity of the implemented technique, search algorithm optimization, system design, and others.

Based on our practical experience and observations, we conditionally determine the following execution time intervals to estimate the efficiency of the testing techniques.

  • ○○○○○ —  very high (<5 mins)

  • ○○○○ —  high (5–10 mins)

  • ○○○ —  medium (11–20 mins)

  • ○○ —  low (21–35 mins)

  • ○ —  very low (>35 mins)

It is important to note that these intervals are only valid for automated testing techniques. The execution time of manual techniques should not be evaluated on these intervals.
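
Analogously, the sketch below places an execution time per app on the efficiency scale above. Again, the function name and the boundary handling are our own illustration.

```python
def efficiency_rating(minutes_per_app):
    """Map an execution time per app (in minutes) onto the five-level
    relative efficiency scale defined above."""
    if minutes_per_app < 5:
        return "○○○○○"  # very high
    if minutes_per_app <= 10:
        return "○○○○"   # high
    if minutes_per_app <= 20:
        return "○○○"    # medium
    if minutes_per_app <= 35:
        return "○○"     # low
    return "○"          # very low

# Example: ORBIT reports 3.1 minutes per app on average, i.e., "very high".
print(efficiency_rating(3.1))  # ○○○○○
```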

Practicality

Practicality is another critical characteristic of model-based testing techniques. Generally, practicality depends on the effectiveness and efficiency of the testing techniques. In this systematic review, we consider effectiveness and efficiency to be of equal importance, so we compute the relative practicality value as the average of effectiveness and efficiency to give a sense of the possible practicality of the testing techniques.

Based on the estimated effectiveness and efficiency, we give a relative estimation of the practicality of each testing technique. For that purpose, we determine 5 base levels to estimate the practicality of the testing techniques as follows.

  • ◐◐◐◐◐ —  very high (likely to be practical)

  • ◐◐◐◐ —  high (may be practical)

  • ◐◐◐ —  medium (could be practical)

  • ◐◐ —  low (may not be practical)

  • ◐ —  very low (unlikely to be practical)

It is important to note that these base levels are only valid for automated testing techniques. The practicality of manual techniques should not be evaluated on these levels.
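
As an illustration of the averaging rule, the following sketch combines the two scales into a practicality level. The half-filled symbol (◖) used for .5 averages mirrors its use in Table 14; the function itself is our own reconstruction of the procedure described above, not code from any primary study.

```python
def practicality_rating(effectiveness, efficiency):
    """Average the effectiveness and efficiency levels (counted as the number
    of filled symbols, 1-5) and render the result on the practicality scale."""
    level = (len(effectiveness) + len(efficiency)) / 2
    whole = int(level)
    half = level - whole > 0
    return "◐" * whole + ("◖" if half else "")

# Example: ORBIT has effectiveness ●●● and efficiency ○○○○○,
# which averages to level 4, i.e., ◐◐◐◐ (high, may be practical).
print(practicality_rating("●●●", "○○○○○"))  # ◐◐◐◐
print(practicality_rating("●●●●", "○○○"))   # DroidDEV: ◐◐◐◖
```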

4.5 Data synthesis

Data synthesis is the process of collating and summarising the results of the included primary studies. Since our selected primary studies are qualitative (i.e., descriptive in nature BRERETON2007571 ), we describe their natural-language results, numerical data, and conclusions. In particular, we use the “Line of argument synthesis” approach noblit1988meta , where we first analyse each primary study individually, and then analyse the primary studies as a whole. For that purpose, we fill Table 12, Table 13, and Table 14 with the data relevant to the RQs extracted from the selected primary studies. For a better visual representation of these large tables, in Table 12 we group the artefacts by “Model Paradigm” and sort each group by the values in “Code Coverage (average value), %” (from highest to lowest); in Table 13 we group the artefacts by “Testing Approach” and sort each group by the values in “Execution Time (per app), mins” (from shortest to longest); in Table 14 we sort and group the artefacts by “Practicality (relative estimation)” (from highest to lowest).
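
This ordering of the result tables is a simple group-and-sort step. A minimal sketch of it is given below, assuming the extracted data were loaded into a pandas DataFrame; the column names mirror the Table 12 headers and the four sample rows are for illustration only.

```python
import pandas as pd

# Illustrative subset of Table 12: artefact, model paradigm, average coverage.
table12 = pd.DataFrame([
    {"Artefact": "Mobolic",   "Model Paradigm": "Transition-based", "Code Coverage (avg), %": 92},
    {"Artefact": "ORBIT",     "Model Paradigm": "Transition-based", "Code Coverage (avg), %": 78},
    {"Artefact": "Dynodroid", "Model Paradigm": "Random-based",     "Code Coverage (avg), %": 55},
    {"Artefact": "Xdroid",    "Model Paradigm": "Random-based",     "Code Coverage (avg), %": 39},
])

# Group by model paradigm, then order each group by coverage (highest first),
# which is the presentation order used in Table 12.
ordered = table12.sort_values(
    by=["Model Paradigm", "Code Coverage (avg), %"],
    ascending=[True, False],
)
print(ordered.to_string(index=False))
```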

While filling each of the tables, we analyse each primary study. We tabulated the extracted data in a manner that helps collate and summarize the results of the primary studies in order to answer our secondary RQs. The answers to the primary RQs are based on the answers to the secondary RQs. However, answers to the primary RQs cannot be found directly in the primary studies, so they are deduced by the authors of this systematic review. Based on the extracted data, we answer our primary RQs in the “Results” section by analysing the primary studies as a whole.

* it is based on the low-level atomic genes, and high-level motif genes.

Article Artefact Model Test Generation Test Generation Textual Input Code Coverage Code Coverage Effectiveness
Reference Paradigm Approach Criteria Generation Metric (average value), % (relative estimation)
dynodroid Dynodroid Random–based Random generation Fault detection Random, Line (statement) 55 ●●
(feedback–directed) Manual
autodroidcomb Autodroid Random–based Random generation Fault detection Random, Basic-block 52 ●●
(Frequency) Predefined
xdroid Xdroid Random–based Random generation Fault detection Random, Line (statement) 39
Manual
stoat Stoat Stochastic Random generation Fault detection Random Line (statement) 61 ●●
(based on Gibbs sampling)
aimdroid AimDroid Stochastic Random generation Fault detection Random NA NA
(based on reinforcement
learning)
androframe AndroFrame Stochastic Random generation Fault detection Random NA NA
(based on Q–Matrix)
mobiguitar MobiGUITAR State–based Model–checking Structural model Random, NA NA
coverage User–predefined
mobolic Mobolic Transition–based Model–checking and Structural model Random, Basic–block 92 ●●●●
constraint solving coverage Concrete,
UI-context-aware,
User-predefined
droiddev DroidDEV Transition–based Model–checking Structural model Random, Basic–block 91 ●●●●
coverage UI-context-aware,
User-predefined
orbit ORBIT Transition–based Model–checking Structural model Random Line (statement) 78 ●●●
coverage
mcrawlt MCrawlT Transition–based Model–checking Structural model Random Line (statement) 65 ●●
coverage
land LAND Transition–based Model–checking Structural model Random Line (statement) 58 ●●
coverage
droidwalker DroidWalker Transition–based Model–checking Structural model Random Line (statement) 57 ●●
coverage
autodroidcomb Autodroid Transition–based Model–checking Structural model Random, Basic-block 57 ●●
(Combinatorial) coverage Predefined
guicc GUICC Transition–based Model–checking Structural model Random Line (statement) 43
coverage
swifthand SwiftHand Transition–based Model–checking Structural model Random, NA NA
coverage Predefined
a3e A3E Transition–based Model–checking Structural model Random NA NA
(Depth–first) coverage
adautomation ADAutomation Transition–based Model–checking Structural model Random NA NA
coverage
gat GAT Transition–based Model–checking Structural model Random NA NA
coverage
sapienz Sapienz Genetic–based* Multi–objective Fault detection Random Line (statement) 53 ●●
search–based
(Pareto–optimal)
evodroid EvoDroid Control–flow Search–based algorithms Structural code Random Line (statement) 81 ●●●
(call graph–based) coverage
sigdroid SIG–Droid Control–flow Symbolic execution and Structural code Random, Line (statement) 78 ●●●
(call graph–based) constraint solving coverage Concrete
patdroid PATDroid Control– and Model–checking Structural code Random Line (statement) 60 ●●
data–flow coverage
(permission–aware)
ehbdroid EHBDroid Control–flow Model–checking Structural code Random Line (statement) 57 ●●
(event handler–based) coverage
a3e A3E Control– and Model–checking Structural model Random NA NA
(Targeted) data–flow coverage
acteve ACTEve Control– and Symbolic execution and Structural code Random, NA NA
data–flow constraint solving coverage Concrete
collider Collider Control–flow Symbolic execution and Structural code Random, NA NA
(call graph–based) constraint solving coverage Concrete
Table 12: Extracted data for RQ#1 and its secondary questions: Effectiveness estimation.

* it combines random and systematic exploration strategies.

Article Artefact Testing Testing Search Termination Benchmark Execution Time Efficiency
Reference Approach Technique Algorithm Condition Size (per app), mins (relative estimation)
sigdroid SIG–Droid Code search–based Systematic Depth–first Model exploration completeness 6 3.5 (average) ○○○○○
patdroid PATDroid Code search–based Systematic Breadth–first Model exploration completeness 10 4.3 (average) ○○○○○
ehbdroid EHBDroid Code search–based Systematic Modified Depth–first Model exploration completeness 35 10 (max) ○○○○
(activity–directed)
xdroid Xdroid Code search–based Random Uniform Random Execution time 8 30 (exact) ○○
(based on Xmonkey)
sapienz Sapienz Multi-objective Hybrid* Multi-objective Execution time 68 60 (exact)
code search–based evolutionary search
(based on NSGA–II)
acteve ACTEve Code search–based Systematic Generational search Depth of model exploration 5 71 (average)
a3e A3E Code search–based Targeted Guide search Model exploration completeness 28 88 (average)
(Targeted) (based on taint–tracking)
collider Collider Code search–based Targeted Breath–first Model exploration completeness 5 180 (min)
evodroid EvoDroid Code search–based Systematic Evolutionary search Model exploration completeness or 10 3440 (average)
(step–wise segmented) User–defined
orbit ORBIT GUI model–based Systematic Modified Depth–first Model exploration completeness 8 3.1 (average) ○○○○○
(forward–crawling)
androframe AndroFrame GUI model–based Random Guided search Execution time 100 10 (exact) ○○○○
(QLearning–based)
droiddev DroidDEV GUI model–based Systematic Best–first (informed search) Model exploration completeness 20 16 (average) ○○○
mobolic Mobolic GUI model–based Systematic A* (informed search) Model exploration completeness 10 22 (average) ○○
mcrawlt MCrawlT GUI model–based Systematic Guided search Model exploration completeness 30 43 (average)
(based on Backtracking)
gat GAT GUI model–based Systematic Modified Depth–first Model exploration completeness 9 45 (average)
(gesture-guided)
droidwalker DroidWalker GUI model–based Systematic Depth–first Execution time 20 60 (exact)
aimdroid AimDroid GUI model–based Systematic Breadth–first Execution time 50 60 (exact)
a3e A3E GUI model–based Systematic Depth–first Model exploration completeness 28 104 (average)
(Depth–first)
autodroidcomb Autodroid GUI model–based Systematic Combinatorial search Execution time 10 120 (exact)
(Combinatorial) (based on greedy algorithm)
autodroidcomb Autodroid GUI model–based Random Modified Random Execution time 10 120 (exact)
(Frequency) (frequency–based)
stoat Stoat GUI model–based Random Guided search Execution time 93 180 (exact)
(based on Markov Chain Monte Carlo sampling)
swifthand SwiftHand GUI model–based Systematic Guided search Execution time 10 180 (exact)
(based on passive learning)
guicc GUICC GUI model–based Systematic Breadth–first Model exploration completeness 10 180 (max)
land LAND GUI model–based Systematic Breadth–first Model exploration completeness 5 180 (max)
adautomation ADAutomation GUI model–based Systematic Depth–first Depth of model exploration 2 1170 (average)
mobiguitar MobiGUITAR GUI model–based Systematic Breath–first Model exploration completeness 4 NA
dynodroid Dynodroid GUI model–based Random Biased Random Number of events 50 N/A
(history–based)
Table 13: Extracted data for RQ#2 and its secondary questions: Efficiency estimation.
Article Artefact Code Coverage Execution Time Effectiveness Efficiency Practicality
reference (average value), % (per app), mins (relative estimation) (relative estimation) (relative estimation)
orbit ORBIT 78 3.1 (average) ●●● ○○○○○ ◐◐◐◐
sigdroid SIG–Droid 78 3.5 (average) ●●● ○○○○○ ◐◐◐◐
droiddev DroidDEV 91 16 (average) ●●●● ○○○ ◐◐◐◖
patdroid PATDroid 60 4.3 (average) ●● ○○○○○ ◐◐◐◖
mobolic Mobolic 92 22 (average) ●●●● ○○ ◐◐◐
ehbdroid EHBDroid 57 10 (max) ●● ○○○○ ◐◐◐
evodroid EvoDroid 81 3440 (average) ●●● ◐◐
mcrawlt MCrawlT 65 43 (average) ●● ◐◖
stoat Stoat 61 180 (exact) ●● ◐◖
land LAND 58 180 (max) ●● ◐◖
droidwalker DroidWalker 57 60 (exact) ●● ◐◖
autodroidcomb Autodroid 57 120 (exact) ●● ◐◖
(Combinatorial)
sapienz Sapienz 53 60 (exact) ●● ◐◖
autodroidcomb Autodroid 52 120 (exact) ●● ◐◖
(Frequency)
xdroid Xdroid 39 30 (exact) ○○ ◐◖
guicc GUICC 43 180 (max)
dynodroid Dynodroid 55 N/A ●●
androframe AndroFrame NA 10 (exact) ○○○○
gat GAT NA 45 (average)
aimdroid AimDroid NA 60 (exact)
acteve ACTEve NA 71 (average)
a3e A3E NA 88 (average)
(Targeted)
a3e A3E NA 104 (average)
(Depth–first)
swifthand SwiftHand NA 180 (exact)
collider Collider NA 180 (min)
adautomation ADAutomation NA 1170 (average)
mobiguitar MobiGUITAR NA NA
Table 14: Extracted data for RQ#3 and its secondary questions: Practicality estimation.

5 Results

In this systematic review, we specify 3 primary RQs regarding the effectiveness, efficiency, and practicality of automated GUI testing techniques for mobile apps. In this section, we answer the primary RQs by discussing the data extracted from the primary studies. We use the resultant data collated and summarized in Table 12, Table 13, and Table 14 in Section 4.5.

Effectiveness

Effectiveness is one of the important characteristics of automated testing techniques. For automated testing techniques, effectiveness is usually assessed through the coverage achieved upon completion of test execution. In particular, code coverage metrics are a useful means of effectiveness assessment for automated testing techniques. However, it is practically impossible to give an absolute estimation of effectiveness since the testing techniques are evaluated on different data sets and in different testing environments. Thus, in this systematic review, we provide a relative estimation of effectiveness. To estimate effectiveness, we unify the results of the primary studies using code coverage metrics, averaging their code coverage results with respect to the data sets used for the experiments.

We identify 4 conceptual characteristics of the automated testing tools which may affect effectiveness: model paradigm, test generation approach, test generation criteria, and textual input generation. From our observations of the extracted data, we conclude that the test generation approach plays the dominant role in effectiveness. It largely determines the model paradigm and test generation criteria, which subsequently impact the effectiveness of the testing technique. The remaining characteristic, textual input generation, also affects effectiveness, and its impact varies with the model paradigm. For example, random-based and stochastic models are less likely to be affected due to their random nature, while deterministic models, such as transition-based or control- and data-flow-based ones, could be affected severely.

From our relative estimation of effectiveness, we conclude that testing techniques which implement model-checking, symbolic execution, constraint solving, and search-based test generation approaches tend to be more effective than those implementing random test generation. This could be explained by the fact that more sophisticated test generation approaches are likely to be more effective since they usually exploit heuristics of the built model to generate high-coverage tests.

We observe that random test generation approaches tend to use fault detection as a test generation criterion, while the others mainly focus on structural model or code coverage. This could be explained by the random nature of these testing approaches. In fact, randomly generated tests are likely to expose more bugs due to their unexpected (random) nature, while systematic ones may not hit the same number of bugs unless their coverage is 100%. This is a main reason why systematic testing approaches focus on increasing structural model or code coverage, aiming at 100%. Otherwise, random techniques will retain first place as the state of the art and practice in automated functional testing.

We also observe that more sophisticated textual input generation mechanisms help to improve the effectiveness of automated testing techniques. This could be explained by the fact that mobile apps are highly interactive and are crafted to be used by humans, not machines. As such, many app behaviours depend heavily on user inputs, textual ones in particular. Currently, such app behaviours cannot be effectively tested by automated techniques due to the lack of adequate automated textual input generation.

For several techniques, we were not able to identify effectiveness because the code coverage metrics on which our estimation is based were not available. However, we believe that their effectiveness could be the same as, or similar to, that of techniques with equivalent characteristics.

Efficiency

Efficiency is another important characteristic of automated testing techniques. For automated testing techniques, efficiency is usually assessed through the execution time required for the testing process to complete. However, the efficiency of automated testing techniques may vary depending on various factors such as the experimental data sets, the execution environment, and the computational power used for the experiments. As such, it is practically impossible to give an absolute estimation of efficiency. So, in this systematic review, we give a relative estimation of the efficiency of the automated testing techniques. To estimate efficiency, we extract from the primary studies the execution time per app with respect to the data sets used for the experiments. We express efficiency in minutes, giving the exact, average, minimum, or maximum time taken per app depending on the experimental set-up.

We identify 4 conceptual characteristics of the automated testing tools which may affect efficiency: testing approach, testing technique, search algorithm, and termination condition. From our observations of the extracted data, we conclude that the testing approach plays the dominant role in efficiency. The search algorithm largely determines the testing technique, which may subsequently impact efficiency to a different extent. For example, sophisticated search algorithms may require more complex implementations which eventually slow down the overall testing technique, while simple techniques, such as random ones, are unlikely to impact efficiency due to their simplicity. The termination condition also affects efficiency, and is itself largely determined by the testing technique. Depending on the testing technique, the termination condition can be derived automatically from the conditions of the constructed model during runtime, or it can be specified manually by the user to indicate when the testing procedure should stop. For example, systematic testing techniques tend to use automatic termination conditions, while random ones usually rely on user-specified termination conditions.

From our relative estimation of efficiency, we conclude that testing techniques which implement code search-based testing approaches tend to be more efficient than those implementing GUI model-based ones. This could be explained by the fact that code search-based testing approaches generate tests guided by simpler and more compact models inferred from the app code rather than the app GUI. In fact, GUI models can be considerably more complex and larger in size than those derived from the app code.

We observe that systematic testing techniques tend to be less efficient than random ones. This could be explained by the fact that systematic techniques require more advanced search algorithms to implement guided search, and use various heuristics of the inferred model to guide the testing process, while random ones do not require any heuristics and simply drive the testing process by sampling random actions from a uniform distribution.

We also observe that techniques relying on a user-specified termination condition are less efficient than those whose termination condition is derived automatically from the model properties. This could be explained by the fact that if the user has to specify a termination condition, it usually implies that the testing approach is not capable of constructing a model of sufficient quality or determinism for the termination condition to be derived from the model properties during runtime. Conversely, if the termination condition is derived automatically from the model properties, the constructed models are likely of high quality and deterministic enough for the model properties to determine automatically when to stop the testing process.

For several techniques, we were not able to identify efficiency either because the execution time on which our estimation is based was not available, or because the technique does not specify its termination condition via execution time but rather via the number of events to be injected during the testing process; once all the events have been injected, the testing process terminates. However, we believe that their efficiency could be the same as, or similar to, that of techniques with equivalent characteristics.

Practicality

Practicality is another important characteristic of automated testing techniques. In general, practicality depends on effectiveness and efficiency. As such, similar to effectiveness and efficiency, practicality may vary depending on the actual application of the technique. Naturally, any user wishes to use an automated testing technique with both high effectiveness and high efficiency. However, this is practically impossible since there is no single best technique suited to all testing purposes.

Based on the estimated effectiveness and efficiency, we observe that the more effective a testing technique is, the less efficient it tends to be. There appears to be a trade-off between effectiveness and efficiency, so the user should identify the purpose of testing and select the testing technique most suitable for that purpose. For example, if a target app has a highly complex GUI, it is a better fit to use tools which implement (1) guided random and/or (2) stochastic model-based GUI testing techniques, which can still achieve an acceptable level of code coverage within a reasonable time. On the other hand, if the app has simple to medium GUI complexity, the user should choose one of the GUI model-based techniques which perform guided systematic exploration and commonly achieve high code coverage (depending, of course, on the actual inferred model [size, quality] and its handling) within a reasonable time.

In practice, complex GUIs can also be tested by systematic GUI model-based testing techniques, not only random ones. However, with systematic exploration, the overall exercising time usually grows exponentially, which is impractical in most cases (unless the highest possible code coverage is the main test target). Also, if both the app code and the GUI are highly complex, the user may choose a tool which implements a uniform random GUI testing technique, since any guided testing technique may take a very long time to complete, which could be impractical.

6 Threats to Validity

In this section, we discuss the main threats to the validity of systematic reviews. Such threats should be minimized in order to increase the quality of a systematic review. We identify 5 threats which usually affect systematic reviews.

Search strategy

The search strategy, including the search string and search resources, may affect a systematic review by not covering all primary studies conducted in the field. In practice, there is always a chance that certain primary studies are not found. In fact, it is practically impossible to find all primary studies relevant to the topic of interest; however, we did our best to design a search strategy which finds as many related articles as possible.

Publication bias

To minimize possible publication bias, we conducted an independent pilot review of the pre-selected primary studies to identify possible duplication of results, proposed approaches, implemented techniques, or ideas which may bias the conclusions and results of the systematic review. The identified duplicates were excluded from this systematic review; however, there remains a possibility that certain studies still overlap, introducing some bias into our results and conclusions.

Human bias

To minimize possible human bias, and to reach objective rather than subjective conclusions about the primary studies and the quality of their data, we conducted the review with the help of several people, namely the authors of this systematic review. Involving several individuals helps to avoid subjective personal opinions, keeping our systematic review grounded in the facts and evidence described in the primary studies rather than in personal opinions.

Quality assessment

In our systematic review, we intend to use only high-quality primary studies, or at least studies of satisfactory quality. However, there is no established metric which identifies the level of quality of a primary study. So, to assess the quality of the primary studies more objectively, we developed our own criteria and scales. We believe that our estimation approach gives sufficient grounds to be trusted, since there is also no evidence against it. Since we included in this systematic review only those studies which satisfy our selection criteria, and excluded the others, there is a possibility that certain primary studies of satisfactory quality were missed.

Primary studies results validity

During our study selection process, we found that many studies lack a discussion of the threats to the validity of their techniques and results. As such, we were not able to assess their validity directly. To make an assessment possible, we rely on 2 aspects described in the primary studies: the research design and the discussion of results. These 2 aspects are commonly well written and discussed, giving us a certain confidence about possible threats to the validity of the results. To deduce possible threats to validity from the research design and the results discussion in the primary studies, we rely on our own knowledge of the topic. This approach allows us to estimate how severe the threats could be, so that we can predict to what extent they may or may not impact the reported results.

7 Conclusion

From effectiveness, we observe that the existing automated testing techniques are not effective enough; currently they achieve nearly half of the desired level of effectiveness. As such, there is still a gap which requires further research to improve existing techniques, or to develop conceptually new test generation approaches, in order to improve effectiveness. In addition, we highlight automated textual input generation as another area for improvement. Our observations show that most of the techniques currently use random text generation, which may significantly impact the desired effectiveness, especially for mobile apps. However, automated generation of relevant inputs is not a trivial task, which is why it is still an evolving field.

From efficiency, we observe that current automated testing techniques are not efficient enough. In general, they provide medium-to-low efficiency, requiring more than 30 minutes per app. Certain techniques even require several hours per app to explore app functionality more or less adequately. As such, there is still a large gap which requires further research to improve the efficiency of automated testing techniques. In addition, we highlight the construction of compact yet high-quality models, reducing their size and complexity, as another area for improvement. Existing approaches usually infer the models “as is”, which commonly results in large and complex models. Even when their quality is high, the exploration time needed to fully traverse such large models may grow exponentially, since existing search algorithms are not well adapted to models whose exploration state space could be practically infinite.

From practicality, we observe that only nearly half of the existing tools could be used in practice, while the others are not practical due to their low effectiveness and efficiency. As such, most of the automated testing tools are unlikely to be used in practice, while even the tools most likely to be practical may lack high performance due to a gap induced by either low effectiveness or low efficiency. So, there remains a difficult, unresolved practical problem which requires further investigation towards increasing the practicality of automated testing techniques by simultaneously improving their effectiveness and efficiency.

References

  • (1) A. Fink, Conducting research literature reviews: from the Internet to paper, Sage Publications, 2013.
  • (2) M. Petticrew, H. Roberts, Systematic reviews in the social sciences: A practical guide, John Wiley & Sons, 2008.
  • (3) B. Kitchenham, Procedures for performing systematic reviews, Keele, UK, Keele University 33 (2004) (2004) 1–26.
  • (4) B. Kitchenham, S. Charters, Guidelines for Performing Systematic Literature Reviews in Software Engineering, in: Technical Report EBSE-2007-01, 2007.
  • (5) K. Khan, R. Kunz, J. Kleijnen, G. Antes, Systematic reviews to support evidence-based medicine, Crc Press, 2011.
  • (6) B. Kitchenham, R. Pretorius, D. Budgen, O. P. Brereton, M. Turner, M. Niazi, S. Linkman, Systematic literature reviews in software engineering – A tertiary study, Information and Software Technology 52 (8) (2010) 792 – 805. doi:https://doi.org/10.1016/j.infsof.2010.03.006.
    URL http://www.sciencedirect.com/science/article/pii/S0950584910000467
  • (7) A. Méndez-Porras, C. Quesada-López, M. Jenkins, Automated testing of mobile applications: A systematic map and review, in: XVIII Ibero-American Conference on Software Engineering, Lima-Peru, 2015, pp. 195–208.
  • (8) K. Holl, F. Elberzhager, Quality Assurance of Mobile Applications: A Systematic Mapping Study, in: Proceedings of the 15th International Conference on Mobile and Ubiquitous Multimedia, MUM ’16, ACM, New York, NY, USA, 2016, pp. 101–113. doi:10.1145/3012709.3012718.
    URL http://doi.acm.org.ezlibproxy1.ntu.edu.sg/10.1145/3012709.3012718
  • (9) M. Sahinoglu, K. Incki, M. S. Aktas, Mobile Application Verification: A Systematic Mapping Study, in: O. Gervasi, B. Murgante, S. Misra, M. L. Gavrilova, A. M. A. C. Rocha, C. Torre, D. Taniar, B. O. Apduhan (Eds.), Computational Science and Its Applications – ICCSA 2015, Springer International Publishing, Cham, 2015, pp. 147–163.
  • (10) S. Zein, N. Salleh, J. Grundy, A systematic mapping study of mobile application testing techniques, Journal of Systems and Software 117 (2016) 334 – 356. doi:https://doi.org/10.1016/j.jss.2016.03.065.
    URL http://www.sciencedirect.com/science/article/pii/S0164121216300140
  • (11) M. Janicki, M. Katara, T. Pääkkönen, Obstacles and opportunities in deploying model-based GUI testing of mobile software: a survey, Software Testing, Verification and Reliability 22 (5) 313–341. arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/stvr.460, doi:10.1002/stvr.460.
    URL https://onlinelibrary.wiley.com/doi/abs/10.1002/stvr.460
  • (12) M. Shafique, Y. Labiche, Guidelines for performing Systematic Literature Reviews in Software Engineering, in: Technical Report EBSE-2007-01, University of Durham, UK, 2007.
    URL https://ees.elsevier.com/infsof/img/525444systematicreviewsguide.pdf
  • (13) G. W. Noblit, R. D. Hare, Meta-ethnography: Synthesizing qualitative studies, Vol. 11, sage, 1988.
  • (14) R. C. Bryce, S. Sampath, A. M. Memon, Developing a Single Model and Test Prioritization Strategies for Event-Driven Software, IEEE Transactions on Software Engineering 37 (1) (2011) 48–64. doi:10.1109/TSE.2010.12.
  • (15) N. Chaudhary, O. Sangwan, Metrics for event driven software, Int. J. Adv. Comput. Sci. Appl.(IJACSA) 7 (1) (2016) 85–89.
  • (16) S. Herbold, P. Harms, AutoQUEST – Automated Quality Engineering of Event-Driven Software, in: 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation Workshops, 2013, pp. 134–139. doi:10.1109/ICSTW.2013.23.
  • (17) S. M. McKain, U. Lobo, J. W. Saunders, Graphical user interface testing, uS Patent App. 11/805,295 (November 2008).
  • (18) R. Mahajan, B. Shneiderman, Visual and textual consistency checking tools for graphical user interfaces, IEEE Transactions on Software Engineering 23 (11) (1997) 722–735. doi:10.1109/32.637386.
  • (19) A. M. Memon, GUI testing: Pitfalls and process, Computer 35 (8) (2002) 87–88.
  • (20) B. A. Myers, User Interface Software Tools, ACM Trans. Comput.-Hum. Interact. 2 (1) (1995) 64–103. doi:10.1145/200968.200971.
    URL http://doi.acm.org/10.1145/200968.200971
  • (21) M. J. Harrold, Testing: A Roadmap, in: Proceedings of the Conference on The Future of Software Engineering, ICSE ’00, ACM, New York, NY, USA, 2000, pp. 61–72. doi:10.1145/336512.336532.
    URL http://doi.acm.org/10.1145/336512.336532
  • (22) S. McConnell, Daily build and smoke test, IEEE software 13 (4) (1996) 144.
  • (23) A. M. Memon, M. E. Pollack, M. L. Soffa, Hierarchical GUI test case generation using automated planning, IEEE Transactions on Software Engineering 27 (2) (2001) 144–155. doi:10.1109/32.908959.
  • (24) X. Yuan, A. M. Memon, Generating Event Sequence-Based Test Cases Using GUI Runtime State Feedback, IEEE Transactions on Software Engineering 36 (1) (2010) 81–95. doi:10.1109/TSE.2009.68.
  • (25) X. Yuan, A. M. Memon, Using GUI Run-Time State As Feedback to Generate Test Cases, in: Proceedings of the 29th International Conference on Software Engineering, ICSE ’07, IEEE Computer Society, Washington, DC, USA, 2007, pp. 396–405. doi:10.1109/ICSE.2007.94.
    URL http://dx.doi.org/10.1109/ICSE.2007.94
  • (26) M. Utting, A. Pretschner, B. Legeard, A taxonomy of model-based testing approaches, Software Testing, Verification and Reliability 22 (5) 297–312. arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/stvr.456, doi:10.1002/stvr.456.
    URL https://onlinelibrary.wiley.com/doi/abs/10.1002/stvr.456
  • (27) M. Utting, B. Legeard, Practical model-based testing: a tools approach, Morgan Kaufmann, 2010.
  • (28) I. K. El-Far, J. A. Whittaker, Model-Based Software Testing, American Cancer Society, 2002. arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/0471028959.sof207, doi:10.1002/0471028959.sof207.
    URL https://onlinelibrary.wiley.com/doi/abs/10.1002/0471028959.sof207
  • (29) M. Broy, B. Jonsson, J.-P. Katoen, M. Leucker, A. Pretschner, Model-based testing of reactive systems, in: Volume 3472 of Springer LNCS, Springer, 2005.
  • (30) A. Pretschner, Model-based testing, in: Proceedings. 27th International Conference on Software Engineering, 2005. ICSE 2005., 2005, pp. 722–723. doi:10.1109/ICSE.2005.1553582.
  • (31) S. R. Dalal, A. Jain, N. Karunanithi, J. M. Leaton, C. M. Lott, G. C. Patton, B. M. Horowitz, Model-based Testing in Practice, in: Proceedings of the 21st International Conference on Software Engineering, ICSE ’99, ACM, New York, NY, USA, 1999, pp. 285–294. doi:10.1145/302405.302640.
    URL http://doi.acm.org/10.1145/302405.302640
  • (32) K. Sen, Concolic Testing, in: Proceedings of the Twenty-second IEEE/ACM International Conference on Automated Software Engineering, ASE ’07, ACM, New York, NY, USA, 2007, pp. 571–572. doi:10.1145/1321631.1321746.
    URL http://doi.acm.org/10.1145/1321631.1321746
  • (33) R. Majumdar, K. Sen, Hybrid Concolic Testing, in: Proceedings of the 29th International Conference on Software Engineering, ICSE ’07, IEEE Computer Society, Washington, DC, USA, 2007, pp. 416–426. doi:10.1109/ICSE.2007.41.
    URL http://dx.doi.org/10.1109/ICSE.2007.41
  • (34) D. Paqué, From symbolic execution to concolic testing.
  • (35) R. Baldoni, E. Coppa, D. C. D’Elia, C. Demetrescu, I. Finocchi, A survey of symbolic execution techniques, arXiv preprint arXiv:1610.00502.
  • (36) P. McMinn, Search-Based Software Testing: Past, Present and Future, in: 2011 IEEE Fourth International Conference on Software Testing, Verification and Validation Workshops, 2011, pp. 153–163. doi:10.1109/ICSTW.2011.100.
  • (37) J. Wegener, A. Baresel, H. Sthamer, Evolutionary test environment for automatic structural testing, Information and Software Technology 43 (14) (2001) 841 – 854. doi:https://doi.org/10.1016/S0950-5849(01)00190-2.
    URL http://www.sciencedirect.com/science/article/pii/S0950584901001902
  • (38) S. Wappler, F. Lammermann, Using Evolutionary Algorithms for the Unit Testing of Object-oriented Software, in: Proceedings of the 7th Annual Conference on Genetic and Evolutionary Computation, GECCO ’05, ACM, New York, NY, USA, 2005, pp. 1053–1060. doi:10.1145/1068009.1068187.
    URL http://doi.acm.org/10.1145/1068009.1068187
  • (39) D. Adamo, D. Nurmuradov, S. Piparia, R. Bryce, Combinatorial-based event sequence testing of Android applications, Information and Software Technology doi:https://doi.org/10.1016/j.infsof.2018.03.007.
    URL http://www.sciencedirect.com/science/article/pii/S0950584918300429
  • (40) P. Brereton, B. A. Kitchenham, D. Budgen, M. Turner, M. Khalil, Lessons from applying the systematic literature review process within the software engineering domain, Journal of Systems and Software 80 (4) (2007) 571 – 583, software Performance. doi:https://doi.org/10.1016/j.jss.2006.07.009.
    URL http://www.sciencedirect.com/science/article/pii/S016412120600197X
  • (41) K. S. Khan, G. Ter Riet, J. Glanville, A. J. Sowden, J. Kleijnen, et al., Undertaking systematic reviews of research on effectiveness: CRD’s guidance for carrying out or commissioning reviews, no. 4 (2n, NHS Centre for Reviews and Dissemination, 2001.
  • (42) Cochrane Collaboration. Cochrane Reviewers’ Handbook, 2003.
  • (43) Y. Koroglu, A. Sen, O. Muslu, Y. Mete, C. Ulker, T. Tanriverdi, Y. Donmez, QBE: QLearning-Based Exploration of Android Applications.
  • (44) Y. L. Arnatovich, L. Wang, N. M. Ngo, C. Soh, Mobolic: An automated approach to exercising mobile application GUIs using symbiosis of online testing technique and customated input generation, Software: Practice and Experience 0 (0). arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/spe.2564, doi:10.1002/spe.2564.
    URL https://onlinelibrary.wiley.com/doi/abs/10.1002/spe.2564
  • (45) T. Su, G. Meng, Y. Chen, K. Wu, W. Yang, Y. Yao, G. Pu, Y. Liu, Z. Su, Guided, Stochastic Model-based GUI Testing of Android Apps, in: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2017, ACM, New York, NY, USA, 2017, pp. 245–256. doi:10.1145/3106237.3106298.
    URL http://doi.acm.org/10.1145/3106237.3106298
  • (46) T. Gu, C. Cao, T. Liu, C. Sun, J. Deng, X. Ma, J. Lü, AimDroid: Activity-Insulated Multi-level Automated Testing for Android Applications, in: 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), 2017, pp. 103–114. doi:10.1109/ICSME.2017.72.
  • (47) W. Song, X. Qian, J. Huang, EHBDroid: Beyond GUI Testing for Android Applications, in: Proceedings of the 32Nd IEEE/ACM International Conference on Automated Software Engineering, ASE 2017, IEEE Press, Piscataway, NJ, USA, 2017, pp. 27–37.
    URL http://dl.acm.org/citation.cfm?id=3155562.3155570
  • (48) A. Sadeghi, R. Jabbarvand, S. Malek, PATDroid: Permission-aware GUI Testing of Android, in: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2017, ACM, New York, NY, USA, 2017, pp. 220–232. doi:10.1145/3106237.3106250.
    URL http://doi.acm.org/10.1145/3106237.3106250
  • (49) C. Cao, C. Meng, H. Ge, P. Yu, X. Ma, Xdroid: Testing Android Apps with Dependency Injection, in: 2017 IEEE 41st Annual Computer Software and Applications Conference (COMPSAC), Vol. 1, 2017, pp. 214–223. doi:10.1109/COMPSAC.2017.268.
  • (50) J. Yan, T. Wu, J. Yan, J. Zhang, Widget-Sensitive and Back-Stack-Aware GUI Exploration for Testing Android Apps, in: 2017 IEEE International Conference on Software Quality, Reliability and Security (QRS), 2017, pp. 42–53. doi:10.1109/QRS.2017.14.
  • (51) Z. Hu, Y. Ma, Y. Huang, DroidWalker: Generating Reproducible Test Cases via Automatic Exploration of Android Apps, CoRR abs/1710.08562.
  • (52) K. Mao, M. Harman, Y. Jia, Sapienz: Multi-objective Automated Testing for Android Applications, in: Proceedings of the 25th International Symposium on Software Testing and Analysis, ISSTA 2016, ACM, New York, NY, USA, 2016, pp. 94–105. doi:10.1145/2931037.2931054.
    URL http://doi.acm.org/10.1145/2931037.2931054
  • (53) Y.-M. Baek, D.-H. Bae, Automated Model-based Android GUI Testing Using Multi-level GUI Comparison Criteria, in: Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, ASE 2016, ACM, New York, NY, USA, 2016, pp. 238–249. doi:10.1145/2970276.2970313.
    URL http://doi.acm.org/10.1145/2970276.2970313
  • (54) Y. L. Arnatovich, M. N. Ngo, T. H. B. Kuan, C. Soh, Achieving High Code Coverage in Android UI Testing via Automated Widget Exercising, in: 2016 23rd Asia-Pacific Software Engineering Conference (APSEC), 2016, pp. 193–200. doi:10.1109/APSEC.2016.036.
  • (55) S. Salva, P. Laurençot, S. R. Zafimiharisoa, Model Inference of Mobile Applications with Dynamic State Abstraction, in: R. Lee (Ed.), Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing 2015, Springer International Publishing, Cham, 2016, pp. 177–193.
  • (56) X. Wu, Y. Jiang, C. Xu, C. Cao, X. Ma, J. Lu, Testing Android Apps via Guided Gesture Event Generation, in: 2016 23rd Asia-Pacific Software Engineering Conference (APSEC), 2016, pp. 201–208. doi:10.1109/APSEC.2016.037.
  • (57) D. Amalfitano, A. R. Fasolino, P. Tramontana, B. D. Ta, A. M. Memon, MobiGUITAR: Automated Model-Based Testing of Mobile Apps, IEEE Software 32 (5) (2015) 53–59. doi:10.1109/MS.2014.55.
  • (58) N. Mirzaei, H. Bagheri, R. Mahmood, S. Malek, SIG-Droid: Automated system input generation for Android applications, in: 2015 IEEE 26th International Symposium on Software Reliability Engineering (ISSRE), 2015, pp. 461–471. doi:10.1109/ISSRE.2015.7381839.
  • (59) R. Mahmood, N. Mirzaei, S. Malek, EvoDroid: Segmented Evolutionary Testing of Android Apps, in: Proceedings of the 22Nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, ACM, New York, NY, USA, 2014, pp. 599–609. doi:10.1145/2635868.2635896.
    URL http://doi.acm.org/10.1145/2635868.2635896
  • (60) A. Li, Z. Qin, M. Chen, J. Liu, ADAutomation: An Activity Diagram Based Automated GUI Testing Framework for Smartphone Applications, in: 2014 Eighth International Conference on Software Security and Reliability (SERE), 2014, pp. 68–77. doi:10.1109/SERE.2014.20.
  • (61) W. Choi, G. Necula, K. Sen, Guided GUI Testing of Android Apps with Minimal Restart and Approximate Learning, in: Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA’13, ACM, New York, NY, USA, 2013, pp. 623–640. doi:10.1145/2509136.2509552.
    URL http://doi.acm.org/10.1145/2509136.2509552
  • (62) C. S. Jensen, M. R. Prasad, A. Møller, Automated Testing with Targeted Event Sequence Generation, in: Proceedings of the 2013 International Symposium on Software Testing and Analysis, ISSTA 2013, ACM, New York, NY, USA, 2013, pp. 67–77. doi:10.1145/2483760.2483777.
    URL http://doi.acm.org/10.1145/2483760.2483777
  • (63) T. Azim, I. Neamtiu, Targeted and Depth-first Exploration for Systematic Testing of Android Apps, in: Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA’13, ACM, New York, NY, USA, 2013, pp. 641–660. doi:10.1145/2509136.2509549.
    URL http://doi.acm.org/10.1145/2509136.2509549
  • (64) W. Yang, M. R. Prasad, T. Xie, A Grey-Box Approach for Automated GUI-Model Generation of Mobile Applications, in: V. Cortellessa, D. Varró (Eds.), Fundamental Approaches to Software Engineering, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 250–265.
  • (65) A. Machiry, R. Tahiliani, M. Naik, Dynodroid: An Input Generation System for Android Apps, in: Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2013, ACM, New York, NY, USA, 2013, pp. 224–234. doi:10.1145/2491411.2491450.
    URL http://doi.acm.org/10.1145/2491411.2491450
  • (66) S. Anand, M. Naik, M. J. Harrold, H. Yang, Automated Concolic Testing of Smartphone Apps, in: Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, FSE ’12, ACM, New York, NY, USA, 2012, pp. 59:1–59:11. doi:10.1145/2393596.2393666.
    URL http://doi.acm.org/10.1145/2393596.2393666
  • (67) L. Deng, J. Offutt, P. Ammann, N. Mirzaei, Mutation operators for testing Android apps, Information and Software Technology 81 (2017) 154–168. doi:10.1016/j.infsof.2016.04.012.
    URL http://www.sciencedirect.com/science/article/pii/S0950584916300684
  • (68) D. Amalfitano, N. Amatucci, A. M. Memon, P. Tramontana, A. R. Fasolino, A general framework for comparing automatic testing techniques of Android mobile apps, Journal of Systems and Software 125 (2017) 322–343. doi:10.1016/j.jss.2016.12.017.
    URL http://www.sciencedirect.com/science/article/pii/S016412121630259X
  • (69) W. Yang, Z. Chen, Z. Gao, Y. Zou, X. Xu, GUI testing assisted by human knowledge: Random vs. functional, Journal of Systems and Software 89 (2014) 76–86. doi:10.1016/j.jss.2013.09.043.
    URL http://www.sciencedirect.com/science/article/pii/S0164121213002392
  • (70) C. W. Hsu, S. H. Lee, S. W. Shieh, Adaptive Virtual Gestures for GUI Testing on Smartphones, IEEE Software 34 (5) (2017) 22–29. doi:10.1109/MS.2017.3641115.
  • (71) I. C. Morgado, A. C. R. Paiva, Mobile GUI testing, Software Quality Journal. doi:10.1007/s11219-017-9387-1.
    URL https://doi.org/10.1007/s11219-017-9387-1
  • (72) P. Liu, X. Zhang, M. Pistoia, Y. Zheng, M. Marques, L. Zeng, Automatic Text Input Generation for Mobile Testing, in: 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), 2017, pp. 643–653. doi:10.1109/ICSE.2017.65.
  • (73) Z. Qin, Y. Tang, E. Novak, Q. Li, MobiPlay: A Remote Execution Based Record-and-replay Tool for Mobile Applications, in: Proceedings of the 38th International Conference on Software Engineering, ICSE ’16, ACM, New York, NY, USA, 2016, pp. 571–582. doi:10.1145/2884781.2884854.
    URL http://doi.acm.org/10.1145/2884781.2884854
  • (74) L. Gomez, I. Neamtiu, T. Azim, T. Millstein, RERAN: Timing- and touch-sensitive record and replay for Android, in: 2013 35th International Conference on Software Engineering (ICSE), 2013, pp. 72–81. doi:10.1109/ICSE.2013.6606553.
  • (75) C. Zhang, H. Cheng, E. Tang, X. Chen, L. Bu, X. Li, Sketch-guided GUI Test Generation for Mobile Applications, in: Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, ASE 2017, IEEE Press, Piscataway, NJ, USA, 2017, pp. 38–43.
    URL http://dl.acm.org/citation.cfm?id=3155562.3155571
  • (76) G. Wu, Y. Cao, W. Chen, J. Wei, H. Zhong, T. Huang, AppCheck: A Crowdsourced Testing Service for Android Applications, in: 2017 IEEE International Conference on Web Services (ICWS), 2017, pp. 253–260. doi:10.1109/ICWS.2017.40.
  • (77) L. Della Toffola, C. A. Staicu, M. Pradel, Saying 'Hi!' is Not Enough: Mining Inputs for Effective Test Generation, in: Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, ASE 2017, IEEE Press, Piscataway, NJ, USA, 2017, pp. 44–49.
    URL http://dl.acm.org/citation.cfm?id=3155562.3155572
  • (78) X. Li, N. Chang, Y. Wang, H. Huang, Y. Pei, L. Wang, X. Li, ATOM: Automatic Maintenance of GUI Test Scripts for Evolving Mobile Applications, in: 2017 IEEE International Conference on Software Testing, Verification and Validation (ICST), 2017, pp. 161–171. doi:10.1109/ICST.2017.22.
  • (79) K. Moran, M. Linares-Vásquez, C. Bernal-Cárdenas, C. Vendome, D. Poshyvanyk, Automatically Discovering, Reporting and Reproducing Android Application Crashes, in: 2016 IEEE International Conference on Software Testing, Verification and Validation (ICST), 2016, pp. 33–44. doi:10.1109/ICST.2016.34.
  • (80) M. Linares-Vásquez, M. White, C. Bernal-Cárdenas, K. Moran, D. Poshyvanyk, Mining Android App Usages for Generating Actionable GUI-Based Execution Scenarios, in: 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories, 2015, pp. 111–122. doi:10.1109/MSR.2015.18.
  • (81) R. N. Zaeem, M. R. Prasad, S. Khurshid, Automated Generation of Oracles for Testing User-Interaction Features of Mobile Apps, in: 2014 IEEE Seventh International Conference on Software Testing, Verification and Validation, 2014, pp. 183–192. doi:10.1109/ICST.2014.31.
  • (82) T. Takala, M. Katara, J. Harty, Experiences of System-Level Model-Based GUI Testing of an Android Application, in: 2011 Fourth IEEE International Conference on Software Testing, Verification and Validation, 2011, pp. 377–386. doi:10.1109/ICST.2011.11.
  • (83) I. A. Salihu, R. Ibrahim, Systematic Exploration of Android Apps’ Events for Automated Testing, in: Proceedings of the 14th International Conference on Advances in Mobile Computing and Multi Media, MoMM ’16, ACM, New York, NY, USA, 2016, pp. 50–54. doi:10.1145/3007120.3011072.
    URL http://doi.acm.org/10.1145/3007120.3011072
  • (84) H. Zhu, X. Ye, X. Zhang, K. Shen, A Context-Aware Approach for Dynamic GUI Testing of Android Applications, in: 2015 IEEE 39th Annual Computer Software and Applications Conference, Vol. 2, 2015, pp. 248–253. doi:10.1109/COMPSAC.2015.77.
  • (85) X. Li, Y. Jiang, Y. Liu, C. Xu, X. Ma, J. Lu, User Guided Automation for Testing Mobile Apps, in: 2014 21st Asia-Pacific Software Engineering Conference, Vol. 1, 2014, pp. 27–34. doi:10.1109/APSEC.2014.13.
  • (86) S. Hao, B. Liu, S. Nath, W. G. Halfond, R. Govindan, PUMA: Programmable UI-automation for Large-scale Dynamic Analysis of Mobile Apps, in: Proceedings of the 12th Annual International Conference on Mobile Systems, Applications, and Services, MobiSys ’14, ACM, New York, NY, USA, 2014, pp. 204–217. doi:10.1145/2594368.2594390.
    URL http://doi.acm.org/10.1145/2594368.2594390
  • (87) Y. Li, Z. Yang, Y. Guo, X. Chen, DroidBot: A Lightweight UI-Guided Test Input Generator for Android, in: 2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C), 2017, pp. 23–26. doi:10.1109/ICSE-C.2017.8.
  • (88) K. Jamrozik, A. Zeller, DroidMate: A Robust and Extensible Test Generator for Android, in: 2016 IEEE/ACM International Conference on Mobile Software Engineering and Systems (MOBILESoft), 2016, pp. 293–294. doi:10.1109/MobileSoft.2016.066.