GUI testing is notoriously difficult. GUI tests are brittle: a small change to the underlying code can break a test. Even worse is a GUI test that sometimes passes and sometimes fails without any change in the code, tests, or environment. Such tests are called flaky tests. Flaky tests often arise from non-deterministic execution environments. Take, for example, an Android app that needs to connect to the internet. Depending on network stability, the connection may take more or less time. If this uncertainty is not properly handled, a test may pass when the connection is fast and fail when it is slow, manifesting flaky behavior.
Flaky tests frustrate developers and significantly hinder test automation. Developers often run tests to verify that their latest changes to a code repository did not break any previously working functionality, i.e., regression testing. Ideally, a test failure would be due to the code changes, and developers could focus on debugging these changes. Unfortunately, some test failures are not due to the code changes but due to flakiness. Identifying whether a failure is due to code changes or to flaky tests may require tremendous effort and slow down software development. Google statistics (Flaky Tests at Google and How We Mitigate Them, May 2016: https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html) show that in practice 84% of transitions from pass to fail in regression testing involve a flaky test, which causes significant drag on software engineers. Facebook recently established the detection of flaky tests as a top research problem for software testing 2.
Today, the technology to detect flaky tests is not mature. Developers face a considerable number of failures due to flaky tests in regression testing 10. They struggle to distinguish them from failures that are due to a recently introduced regression. A typical way to detect flaky tests is to rerun failing tests repeatedly (called RERUN). Specifically, developers rerun each failing test multiple times after witnessing the failure. If some rerun passes, the test is marked flaky; otherwise, it is marked as unknown. Despite being simple and often used in practice, RERUN is unreliable. Flaky tests are non-deterministic by definition, so there is no guarantee that the outcome of a flaky test will change within a given number of reruns. Besides, RERUN is extremely costly: a GUI test typically executes slowly and may take minutes for a single run.
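The RERUN procedure described above can be sketched in a few lines. This is a minimal illustration, not the tooling used by any particular project: the function names, the default of five reruns, and the scripted stand-in for a GUI test are all assumptions made for the example.

```python
from typing import Callable, Iterable


def rerun(test: Callable[[], bool], max_reruns: int = 5) -> str:
    """Classify a test that has already failed once by rerunning it.

    `test` returns True on pass and False on failure.
    """
    for _ in range(max_reruns):
        if test():
            return "flaky"    # a previously failing test passed on rerun
    # Either a real regression or a flaky test that kept failing.
    return "unknown"


def make_scripted_test(outcomes: Iterable[bool]) -> Callable[[], bool]:
    """Hypothetical stand-in for a GUI test: replays a fixed list of outcomes."""
    it = iter(outcomes)
    return lambda: next(it)


print(rerun(make_scripted_test([False, False, True])))
print(rerun(make_scripted_test([False] * 5)))
```

The second call shows the weakness noted above: a test that happens to fail in every rerun is reported as unknown, even if it is in fact flaky.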
DeFlaker 3 detects flaky tests with differential coverage analysis. A test is marked as flaky if it fails and, at the same time, did not cover any of the latest code changes during regression testing. Unfortunately, DeFlaker only works on failing tests and cannot examine whether tests that passed are flaky. Flaky tests that pass in the current regression testing cannot be detected; thus, they remain in the test suite.
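The decision rule described above can be written as a small function. This is a simplified sketch of the rule as stated in this paragraph, not DeFlaker's actual implementation; the string-based representation of covered and changed code elements is an assumption made for illustration.

```python
def differential_coverage_verdict(test_failed: bool,
                                  covered: set,
                                  changed: set) -> str:
    """Apply the differential-coverage rule to a single test outcome.

    `covered` holds the code elements the test executed and `changed` holds
    the elements modified by the latest commit (both hypothetical encodings).
    """
    if not test_failed:
        return "not examined"            # passing tests are never inspected
    if covered & changed:
        return "possibly due to change"  # the failure may be a real regression
    return "flaky"                       # failed without touching changed code


print(differential_coverage_verdict(True, {"A.foo", "B.bar"}, {"C.baz"}))
print(differential_coverage_verdict(False, {"A.foo"}, {"C.baz"}))
```

The second call illustrates the limitation discussed above: a passing test is not examined at all, so a flaky test that happens to pass stays in the suite.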
In this work, we propose a proactive approach to expose flaky tests of Android apps. Our approach takes a passing test as input and executes it in different possible execution environments. If there is an environment in which the test fails, the test is deemed flaky. With this approach, flaky tests can be exposed and quarantined from a test suite before regression testing occurs. Thus, failures due to test flakiness can be avoided in regression testing. This is with the goal of reducing developer frustrations when they “discover” that a test that they have been using all along is flaky! Instead, we check whether a test is flaky before introducing it into the test-suite.
Observation. Android adopts the single GUI thread model in which events are processed by a UI thread sequentially. To avoid blocking the UI thread for responsiveness, a long-running task, such as accessing the internet, is offloaded to an async thread. Once the task is completed, the async thread updates the result by submitting an event (called an async event) to the UI thread, which results in event racing. Under this model, event execution order varies from run to run due to non-deterministic execution environments. An app might behave normally for most orders but act in an unexpected way for certain orders. This eventually leads to a phenomenon where a test passes sometimes and fails at other times.
Insight. A flaky test can be exposed by exploring the space of event execution orders that may result from different execution environments. Instead of exhaustively running a test in all possible execution environments, we examine a test by exploring the space of feasible event execution orders to check whether there exists an event execution order in which the test fails. A test is deemed flaky if a failure is detected during exploration. Our insight is that Android apps often assign environment-related tasks such as accessing background service or the internet to async threads. App behaviors in different execution environments can be explored by scheduling events.
Challenges. Computing the space of possible event execution orders is difficult. As explained above, event execution orders are determined when async threads submit async events at runtime. A test run may involve many async threads, which are launched from different layers including the Android framework, third-party libraries, and the app under test itself. This makes them hard to track. Moreover, computing the possible event execution orders involves resolving event dependencies due to thread synchronization. A GUI test often runs in a separate thread maintained by a testing framework such as Espresso 12. Thus, a thorough analysis of the interleavings between the app, the Android framework, and the testing framework is required. Existing techniques for Android apps focus mainly on finding specific concurrency bugs, e.g., CAFA 16 is limited to finding race errors due to use-after-free violations. Additionally, they only support app analysis and provide no support for testing frameworks.
Solution. We propose a system-level dynamic analysis to resolve thread synchronization dependencies. We run apps in the debug mode such that threads in the whole Android runtime can be monitored and controlled. Thread synchronization dependencies can be resolved by manipulating threads, e.g., suspending a thread to observe others. Besides, a GUI test consists of multiple statements with various GUI operations such as a button click. These operations create multiple events to complete their execution, which leads to a huge space of possible event execution orders. Enumerating all of them is costly, considering a GUI test may run slowly and take several minutes. To address this, for each statement, such as a button click, we group the events it generates. Finally, we schedule the async events between these groups.
Experiments. We evaluate our approach on DroidFlaker which contains 28 widely-used Android apps including Firefox app from Mozilla. We detected 19 out of 24 previously known GUI related flaky tests. We analyzed their root causes and categorized them into three categories. Additionally, we discovered 245 flaky tests that were previously unknown.
Contributions. Our contributions can be summarized as follows.
We propose a proactive approach to expose flaky tests in a test suite. A flaky test can be exposed and quarantined before regression testing so that failures due to flaky tests can be avoided. It is the first approach that automatically detects flaky GUI tests for Android apps.
We develop a technique FlakeShovel which can control threads launched from different layers including apps, Android framework and testing framework, and perform a system level dynamic analysis to precisely resolve dependencies between events.
We collect a subject-suite, called DroidFlaker, that contains 28 widely-used apps with GUI tests written by developers. To facilitate future research on flaky tests, we make our tool FlakeShovel and subject-suite DroidFlaker publicly available at: https://github.com/FlakeShovel/FlakeShovel
2. Android concurrency and testing
Figure 1 depicts the Android hybrid event-driven concurrency model as well as the Android GUI testing framework. Every app has a main thread (also called the UI thread), which maintains an event queue and a looper associated with that queue. UI or system events generated by users or the system are added to the event queue. The looper dequeues events sequentially and dispatches them to the corresponding handlers for processing. Android adopts the single-UI-thread model, in which only the main thread can access GUI objects. To prevent long-running tasks such as network access from blocking the GUI, these tasks are offloaded to async threads. Once such a task finishes, the async thread posts an event (marked in blue in Figure 1) to the main thread, which then updates the GUI objects with the result. However, this concurrency model can lead to event racing. As shown in Figure 1, the main thread and async threads run concurrently. The moment at which an async thread finishes its task and posts an async event is non-deterministic and depends on the current execution environment. Consequently, the order of events processed by the main thread is non-deterministic as well. In the example shown in Figure 1, multiple event orders might occur during execution.
Android provides testing frameworks for developers to write GUI tests, which simulate user interactions to exercise app functionalities. GUI tests run on physical devices or emulators and interact with the user interface to generate events. To achieve more reliable tests, testing frameworks provide a set of mechanisms to synchronize test automation interactions with the user interface. For instance, when method onView() is invoked in a test, Espresso waits to perform the corresponding UI action or assertion until the event queue is empty, certain async threads (e.g., AsyncTask instances) have terminated, and user-defined resources are idle.
3. A Motivating Example
In this section, we use a simple example to explain how a flaky test occurs in Android apps and the challenges in detecting such flaky tests. The example comes from the RapidPro Surveyor app for Android, and the related code snippets are shown in Listings 1-3. The test (Listing 1) first launches an activity (Listing 2) that is used to capture location data of the Android device. When created, the activity connects to the Google API client (line 2, Listing 2) and requests location data of the device. During the connection process, the Google API client creates a few worker threads (see Listing 3) to complete the connection. Finally, the Google API client accesses the location data and sends it to the activity. Then the test clicks a button on the activity to obtain the location data from the activity. In the end, the test checks that the obtained data is not NULL.
Despite being simple, the test is flaky. As mentioned before, the test is executed in a testing thread, the activity runs on the UI thread of the app, and the operation by which the Google API client obtains location data is executed on an async thread. Although the test uses onView() to synchronize GUI operations, the testing thread cannot synchronize with the async thread that fetches the location data. Thus, the async thread might deliver the location data to the activity before or after the testing thread checks it. If the check occurs before the activity receives the location data, the test fails. Otherwise, the test passes. As a result, the test passes in some runs and fails in others.
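The race described above can be reproduced in a deterministic toy model. The sketch below is not the RapidPro Surveyor code: the event names, the coordinate value, and the single-function replay are hypothetical, and they only mirror the structure of the scenario (launch, asynchronous location update, button click, non-null check).

```python
def run_test(order):
    """Replay one event order and return True if the test's assertion holds."""
    location = None   # data held by the activity
    obtained = None   # data the test reads when it clicks the button
    for event in order:
        if event == "location_update":   # async event posted by a worker thread
            location = (1.0, 2.0)
        elif event == "click_capture":   # the test clicks and reads the data
            obtained = location
    return obtained is not None          # the test checks the data is not NULL


fast_connection = ["launch_activity", "location_update", "click_capture"]
slow_connection = ["launch_activity", "click_capture", "location_update"]

print(run_test(fast_connection))   # the update arrives before the click
print(run_test(slow_connection))   # the click happens before the update
```

The same test code yields different verdicts purely because the async event lands at a different position in the event order.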
Detecting such flaky tests is difficult. First of all, a flaky test is "hiding" in the test suite and may pass in most execution environments. Nothing distinguishes it from other passing tests until it fails in some execution. Existing techniques for detecting flaky tests, such as DeFlaker 3, apply to failing tests and cannot examine whether a passing test is flaky. Although many existing techniques can detect concurrency bugs, they face challenges in detecting a flaky test. As shown in the example, a test in an Android app is often run by a testing framework, and many threads might come from the Android system. Detecting flaky tests requires analyzing not only the app under test but also the testing framework. However, existing techniques typically focus on app analysis. For instance, DroidRacer 24 records execution traces and detects data races in apps by analyzing the collected traces offline. ERVA 17 takes a data race report generated by other tools such as DroidRacer and verifies whether the reported data race is a true positive. AsyncDroid 20 detects bugs in an app by exploring alternative execution orders of event handlers that are created by the app. None of them analyzes testing frameworks or applies to flaky test detection. This urgent need motivates us to develop a technique that can detect flaky tests for Android apps.
Consider a GUI test consisting of a sequence of program statements $\langle s_1, \ldots, s_n \rangle$, as in Figure 2 (a). One possible execution trace consists of a sequence of executed events $\langle e_1, \ldots, e_m \rangle$, one of which, $e_a$, is an async event (i.e., it is generated by an async thread). For simplicity, we assume only one async event is generated for this test (in general, many async events are generated in a single test execution, which is considered in our approach). As previously stated, the event execution order of the test might change from run to run due to non-deterministic execution environments. For instance, the async event $e_a$ might be executed before or after a neighboring event, depending on how much time its corresponding thread needs to complete the task and post the event. Our goal is to compute the possible event execution orders for this test and check whether there exists an order in which the test fails. If a test failure is detected, the test is deemed flaky.
Clearly, the position of the async event $e_a$ in the sequence of events can vary from one test run to another. However, the space of possible positions of $e_a$ is constrained to lie between two particular events because of dependencies that result from thread synchronization. Call these two events the lower bound $e_l$ and the upper bound $e_u$. Then $e_a$ cannot be executed earlier than $e_l$ or later than $e_u$, no matter how the execution environment changes (e.g., the network connection becomes slow). If $e_l$ and $e_u$ can be located, the possible event execution orders for the test can be computed.
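Once the two bounding events are known, the feasible orders follow mechanically: the async event may occupy any slot after the lower bound and before the upper bound. The sketch below illustrates this under the single-async-event assumption made above; the event names and the list-based trace encoding are hypothetical.

```python
def feasible_orders(trace, async_event, lower, upper):
    """Enumerate orders in which `async_event` sits after `lower` and before `upper`.

    `trace` is one observed event order; all other events keep their
    relative order, as they are processed sequentially by the UI thread.
    """
    rest = [e for e in trace if e != async_event]
    first_slot = rest.index(lower) + 1   # right after the lower-bound event
    last_slot = rest.index(upper)        # right before the upper-bound event
    return [rest[:i] + [async_event] + rest[i:]
            for i in range(first_slot, last_slot + 1)]


observed = ["e1", "e2", "a", "e3", "e4", "e5"]
for order in feasible_orders(observed, "a", lower="e1", upper="e5"):
    print(order)
```

With five synchronous events and bounds at the first and last of them, the async event has four feasible positions, so four orders need to be examined.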
Computing space of event execution orders.
In event-driven programming, an event is designed for communication among multiple components and is well encapsulated. Event dependencies are typically handed over to other components such as event handlers. Thus, it is challenging to identify such dependencies by capturing and analyzing the events themselves. Our idea is to link events to statements in a test, since all events are triggered by the test. We execute a test statement by statement, record all events triggered by each statement, and build a map between them, as shown in Figure 2 (b). Suppose the async event $e_a$ is triggered by statement $s_i$. Consequently, $e_a$ cannot be executed earlier than the first event triggered by $s_i$. Thus, the lower bound $e_l$ of the async event is identified as that first event. Locating the upper bound $e_u$ involves identifying which events depend on $e_a$, i.e., events that occur only after $e_a$ is processed. Testing frameworks typically use thread synchronization to guarantee that one event occurs before another. For instance, in an Espresso test, a statement that invokes the onView() method waits until the specified threads or resources are idle; until then, it is not executed. Therefore, event dependency analysis requires resolving thread synchronization dependencies. As mentioned earlier, traditional program analysis struggles to perform such analysis because Android testing often involves many threads from third-party libraries, the Android framework, and the testing framework. Existing analysis techniques can hardly overcome these obstacles, as they are often restricted to analyzing the app code.
To address this challenge, we propose a what-if analysis. Specifically, after the test is launched, we hook the async thread that posts the async event $e_a$ at runtime, suspend that thread, and let the other threads run freely. At the same time, we monitor the testing thread and check at which statement of the test it stops to wait for the hooked thread to complete. Suppose that the testing thread stops at statement $s_j$. We then consider that the operations in $s_j$ depend on $e_a$ and will not be executed until $e_a$ is processed. Thus, the first event triggered by $s_j$ is the upper bound $e_u$ of $e_a$; that is, $e_a$ has to be executed before $e_u$. The idea behind our approach is to ask: what if it takes forever to compute $e_a$? In that case, operations in the test that depend on $e_a$ will not be triggered because of thread synchronization, whereas those that do not depend on $e_a$ will be executed. In this way, the events that depend on $e_a$ are identified dynamically.
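The what-if analysis can be imitated with ordinary threads. The sketch below is only an analogy under stated assumptions: it uses Python threads rather than the Android runtime, models "suspending" the async thread with a gate that the analysis controls, and detects a blocked testing thread with a timeout. All names and the 0.2-second timeout are hypothetical.

```python
import threading


def what_if_upper_bound(statements, timeout=0.2):
    """Return the index of the statement at which the testing thread blocks
    while the async thread is held back, or None if it never blocks.

    Each statement is a callable that receives `async_done`, an Event that is
    set once the async event has been posted.
    """
    release = threading.Event()      # gate keeping the async thread "suspended"
    async_done = threading.Event()   # set when the async event is posted
    current = {"index": None}

    def async_worker():
        release.wait()               # held back until the analysis frees it
        async_done.set()             # posts the async event

    def testing_thread():
        for i, stmt in enumerate(statements):
            current["index"] = i
            stmt(async_done)
        current["index"] = None      # all statements completed

    worker = threading.Thread(target=async_worker)
    tester = threading.Thread(target=testing_thread)
    worker.start()
    tester.start()
    tester.join(timeout)             # let independent statements run
    blocked_at = current["index"] if tester.is_alive() else None
    release.set()                    # free the async thread so both threads end
    worker.join()
    tester.join()
    return blocked_at


independent = lambda done: None          # does not depend on the async event
dependent = lambda done: done.wait()     # synchronizes with the async thread

print(what_if_upper_bound([independent, independent, dependent, independent]))
```

In this toy run the first two statements execute even though the async work never completes, and the third is where the testing thread stops, so it marks the upper bound of the async event's schedule space.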
Reducing space of event execution orders.
Since the position of the async event $e_a$ lies between its lower bound $e_l$ and its upper bound $e_u$, the possible event execution orders for the test can be calculated by moving $e_a$ one position at a time, starting after $e_l$, until reaching $e_u$. In practice, the space of event execution orders can be huge because one GUI operation often triggers multiple events at runtime, e.g., one click generates "click down" and "click up" events. One test may trigger hundreds of events, which leads to a huge space of possible event execution orders. Exploring all of them is costly, considering that a GUI test runs slowly. Thus, we only consider event execution orders in which $e_a$ is located immediately before the first event of a statement, i.e., at the boundaries between the event groups of consecutive statements shown in Figure 2 (b). This is reasonable because app behavior is more likely to be influenced when an async event is executed after the execution of a statement is completed.
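The effect of restricting the async event to statement boundaries can be seen by counting slots. The sketch below is illustrative only: the statement names, event names, and index-based bounds are hypothetical, and it assumes the boundary just before the upper-bound statement is itself a candidate position.

```python
def all_slots(stmt_events, trigger, upper):
    """Count every position between the first event of the trigger statement
    and the first event of the upper-bound statement (one slot per event of
    statements trigger .. upper-1)."""
    return sum(len(events) for _, events in stmt_events[trigger:upper])


def boundary_slots(stmt_events, trigger, upper):
    """List the statements whose first event the async event may directly
    precede when only statement boundaries are explored."""
    return [name for name, _ in stmt_events[trigger + 1:upper + 1]]


# Hypothetical statement-to-events map; the async event is triggered by s1
# (index 0) and the testing thread waits for it at s4 (index 3).
stmt_events = [
    ("s1", ["e1", "e2"]),
    ("s2", ["e3", "e4", "e5"]),
    ("s3", ["e6", "e7"]),
    ("s4", ["e8"]),
]

print(all_slots(stmt_events, trigger=0, upper=3))
print(boundary_slots(stmt_events, trigger=0, upper=3))
```

Here the exhaustive exploration would need seven test runs, while the boundary-only exploration needs three; the gap widens as statements trigger more events.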
Suppose we are exploring an event execution order in which the async event $e_a$ is executed immediately before some event $e_k$ (see Figure 2 (b)). We first query the previously generated map between events and statements to identify which statement triggers $e_k$; call it statement $s_k$. Once the test run is launched, we hook the async thread that posts $e_a$ and suspend the thread so that the event cannot be posted. At the same time, we monitor the testing thread and check which statement is being executed by querying the program counter in the Android runtime. When the program counter reaches $s_k$, we suspend the testing thread and release the async thread that we suspended earlier. After the async thread finishes its task, we release the testing thread. In this way, $e_a$ is executed immediately before the first event of statement $s_k$.
Figure 3 shows the workflow of our approach. Given a test and app under test, it performs a concrete execution to trace events that the test generates and builds a map between statements in the test and events that are triggered by these statements. Then the approach executes the test multiple times to compute possible schedules for async events and generates a set of event orders that might occur in execution environments. Finally, we explore these possible event execution orders for the test. During the exploration, if a test failure is detected, the test is identified as a flaky test.
5.1. Event tracing and mapping
Event tracing is often used in dynamic analysis of Android apps. It can be achieved by simply logging events that are generated at runtime. However, such techniques cannot fulfill our task. Event information (e.g., event id) produced in logs is dynamically generated at Android runtime and changes in a different run. Our approach requires an event identifier which can be used to identify an event across different test runs. Async events that are identified during event tracing need to be hooked and scheduled in runs which are performed for event order exploration. This poses a challenge for existing techniques.
We identify an event based on the interactions between the event and the app under test at runtime. Two events that are triggered in different test runs are considered the same program behavior if (1) they are triggered by the same test statement and (2) they are processed by the same sequence of methods at runtime. For instance, a pressDown event is associated with an identifier constructed from the line number of the statement that triggers the event and the signatures of the sequence of methods that process the event. This practice of event identification comes from our investigation of the Android framework. Events are widely used for thread communication and are managed by Handlers associated with threads. Events are handled by different Handlers according to where they come from.
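The two identification criteria can be captured as a value type whose equality ignores everything that changes between runs. This is a minimal sketch: the field names, the dictionary-shaped observation, and the example method signatures are hypothetical and are not taken from FlakeShovel.

```python
from typing import NamedTuple, Tuple


class EventId(NamedTuple):
    """Run-independent identity of an event."""
    line: int                    # line number of the triggering test statement
    handlers: Tuple[str, ...]    # signatures of the methods that process the event


def identify(observation: dict) -> EventId:
    """Build an identifier, deliberately dropping the runtime-generated id."""
    return EventId(observation["line"], tuple(observation["handlers"]))


run1 = {"runtime_id": 0x1A2B, "line": 42,
        "handlers": ["Handler.dispatchMessage", "View.performClick"]}
run2 = {"runtime_id": 0x9F00, "line": 42,
        "handlers": ["Handler.dispatchMessage", "View.performClick"]}

print(identify(run1) == identify(run2))
```

Because the identifier is hashable, it can also serve directly as a dictionary key when mapping statements to events across runs.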
Tracing and mapping.
Algorithm 1 outlines the procedure of event tracing and mapping. It first launches the app under test and takes control of the Android runtime in which the app runs, using a module called ARTHandler. ARTHandler runs the input test in the testing thread and executes its statements one by one. When a statement is executed, ARTHandler monitors the event queue of the UI thread and hooks the injected events. For each event, ARTHandler records a tuple $\langle a, m \rangle$, where $a$ denotes whether it is an async event and $m$ denotes the signatures of the sequence of methods that have processed the event. This tuple, along with the line number of the statement being executed, forms the identifier of the event, which is stored in a list. As stated before, a statement in the test might launch long-running tasks that are executed in async threads. Async events might take a long time to be posted. To avoid missing async events that are triggered by a single statement, we keep hooking events until two criteria are satisfied: (a) there are no new events and (b) the event queue of the UI thread is empty, which often indicates that the system is not running tasks. This practice is also used in the Espresso testing framework. Finally, the map between statements and events is returned.
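The tracing loop with its two stopping criteria can be sketched as follows. This is not Algorithm 1 itself: the runtime interface (`execute`, `poll_events`, `queue_is_empty`) and the scripted runtime used to drive it are hypothetical stand-ins for the hooks that ARTHandler obtains from the Android runtime.

```python
def trace_and_map(statements, runtime):
    """Execute statements one by one and map each line to the events it triggers.

    Hooking for a statement stops only when (a) a poll yields no new events
    and (b) the UI thread's event queue is empty.
    """
    mapping = {}
    for line, stmt in statements:
        runtime.execute(stmt)
        seen = []
        while True:
            new = runtime.poll_events()
            seen.extend(new)
            if not new and runtime.queue_is_empty():
                break
        mapping[line] = seen
    return mapping


class ScriptedRuntime:
    """Hypothetical runtime that replays scripted batches of events."""

    def __init__(self, script):
        self.script = script     # statement -> list of event batches
        self.batches = []

    def execute(self, stmt):
        self.batches = list(self.script[stmt])

    def poll_events(self):
        return self.batches.pop(0) if self.batches else []

    def queue_is_empty(self):
        return not self.batches  # pending batches model work still in flight


script = {
    "click": [["down", "up"], [], ["async_result"]],  # async event arrives late
    "check": [["assert_view"]],
}
print(trace_and_map([(10, "click"), (11, "check")], ScriptedRuntime(script)))
```

The empty batch in the middle of the "click" script shows why criterion (a) alone is not enough: a momentary lull would otherwise end tracing before the late async event is recorded.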
5.2. Identifying event schedule space
To compute possible event execution orders, we perform a what-if dynamic analysis to resolve event dependencies that are caused by thread synchronization in apps and testing frameworks. Algorithm 2 shows the procedure of resolving event dependencies. It takes the event trace $\tau$ generated in the previous step as input. For each async event $e$ in $\tau$, the algorithm launches the test and starts to hook event $e$. Once $e$ is hooked, the algorithm suspends the thread $t$ that posts $e$ so that $e$ cannot be posted. Meanwhile, it keeps checking the status of the testing thread. If the status of the testing thread is WAITING, the algorithm considers that the testing thread is performing thread synchronization with threads in the app and is waiting for $e$ to be executed. Thus, we consider that the statement $s$ being executed in the testing thread attempts to trigger an event (say $e'$) that depends on $e$. Therefore, the schedule space of $e$ is bounded by the first event triggered by $s$, and statement $s$ is recorded as the upper bound of the schedule space of async event $e$. Once the upper bound is set, the suspended thread $t$ is resumed. In the end, schedule spaces for all async events in $\tau$ are identified and recorded.
5.3. Scheduling events
The schedule space of each async event in the event trace has been identified in the previous steps. Now we explore event orders during test execution. An async event can be represented by a triple $\langle e, l, u \rangle$, where $l$ and $u$ are the bounds of the schedule space of $e$. Specifically, $l$ is the index of the statement in the test that triggers $e$, and $u$ is the index of the statement that triggers the upper bound event of $e$.
Similar to schedule space identification, we can schedule an async event $e$ by operating on threads. We first hook event $e$ after the test is launched and suspend the thread that posts $e$. Then, we let the testing thread run and monitor whether the statement being executed is statement $s_k$, where $k$ lies between the bounds $l$ and $u$ of the schedule space of $e$. Once statement $s_k$ is reached, we suspend the testing thread and release the suspended thread to post $e$. After the async thread has terminated or is idle, i.e., event $e$ has been posted, we release the testing thread. In this way, event $e$ is executed prior to statement $s_k$. In the next test run, we schedule $e$ to be executed prior to statement $s_{k+1}$, and so on until all statements between $l$ and $u$ are explored. This procedure is repeated for each async event in the trace so that the space of possible event execution orders can be systematically explored for the test.
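The exploration over schedules can be summarized as a double loop: for every async event, try each statement boundary in its schedule space and stop at the first failing run. The sketch below is a simplified model, not the FlakeShovel scheduler: `run_with_schedule` is a hypothetical callback standing in for one controlled test run, and the bounds are assumed to be statement indices with the trigger statement excluded and the upper-bound statement included.

```python
def explore(async_events, run_with_schedule):
    """Search for a schedule under which the test fails.

    `async_events` holds (event, lower, upper) triples, where `lower` is the
    index of the triggering statement and `upper` the index of the statement
    that waits for the event. `run_with_schedule(event, k)` performs one test
    run with `event` forced to execute just before statement k and returns
    True if the test passes.
    """
    for event, lower, upper in async_events:
        for k in range(lower + 1, upper + 1):
            if not run_with_schedule(event, k):
                return event, k   # a failing order exists: the test is flaky
    return None                   # no failing order found in the explored space


def fake_run(event, k):
    """Hypothetical test run: fails if 'location' is delayed past statement 2."""
    return not (event == "location" and k >= 3)


print(explore([("location", 1, 4)], fake_run))
```

The search stops at the first failure because a single failing order is sufficient to deem the test flaky.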
In Algorithm 2, each async event requires one test run for schedule space identification. If $n$ async events are generated for a test, we need to run the test $n$ times to identify the space of event execution orders, which is costly. To address this issue, we optimize schedule space identification. When the schedule space of an async event is identified, instead of terminating the execution, we continue executing the test to identify schedule spaces for subsequent async events so as to reduce the number of test runs. In this case, we may suspend multiple async threads at the same time and may have to release more than one async thread to resolve a thread synchronization dependency, which results in a group of async events having the same upper bound (latest time). Only in this case do we rerun the test to identify the schedule space of each event in the group. Thus, the total number of test runs can be significantly reduced.
Our system is implemented in Scala and runs on a computer that is connected to a physical Android device or an emulator. Unlike existing techniques, it requires no instrumentation of apps and no instrumentation or modification of the Android framework, and it can be easily adapted to different versions of Android.
Taking control of Android Runtime.
We leverage the Android debug mode to control the Android runtime. The Android framework allows an app to be run in debug mode. In this mode, we interact with the Android runtime using ADB to remotely monitor the app state and manipulate thread execution, e.g., executing step by step.
Android adopts the event-driven model, in which each app has an event queue for storing events that occurred and processes them one by one. In the debug mode, we are allowed to set a breakpoint at method enqueueMessage() which is in charge of enqueuing events. Whenever the method is invoked, our system is informed and performs predefined operations such as suspending the event-posting thread. In such a way, our system can hook any event that occurs in the Android runtime before the event is posted into the event queue.
In the debug mode, we are allowed to inspect the threads that are running in the Android runtime, check their statuses, and operate on them by sending commands, e.g., a command to release a suspended thread. In this way, the UI thread and the testing thread can be identified during a test execution. A breakpoint is inserted at each test statement so that we can fully monitor and control the testing thread, including querying the index of the statement being executed and executing step by step. We can also examine the stack frames of a thread to check the methods executed in the thread. Such data is used to identify an event.
We perform evaluation on the effectiveness of FlakeShovel in detecting flaky tests that reside in test suites of real world Android apps. Our evaluation aims to address the following research questions:
RQ1: Can FlakeShovel examine and detect known flaky tests?
RQ2: How does FlakeShovel compare with existing techniques in terms of the number of detected flaky tests?
RQ3: Can FlakeShovel be used to discover new flaky tests in apps?
7.1. Subject apps
Android app testing has been heavily explored. Various benchmarks are used to evaluate the effectiveness of automated testing of Android apps, such as AndroTest 8, which contains 68 open source Android projects, and the benchmark 29 from Wang et al., which contains 68 industrial Android apps. However, few apps from those benchmarks come with a test suite from developers, let alone GUI tests. On the other hand, test flakiness is an urgent challenge, especially for GUI tests. Unfortunately, there are no Android app benchmarks to support test flakiness research.
Given the pressing need, we developed the first subject-suite DroidFlaker which is used to study GUI test flakiness. It contains 28 widely-used Android apps including Mozilla Firefox Lite and WordPress as shown in Table 1. There are more than 5000 Android instrumentation tests from developers that run on physical devices and emulators.
The challenge we face in building this data set is that publicly available Android projects rarely have tests from developers. To overcome this challenge, we collect Android projects with the following strategies. First, we search for well-known Android projects such as Firefox on GitHub and select any project that contains tests under the folder “../src/androidTest”. Second, we search for the label “@FlakyTest” on GitHub and on the Google website, and select any Android project in which at least one instrumented test is labelled “@FlakyTest”. Tests labelled “@FlakyTest” in a test suite are flaky tests reported by developers.
7.2. Experiment setup
We conduct two studies to answer the above research questions. Study 1 addresses RQ1 and RQ2, and study 2 addresses RQ3. For study 1, we evaluate FlakeShovel on GUI tests in DroidFlaker that have been annotated as flaky by developers, to check whether FlakeShovel can detect such flaky tests. First, we select all tests in the benchmark that are marked as flaky and exclude tests that are not GUI tests (e.g., tests for database operations), then execute them on Android emulators. The passing tests are used in study 1. In the end, 24 GUI tests are collected from 6 apps, as shown in Table 2. FlakeShovel and RERUN execute each of them. If a test failure occurs during execution, the flaky test is considered to be successfully detected. The results of each test are recorded for analysis. For study 2, we exclude the tests used in study 1, execute the remaining Android instrumentation tests on emulators, and select the passing tests for evaluation. Eventually, 1444 tests are obtained from the 28 apps and are executed by FlakeShovel. If a test failure is detected, FlakeShovel reports the test as a flaky test.
We conducted experiments on a physical machine with 64 GB RAM and a 56-core Intel(R) Xeon(R) E5-2660 v4 CPU, running a 64-bit Ubuntu 16.04 operating system. Each execution instance runs in a Docker container to minimize potential interference between running instances. The app under test runs on an Android 9 (x86) emulator. Each execution instance handles one test case, for which the Android emulator is initialized to a fresh state at the beginning to provide a clean testing environment.
7.3. RQ1: Efficacy
Table 2 column groups: Test Id | App:Framework | Method name | FlakeShovel | RERUN
Table 2 shows the results of FlakeShovel on the data set of known flaky tests. The first column indicates test Ids, the second column shows app names and the testing frameworks used in the apps, and the third column indicates test method names. Column "#Op" represents how many statements in a test perform thread synchronization during testing. This is computed by manually counting synchronization operations such as waitFor() and await(), as well as onView() and onData() in the Espresso framework. Column "#Events" indicates the number of events observed by FlakeShovel during detection. Column "#Run" shows the number of times a test is executed for flakiness detection. Column "Time" reports the time used to detect a flaky test. Column "Ctg" indicates which category the flaky test belongs to in the root cause analysis (we identify three categories, C1-C3, as described below). Column "Succ" indicates whether the test is identified as a flaky test by FlakeShovel.
As shown in Table 2, FlakeShovel successfully detected 19 flaky tests. For tests 6, 7, and 8, FlakeShovel could execute them but failed to identify them as flaky. Code inspection shows that these tests are flaky because they use an unsophisticated synchronization mechanism, i.e., waiting for a fixed amount of time for asynchronous operations. These 3 tests extract metadata for given videos on the internet and wait a fixed 10 seconds for internet access. If it takes more than 10 seconds to connect to the internet, they fail. FlakeShovel monitors thread synchronization between testing frameworks and apps and stops delaying async events once this synchronization occurs. Thus, these tests passed without being identified as flaky. For tests 11 and 24, FlakeShovel failed to execute them due to configuration issues, e.g., preview_isShowing from the CameraView app passes on API 21 and fails on all later APIs: this test uses a view called TextureView, which was updated in later Android APIs.
Testing frameworks often provide various mechanisms to avoid test flakiness, e.g., Espresso uses the method onView() to synchronize view operations with the UIThread. The tests in Table 2 are developed with such testing frameworks. Why are they still flaky? To answer this question, we performed an empirical study on the root causes of these flaky tests. The root causes are classified into 3 categories:
Category 1 (C1): Tests are flaky due to non-deterministic execution environments. Apps often interact with background services or resources and exchange data with them. For some reason (e.g., being busy with other computation), these services or resources may be unable to respond in time, which leads to a Timeout exception in the UI thread or the testing framework and eventually causes a test failure. As shown in Table 2, this is the most common root cause: 12 tests belong to this category. Unfortunately, although testing frameworks provide mechanisms to synchronize GUI operations or pre-defined resources, they cannot handle such asynchrony in the execution environment.
Category 2 (C2): A test expects an implicit event execution order that may not always occur during execution. Events are used not only for data exchange between threads but also to perform operations, e.g., an intent is often used to launch an activity. A change in event execution order resulting from async threads can lead to different app behavior, such as the soft keyboard disappearing late, which leads to a test failure. In our study, 8 cases belong to this category.
Category 3 (C3): Flaky tests are caused by data races between the testing thread and threads in apps. In many cases, the data that a test uses to check app behavior is produced asynchronously, i.e., by a background thread. The data is sometimes updated late, so the test checks “old” data, which leads to a test failure. We have 4 such cases in our study.
In summary, despite testing frameworks’ support for eliminating flakiness, flaky tests still occur due to non-determinism from execution environments (C1 and C2) and developers overlooking certain cases resulting from thread concurrency (C3).
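The C3 pattern can be illustrated with a minimal sketch. It is written in plain Python rather than Android code, and every name in it (FakeApp, check_without_sync, check_with_sync) is hypothetical and not taken from any app in Table 2. A gate holds the background thread back so that the late update is reproduced deterministically instead of depending on scheduling.

```python
import threading


class FakeApp:
    """Hypothetical stand-in for an app whose data is produced by a background thread."""

    def __init__(self):
        self.label = "old"                 # value the test will inspect
        self.updated = threading.Event()   # set once the background update is done
        self._gate = threading.Event()     # lets the demo hold the worker back

    def start_background_update(self):
        def work():
            self._gate.wait()              # models the worker being scheduled late
            self.label = "new"
            self.updated.set()

        worker = threading.Thread(target=work)
        worker.start()
        return worker

    def release_worker(self):
        self._gate.set()


def check_without_sync(app):
    # Reads immediately and sees whatever value happens to be there.
    return app.label == "new"


def check_with_sync(app, timeout_s=5.0):
    # Waits for the update signal before reading.
    return app.updated.wait(timeout_s) and app.label == "new"


def demo():
    app = FakeApp()
    worker = app.start_background_update()
    stale = check_without_sync(app)   # worker is still held back: "old" data is checked
    app.release_worker()
    fresh = check_with_sync(app)      # blocks until the worker has written
    worker.join()
    return stale, fresh


print(demo())  # (False, True)
```

The unsynchronized check fails only because the worker is late, which is the kind of outcome change that depends on thread timing rather than on the code under test.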
7.4. RQ2: Comparison with existing techniques
Comparison with RERUN
RERUN is widely used to examine whether a test is flaky. A failed test is deemed flaky if the test passes in one of multiple reruns. The approach works in our setting as well: a passing test is deemed flaky if the test fails in one of multiple reruns. We take RERUN as the baseline tool and compare it with FlakeShovel. In our experiment, we rerun each test up to 20 times. If a failure is detected, the test is identified as a flaky test and the execution is terminated.
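The RERUN baseline as described here can be sketched in a few lines of Python. This is an illustrative sketch, not the script used in the experiments; rerun_detect and scripted_test are hypothetical names, and scripted_test is a test double that replays a fixed list of outcomes so the example is deterministic.

```python
from typing import Callable, Iterable, Tuple


def rerun_detect(run_test: Callable[[], bool], max_runs: int = 20) -> Tuple[bool, int]:
    """Rerun a normally passing test and stop at the first failure.

    Returns (flaky?, number of runs used).
    """
    for run in range(1, max_runs + 1):
        if not run_test():
            return True, run
    return False, max_runs


def scripted_test(outcomes: Iterable[bool]) -> Callable[[], bool]:
    """Hypothetical test double: replays the given outcomes, then keeps passing."""
    remaining = iter(outcomes)
    return lambda: next(remaining, True)


# Fails on the 4th run: detected after 4 runs.
print(rerun_detect(scripted_test([True, True, True, False])))   # (True, 4)

# Would fail only on the 21st run: missed within the 20-run budget.
print(rerun_detect(scripted_test([True] * 20 + [False])))       # (False, 20)
```

The second call shows the weakness of the baseline: a flaky test whose failure does not show up within the rerun budget is indistinguishable from a stable one.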
As shown in Table 2, in most cases FlakeShovel successfully detects a flaky test during the second run. In total, 19 flaky tests are detected, all of them in less than 3 minutes. RERUN successfully detects 6 flaky tests. Two of them are detected in the first run. The others are detected in 4 runs. In terms of the execution time of a single test, RERUN runs faster than FlakeShovel since FlakeShovel takes time for dynamic analysis. However, FlakeShovel detects many more flaky tests than RERUN.
With regard to the timing comparison with RERUN, note that RERUN was run only 20 times per test. As a result, the reported average of 127 seconds grossly underestimates the time RERUN would actually need, since the flaky tests it missed would require further reruns to expose.
Overall, FlakeShovel detects most flaky tests in the second run, i.e., in the event schedule space identification phase. The experiments also show that our optimization of schedule space identification is effective. Schedule space identification involves maximally delaying an async event, which is the condition most likely to trigger a flaky failure. We compute the schedule spaces of multiple async events at the same time, so a flaky test most likely fails in this phase. Therefore, this strategy can significantly reduce the number of runs needed to detect a flaky test.
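The following toy model shows why a single run with maximally delayed async events can be enough to expose a flaky test. It is a simplified illustration in Python and not FlakeShovel's implementation: the AsyncEvent type, the awaited_before_step field, and the example events are all assumptions made for this sketch. A baseline run lets every async event finish early, while a delayed run holds each event back until the test step that explicitly waits for it (or until the end if no step does).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

State = Dict[str, str]


@dataclass
class AsyncEvent:
    name: str
    effect: Callable[[State], None]
    # Index of the test step that waits for this event,
    # or None if the test never synchronizes with it.
    awaited_before_step: Optional[int] = None


def run_test(steps: List[Callable[[State], bool]],
             events: List[AsyncEvent], delay: bool) -> bool:
    state: State = {}
    pending = list(events)
    if not delay:
        # Baseline run: every async event finishes before the test checks anything.
        for event in pending:
            event.effect(state)
        pending = []
    for index, step in enumerate(steps):
        # A synchronization point forces the awaited events to run, ending their delay.
        for event in [e for e in pending if e.awaited_before_step == index]:
            event.effect(state)
            pending.remove(event)
        if not step(state):
            return False
    return True


def looks_flaky(steps, events) -> bool:
    # Flaky in this model: the outcome changes when async events are delayed.
    return run_test(steps, events, delay=False) != run_test(steps, events, delay=True)


def load_engines(state: State) -> None:
    state["engine"] = "default"


def check_engine(state: State) -> bool:
    return state.get("engine") == "default"


unsynced = [AsyncEvent("load", load_engines)]
synced = [AsyncEvent("load", load_engines, awaited_before_step=0)]

print(looks_flaky([check_engine], unsynced))  # True
print(looks_flaky([check_engine], synced))    # False
```

In this model an event that the test synchronizes with cannot be delayed past its synchronization point, so only missing synchronization produces a changed outcome.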
Comparison with race-detection techniques
Event race detection techniques DROIDRACER 24, ERVA 17, EVENTRACER 5, and CAFA 16 run on a modified Android framework 4.3 or 4.4. We faced incompatibility issues when trying to evaluate these techniques on the collected apps because many apps target Android frameworks with a higher version (we use Android framework 9.0 in the evaluation). Therefore, we perform a qualitative comparison between FlakeShovel and race-detection techniques.
False positives. Flaky test detection that leverages data race detection techniques depends on an accurate computation of the happens-before relation. However, data race detection techniques capture happens-before relations by monitoring event operations in the UIThread. Event dependencies due to synchronization between the testing framework (like Espresso) and the UIThread (app under test) are not captured. This leads to an underestimation of the happens-before relation. As an example, event e1 can denote a button appearing on the screen due to an async thread completing its computation, event e2 can be a button click issued by the testing framework, and e1 happens-before e2. Such happens-before edges are dropped by race detectors. Flaky test detection leveraging data race detectors may therefore report many false positives. In contrast, FlakeShovel can precisely detect synchronization between the testing framework and the UIThread and so avoids such false positives.
Maintainability. Race-detection techniques often require a modified Android framework to capture happens-before relations and can struggle to keep up with the fast evolution of Android frameworks. By contrast, FlakeShovel can be used with different Android framework versions, since it requires no modification to Android frameworks and the debug mode that it relies on is supported by most Android frameworks.
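The false-positive argument can be made concrete with a small happens-before graph. The sketch below is illustrative Python, not code from any of the cited race detectors; the event names and edge lists are assumptions chosen to mirror the e1/e2 example. Two events are treated as a race candidate when neither is reachable from the other in the graph.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (earlier event, later event)


def reachable(edges: List[Edge], src: str, dst: str) -> bool:
    """Depth-first search over happens-before edges."""
    graph: Dict[str, Set[str]] = {}
    for earlier, later in edges:
        graph.setdefault(earlier, set()).add(later)
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return False


def unordered(edges: List[Edge], a: str, b: str) -> bool:
    # Neither event happens-before the other: a race candidate.
    return not reachable(edges, a, b) and not reachable(edges, b, a)


# Edges visible when only the app side is monitored (hypothetical example).
app_edges = [("async_done", "e1_button_shown"), ("test_start", "e2_click")]

# Extra edge contributed by the testing framework waiting for the button.
framework_edges = [("e1_button_shown", "e2_click")]

print(unordered(app_edges, "e1_button_shown", "e2_click"))                    # True
print(unordered(app_edges + framework_edges, "e1_button_shown", "e2_click"))  # False
```

Without the framework edge the pair looks unordered and would be reported, and with it the pair is ordered and no report is produced.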
7.5. RQ3: Real-world flaky detection
To validate the effectiveness of FlakeShovel in discovering new flaky tests, we ran FlakeShovel on 1444 tests in DroidFlaker that are not marked as flaky. The results show that FlakeShovel is effective in discovering new flaky tests: out of 1444 tests, it detected 245 flaky tests. Figure 4 shows the distribution of the detected flaky tests among apps. FlakeShovel discovered the most flaky tests in the app My Expenses (61 flaky tests) and fewer flaky tests in FireFox. We also collected statistics on the error messages of the failures of these flaky tests, which are shown in Table 3. The most common error message is “Waited for the root of the view hierarchy to have window focus” and the next is “No views in hierarchy found matching”. In other words, most failures are related to a mismatch between GUI operations and app state.
7.5.1. Manually Checking Ground Truth
To further validate the detected results, we manually investigated 20 randomly selected cases among the 245 reported tests to check whether they are actually flaky.
The investigation shows that these tests usually pass and fail only in one or a few corner cases. We leveraged the event orderings discovered by FlakeShovel to identify these corner cases. Two of the authors then studied the effect of those event orders and verified that these tests were indeed flaky. During this manual process the two authors were in agreement, so no disagreement needed to be resolved. We then reported these cases to the corresponding developers, with detailed reports on how to reproduce them.
At the time of writing, 11 of the 20 test cases had been confirmed as flaky tests. Five tests are still under investigation by developers, and we have not heard back from developers for four tests.
7.5.2. Case Study: FirefoxLite – SwitchSearchEngineTest
Figure 5 shows part of the code of the SwitchSearchEngineTest test for the FirefoxLite app. It tests the functionality provided by a broadcast receiver, SearchEngineManager, which initializes and loads different search engines. During the startup process, search engines are loaded by the loadSearchEngines method. This method creates a worker thread (SearchEngines-Load) to initiate the loading process in the background. Loading of search engines is verified (in the main thread) via awaitLoadingSearchEnginesLocked.
Evidently, we have three asynchronous threads: the Espresso thread, the UIThread, and the worker thread. Worker threads in Android are implicitly moved into a background control group (cgroup), where they only get a small percentage of the available CPU 15. In the scenario where the worker thread has not started (it is still waiting to be scheduled) and the Espresso thread tries to access the default search engine, the test fails. Whether this happens depends on various factors such as the percentage of the CPU available and the number of background threads running. FlakeShovel detected the test as flaky by exploring different event execution orders. Since there is no synchronization between the testing thread and the worker thread, FlakeShovel delayed the worker thread in the event schedule space identification phase such that the test failed, and reported the test as flaky.
We identify the following potential limitations to our evaluation.
Event identification. In our system, an event identifier is generated with data from two different threads. In some rare cases we may fail to hook an event with its identifier, e.g., when our system runs extremely slowly, the two pieces of data may not be matched to the event identifier at the same time. To address this, we run our experiments on a system with a light workload and configure the emulator with a larger amount of memory (i.e., 8 GB).
Thread dependency resolution. In event schedule space identification, when too many threads are suspended at the same time, we terminate the execution and resolve thread dependencies to avoid imprecise schedule space identification.
Empirical study. During our manual analysis of flaky tests, at least two of the authors analyzed the log of a test failure for each flaky test to ensure that the root cause is correctly understood.
9. Related Work
Flaky test detection and fixing. Researchers have started to work on flaky test issues. Bell et al. 3 use code coverage differential analysis to identify flaky tests. A test is deemed flaky if it fails in regression testing and its execution does not reach any code that was recently changed by developers. Shi et al. 26 propose an approach to fix order-dependent flaky tests by leveraging passing tests. Shi et al. 25 propose to rerun a test multiple times on each mutant to obtain reliable coverage results, such that the effects of flaky tests on mutation testing can be mitigated. In contrast, FlakeShovel detects concurrency-related flaky tests in Android apps by exploring feasible event execution orders.
Event race detection. Another branch of work that is close to ours is event race detection. Instead of detecting flaky tests, these techniques leverage dynamic and static analysis to detect harmful event races. For instance, DROIDRACER 24, ERVA 17, EVENTRACER 5, CAFA 16, and nAdroid 14 capture happens-before relations among events and infer possible event race errors. In addition, Ozkan et al. 20 propose to detect asynchronous bugs by exploring different execution orders of event handlers in Android apps. These techniques have the potential to be applied to flaky test detection, but face challenges in capturing complete and precise happens-before relations when a test is executed by a testing framework like Espresso. Many false positives can be reported due to incomplete happens-before relations, as explained in Section 7.4. By contrast, FlakeShovel performs a system-level dynamic analysis to capture precise event dependencies and avoid such false positives.
Empirical studies on flaky tests. Multiple studies 23, 10, 27 confirm concurrency as the major cause of flaky tests. Luo et al. 23 performed an empirical analysis of flaky tests in 51 open-source projects. They identified Concurrency and Async wait as the most common causes of flaky tests. They pointed out that the majority of these cases arose because tests do not wait for external resources. Finally, they described the common strategies that developers use to fix flaky tests. In a separate study, Eck et al. 10 surveyed 21 professional developers to classify 200 flaky tests they had fixed. They identified four previously unreported causes of flaky tests, which are also considered difficult to fix. Thorve et al. 27 conducted an empirical study of flaky tests in Android apps. They searched 1000 projects for commits related to flakiness and found only 77 relevant commits from 29 projects. They found that 36% of these commits were due to concurrency-related issues. Fan et al. 13 proposed a hybrid approach to manifesting asynchronous bugs in Android apps. They studied 2097 apps and identified three async programming rules implied by the single-GUI-thread model. Based on these rules, they categorized three fault patterns and used static analysis to locate them in the app. Subsequently, they mapped these program traces to real event sequences to verify these errors.
Flaky tests pose a significant problem in validating mobile apps. Recent studies 23, 10, 27 have shown concurrency to be the most common cause of flaky tests. The uncertainty in a test outcome may arise due to synchronization issues originating from multiple threads interacting in an undesirable manner. In this paper, we presented an approach for detecting flaky tests through a systematic exploration of event orders. We introduced FlakeShovel, a tool to detect flaky tests for Android apps. FlakeShovel explores the space of all realizable execution environments where relevant threads interleave differently.
Due to the lack of a testing benchmark for flaky tests, we created DroidFlaker, the first subject suite for studying GUI test flakiness. DroidFlaker contains 28 widely used Android apps with 2.5k stars on average on GitHub. We applied FlakeShovel to tests from DroidFlaker. The results show that FlakeShovel not only detected known flaky tests but also reported 245 new flaky tests. We believe that our tool and results hold promise for tackling the problem of flaky tests, which is a significant pain point in industrial practice.
[1] Repairing event race errors by controlling nondeterminism. In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), pp. 289–299.
[2] Deploying search based software engineering with Sapienz at Facebook. In Search-Based Software Engineering - 10th International Symposium, SSBSE 2018, Montpellier, France, September 8-9, 2018, Proceedings, pp. 3–45.
[3] DeFlaker: automatically detecting flaky tests. In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE), pp. 433–444.
[4] Input-covering schedules for multithreaded programs. ACM SIGPLAN Notices 48, pp. 677–692.
[5] Scalable race detection for Android applications. ACM SIGPLAN Notices 50, pp. 332–348.
[6] Verifying robustness of event-driven asynchronous programs against concurrency.
[7] Efficient detection of thread safety violations via coverage-guided generation of concurrent tests. In IEEE/ACM International Conference on Software Engineering (ICSE), pp. 266–277.
[8] Automated test input generation for Android: are we there yet? (e). In Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), ASE ’15, pp. 429–440.
[9] Node.fz: fuzzing the server-side event-driven architecture. In European Conference on Computer Systems (Eurosys), pp. 145–160.
[10] Understanding flaky tests: the developer’s perspective. In 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pp. 830–840.
[11] Delay-bounded scheduling. In Proceedings of Symposium on Principles of Programming Languages (POPL), pp. 411–422.
[12] Espresso (2020).
[13] Efficiently manifesting asynchronous programming errors in Android apps. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE).
[14] nAdroid: statically detecting ordering violations in Android applications. In Proceedings of the 2018 International Symposium on Code Generation and Optimization (CGO), pp. 62–74.
[15] Exploring Android thread priority.
[16] Race detection for event-driven mobile applications. ACM SIGPLAN Notices 49.
[17] Automatically verifying and reproducing event-based races in Android apps. In International Symposium on Software Testing and Analysis (ISSTA), pp. 377–388.
[18] Random testing for higher-order, stateful programs. In Proceedings of the Conference on Object-Oriented Programming Systems, Languages, and Applications, OOPSLA, pp. 555–566.
[19] A platform for search-based testing of concurrent software. In PADTAD 2010 - International Workshop on Parallel and Distributed Systems: Testing, Analysis, and Debugging, pp. 48–58.
[20] Systematic asynchrony bug exploration for Android apps. In International Conference on Computer Aided Verification (CAV), pp. 455–461.
[21] Analysis and testing of concurrent programs. Information Sciences and Technologies Bulletin of the ACM Slovakia 5 (3), pp. 1–8.
[22] Dthreads: efficient deterministic multithreading. In SOSP’11 - Proceedings of the 23rd ACM Symposium on Operating Systems Principles, pp. 327–336.
[23] An empirical analysis of flaky tests. In International Symposium on Foundations of Software Engineering (FSE), pp. 643–653.
[24] Race detection for Android applications. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, New York, NY, USA, pp. 316–325.
[25] Mitigating the effects of flaky tests on mutation testing. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), pp. 112–122.
[26] iFixFlakies: a framework for automatically fixing order-dependent flaky tests. In Proceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC-FSE), pp. 545–555.
[27] An empirical study of flaky tests in Android apps. In International Conference on Software Maintenance and Evolution (ICSME), pp. 534–538.
[28] Verifying concurrent programs by memory unwinding. In 21st International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS).
[29] An empirical study of Android test generation tools in industrial cases. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, New York, NY, USA, pp. 738–748.
[30] Maple: a coverage-driven testing tool for multithreaded programs. In Proceedings of the Conference on Object-Oriented Programming Systems, Languages, and Applications, OOPSLA, pp. 485–502.