Crowdtesting is an emerging trend in software testing that accelerates the testing process by attracting online crowdworkers to accomplish various types of testing tasks [1, 2, 3, 4, 5], especially in mobile application testing. It entrusts testing tasks to crowdworkers whose diverse testing environments/platforms, backgrounds, and skill sets can contribute to more reliable, cost-effective, and efficient testing results.
Trade-offs such as “how much testing is enough” are critical yet challenging project decisions in software engineering [6, 7, 8, 9]. Insufficient testing can lead to unsatisfactory software quality, while excessive testing can cause schedule delays and low cost-effectiveness.
Many existing approaches employ either risk-driven or value-based analysis to prioritize or select test cases and minimize test runs [10, 11, 12, 13, 14], in order to effectively plan and manage the testing process. However, none of these is applicable to the emerging crowdtesting paradigm, where managers typically have no control over online crowdworkers’ dynamic behavior and uncertain performance; worse still, no existing method supports crowdtesting management.
Consequently, due to this lack of decision support, project managers in practice typically plan the close time of crowdtesting tasks solely based on personal experience. However, it is very challenging for managers to make reasonable experience-based crowdtesting decisions, because our investigation of real-world crowdtesting data (Section II-C) reveals large variations in the bug arrival patterns of crowdtesting tasks, and in the duration and cost consumed to achieve the same quality level. Furthermore, crowdtesting is typically treated as a black box, and managers’ decisions remain insensitive to the actual testing progress. Hence, managers face significant challenges in deciding when to intervene and close the task (and thereby improve cost-effectiveness).
To address these challenges, this paper explores automated decision support for effectively managing crowdtesting. Specifically, we focus on the dynamic bug arrival data associated with crowdtesting reports, and investigate whether it is possible to determine, at a certain point in time, that a task has achieved a satisfactory bug detection level (e.g., indicated by a percentage), based on the dynamic bug arrival data.
The proposed iSENSE (named for acting like a sensor in the crowdtesting process, raising awareness of the testing progress) applies an incremental sampling technique to process crowdtesting reports arriving in chronological order, organizes them into fixed-size groups as dynamic inputs, and integrates the Capture-ReCapture (CRC) model and the Autoregressive Integrated Moving Average (ARIMA) model to raise awareness of crowdtesting progress. CRC models are widely applied to estimate a total population based on the overlap generated by multiple captures [15, 16, 17, 18]. ARIMA models are commonly used to model time series data and forecast future trends [19, 20, 21, 22]. iSENSE predicts two test completion indicators in an incremental manner: 1) the total number of bugs, predicted with the CRC model, and 2) the required test cost for achieving certain test objectives, predicted with the ARIMA model. To the best of our knowledge, this is the first study to apply an incremental sampling technique in crowdtesting management, so as to better model the bug arrival dynamics.
iSENSE is evaluated using 218 crowdtesting tasks from one of the largest Chinese crowdtesting platforms. Results show that the median errors of iSENSE’s predictions (of total bugs and required cost) are both below 3%, with about 10% standard deviation during the second half of the crowdtesting process.
We further demonstrate its applications through two typical decision scenarios: one automating the task closing decision, and the other semi-automating the task closing trade-off analysis. The results show that decision automation using iSENSE provides managers with greater opportunities to achieve cost-effectiveness gains in crowdtesting. Specifically, a median of 100% of bugs can be detected with 30% of cost saved based on the automated close prediction.
The contributions of this paper are as follows:
Empirical observations on crowdtesting bug arrival patterns based on an industrial dataset, which motivated this study and can motivate future studies.
Integration of incremental sampling technique to model crowdtesting bug arrival data.
Development of a CRC-based model for predicting the total number of bugs, and an ARIMA-based model for predicting the required cost for achieving certain test objectives.
The iSENSE approach for automated decision support in crowdtesting management, including automated task closing decisions and semi-automated task closing trade-off analysis.
Evaluation of iSENSE on 46,434 reports of 218 crowdtesting tasks from one of the largest crowdtesting platforms in China, with promising results. (The URL of the iSENSE website, with the experimental dataset, source code, and detailed experimental results, is blinded for review.)
II Background and Motivation
In general crowdtesting practice, managers prepare the crowdtesting task (including the software under test and test requirements) and distribute it on an online crowdtesting platform. Crowdworkers can sign up for tasks of interest and submit test reports, typically summarizing the test input, test steps, test results, etc.
The crowdtesting platform receives and manages the crowdtesting reports submitted by the crowdworkers. Project managers then inspect and verify each report for their tasks, manually or with automated tool support (e.g., [23, 4] for automatic report labeling). Generally, each report is characterized by two attributes: 1) whether it contains a valid bug (in our experimental platform, a report corresponds to either 0 or 1 bug, and no report contains more than 1 bug); 2) if yes, whether it is a duplicate of a bug previously reported by other crowdworkers.
In the remainder of this paper, unless otherwise specified, “bug” or “unique bug” means that the corresponding report contains a bug and the bug is not a duplicate of previously submitted ones.
II-B BigCompany Dataset
Our experimental dataset is collected from the BigCompany (blinded for review) crowdtesting platform, one of the largest platforms in China. The dataset contains all tasks completed between May 1st, 2017 and July 1st, 2017. In total, there are 218 tasks with 46,434 submitted reports. The minimum, average, and maximum numbers of crowdtesting reports (and unique bugs) per task are 101 (6), 213 (26), and 876 (89), respectively.
II-C Observations From A Pilot Study
To understand the bug arrival patterns of crowdtesting, we conduct a pilot study analyzing three bug detection metrics: bug detection speed, bug detection cost, and bug detection rate.
For each task, we first identify the time when K% of bugs have been detected, treating the number of historically detected bugs as the total number; K ranges from 10 to 100. The bug detection speed for a task is then derived as the duration (measured in hours) between its open time and the time it receives K% of bugs. Next, the bug detection cost for a task is derived as the number of reports submitted by the time K% of bugs are reached. (Note that the primary cost in crowdtesting is the reward to crowdworkers, and their submitted reports are usually equally paid [2, 5]; hence, the number of received reports is treated as the consumed cost for simplicity in this study.)
To examine the bug detection rate, we break the crowdtesting reports of each task into 10 equal-sized groups, in chronological order. The rate for each group is derived as the ratio between the number of unique bugs and the number of reports in the corresponding group.
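The per-group rate computation can be sketched as follows. This is an illustrative implementation, not the paper's actual code; the report representation (a `bug_tag` field that is `None` for bug-free reports and shared among duplicates) is assumed.

```python
# Illustrative sketch of the pilot-study metric: split a task's reports into
# 10 equal-sized chronological groups and compute the bug detection rate
# (unique bugs / reports) for each group. Field names are assumptions.

def detection_rates(reports, n_groups=10):
    """reports: chronologically ordered list of dicts with key 'bug_tag'
    (None if the report contains no bug; duplicates share the same tag)."""
    size = len(reports) // n_groups
    seen, rates = set(), []
    for g in range(n_groups):
        chunk = reports[g * size:(g + 1) * size]
        unique = 0
        for r in chunk:
            tag = r["bug_tag"]
            if tag is not None and tag not in seen:  # first sighting of this bug
                seen.add(tag)
                unique += 1
        rates.append(unique / len(chunk))
    return rates
```

A task whose unique bugs all arrive early would yield rates close to 1 in the first groups and near 0 afterwards, matching the decreasing-rate pattern reported below.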
In addition, for each crowdtesting task, we also compute the percentage of accumulated bugs (denoted as the bug arrival curve) over the first K reports, where K ranges from 1 to the total number of reports.
Next, we present two general bug arrival patterns derived from the pilot study.
II-C1 Large Variation in Bug Arrival Speed and Cost
In Figure 1(a), we first present four example bug arrival curves randomly selected from the 218 crowdtesting tasks, illustrating the diversity of bug arrival curves among different tasks.
In general, there is large variation in bug arrival speed and cost. Figures 1(b) and 1(c) show the distributions of bug detection speed and bug detection cost for all tasks. Clearly, to achieve the same K% of bugs, there is large variation in both metrics, particularly for larger K%. For example, when detecting 90% of bugs, the bug detection speed ranges from 3 hours to 149 hours, while the bug detection cost ranges from 27 to 435 reports.
II-C2 Decreasing Bug Arrival Rates Over Time
Figure 1(d) shows the bug detection rate of the 10 break-down groups across all tasks. The bug detection rate decreases sharply during the crowdtesting process, signifying that the cost-effectiveness of crowdtesting drops dramatically in the later part of the process.
In addition, Figure 1(a) also shows that, during the later part of a crowdtesting task, there is usually a flat area in the bug arrival curve, denoting that no new bugs are submitted. This further suggests a potential opportunity for automated closing decision support to increase the cost-effectiveness of crowdtesting.
II-C3 Need for Automated Decision Support
In addition, an unstructured interview (details are presented on the iSENSE website) was conducted with the managers of BigCompany, with the findings shown below.
Project managers commented on the black-box nature of the crowdtesting process. While they constantly receive arriving reports, they often have no clue about the number of bugs remaining undetected, or the cost required to find those additional bugs.
Because they cannot see what is going on in the crowdtesting process, its management is largely guesswork, frequently resulting in blind decisions in task planning and management.
In summary, because there are large variations in bug arrival speed and cost (Section II-C1), current decision making is largely guesswork, resulting in low cost-effectiveness of crowdtesting (Section II-C2). A more effective alternative would be to dynamically monitor the crowdtesting reports and automatically alert managers or close tasks when certain pre-specified test objectives are met (e.g., 90% of bugs have been detected), to save unnecessary cost wasted on later-arriving reports.
Furthermore, current practice suggests a practical need to empower managers with greater visibility into the crowdtesting process (Section II-C3), and ideally to raise their awareness of task progress (i.e., the number of remaining bugs, and the cost required to meet certain test objectives), thus facilitating their decision making.
This paper intends to address these practical challenges by developing a novel approach for automated decision support in crowdtesting management, so as to improve cost-effectiveness of crowdtesting.
Figure 2 presents an overview of iSENSE, which consists of three main steps. First, iSENSE adopts an incremental sampling process to model crowdtesting reports: it converts the raw reports, which arrive chronologically, into groups, and generates a bug arrival lookup table to characterize bug arrival speed and diversity. Then, iSENSE integrates two models, CRC and ARIMA, to predict the total number of bugs contained in the software and the required cost for achieving certain test objectives, respectively. Finally, iSENSE applies these estimates to support two typical crowdtesting decision scenarios: automating the task closing decision, and semi-automating the task closing trade-off analysis. We present each of these steps in more detail below.
III-A Preprocess Data Based on Incremental Sampling Technique
The incremental sampling technique is a composite sampling and processing protocol. Its objective is to obtain a single sample for analysis whose analytic concentration is representative of the decision unit. It improves the reliability and defensibility of sampling data by reducing variability compared to conventional discrete sampling strategies.
Since the submitted crowdtesting reports arrive in chronological order (Section II-A), whenever smpSize reports are received (smpSize is an input parameter), iSENSE treats them as a representative group reflecting the multiple parallel crowdtesting sessions. Recall from Section II-A that each report is characterized by: 1) whether it contains a bug; 2) whether it is a duplicate of previously submitted reports; if not, it is marked with a new tag; if so, it is marked with the same tag as its duplicates. During the crowdtesting process, we maintain a two-dimensional bug arrival lookup table to record this information (see Table I).
After each sample is received, we first add a new row (say row i) to the lookup table, and then go through each report in the sample. Reports not containing a bug are ignored. Otherwise, if a report is marked with the same tag as an existing unique bug (say column j), we record 1 in row i, column j. If it is marked with a new tag, we add a new column to the lookup table (say column k) and record 1 in row i, column k. Empty cells in row i are filled with 0.
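The table-building procedure above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the input representation (a chronological list of bug tags, `None` for bug-free reports, equal tags for duplicates) is an assumption.

```python
# Minimal sketch of maintaining the bug arrival lookup table: one row per
# sample of smpSize reports, one column per unique bug; cell (i, j) is 1
# iff bug j was reported in sample i, and 0 otherwise.

def build_lookup_table(reports, smp_size):
    """reports: chronological list of bug tags (None = no bug;
    equal tags = duplicate bugs). Returns (table, column_tags)."""
    table, columns = [], []  # columns[j] holds the tag of bug j
    complete = len(reports) - len(reports) % smp_size  # only full samples
    for start in range(0, complete, smp_size):
        row = [0] * len(columns)
        for tag in reports[start:start + smp_size]:
            if tag is None:
                continue  # report contains no bug: ignore
            if tag not in columns:  # first capture of a new bug: new column
                columns.append(tag)
                row.append(0)
            row[columns.index(tag)] = 1
        # pad earlier rows with 0 so every row spans all current columns
        for r in table:
            r.extend([0] * (len(columns) - len(r)))
        table.append(row)
    return table, columns
```

Each row then corresponds to one capture in the CRC terminology of the next subsection.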
III-B Predict Total Bugs Using CRC
III-B1 Background about CRC
Existing CRC models can be categorized into four types according to the bug detection probability (identical vs. different) and the crowdworkers’ detection capability (identical vs. different), as shown in Table II.
Model M0 assumes that all bugs and crowdworkers have the same detection probability. Model Mh assumes that bugs have different probabilities of being detected. Model Mt assumes that crowdworkers have different detection capabilities. Model Mth assumes different detection probabilities for different bugs and different crowdworkers.
Table II. Four basic CRC models (estimators in parentheses):

                              Crowdworker’s detection capability
                              Identical          Different
  Bug detection   Identical   M0 (M0)            Mt (MtCH)
  probability     Different   Mh (MhJK, MhCH)    Mth (Mth)
Based on the four basic CRC models, various estimators have been developed. According to a recent systematic review, MhJK, MhCH, and MtCH are the three most frequently investigated and most effective estimators in software engineering. In addition, we investigate two more estimators (M0 and Mth) to ensure that all four basic models are covered.
III-B2 How to Use in iSENSE
iSENSE treats each sample as a capture (or recapture). Based on the bug arrival lookup table, it then predicts the total number of bugs in the software using a CRC estimator. This section first demonstrates how it works with the Mth estimator.
Table III. Variables of the Mth estimator, computed from the bug arrival lookup table:

  Var.   Meaning                                   Computation based on bug arrival lookup table           Example value
  N      Predicted total number of bugs            (output of the estimator)                               predicted value: 24
  D      Actual number of bugs captured so far     Number of columns                                       12
  t      Number of captures                        Number of rows                                          6
  n_j    Number of bugs detected in each capture   Number of cells with 1 in row j                         3, 2, 2, 5, 6, 4
  f_k    Number of bugs captured exactly k times   Count the cells with 1 in each column (denote it c_j);  f1=7, f2=2, f3=2, f5=1
         in all captures                           f_k is the number of columns with c_j = k
The equations for estimating the total bugs with the other four estimators (M0, MtCH, MhCH, and MhJK) can be found in the related work; the value assignments for the variables are the same as for Mth. Due to the space limit, we put the detailed illustration on our website.
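As a concrete illustration of how such an estimator consumes the lookup table, the sketch below implements the first-order jackknife estimator for model Mh (due to Burnham and Overton), a common member of the MhJK family: N_hat = D + ((t - 1) / t) * f1. This is only an illustrative stand-in; the paper's actual MhJK and Mth equations are those in the cited work.

```python
# Illustrative CRC estimator: first-order jackknife for model Mh
# (Burnham & Overton), N_hat = D + ((t - 1) / t) * f1, where D is the
# number of distinct bugs captured so far, t the number of captures, and
# f1 the number of bugs seen in exactly one capture. An assumption-level
# sketch, not the paper's exact MhJK/Mth implementation.

def jackknife_total_bugs(table):
    """table: bug arrival lookup table, rows = captures, columns = bugs."""
    t = len(table)                        # t: number of captures
    d = len(table[0])                     # D: distinct bugs captured so far
    f1 = sum(1 for j in range(d)
             if sum(row[j] for row in table) == 1)  # singleton bugs
    return d + (t - 1) / t * f1
```

Intuitively, many singleton bugs (f1 large) suggest many bugs remain uncaptured, so the estimate is pushed above the observed count D.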
III-C Predict Required Cost Using ARIMA
III-C1 Background about ARIMA
The ARIMA model extends the ARMA (Autoregressive Moving Average) model by allowing non-stationary time series to be modeled, i.e., time series whose statistical properties, such as mean and variance, are not constant over time.
A time series is said to be autoregressive moving average (ARMA) in nature with parameters (p, q) if it takes the following form:

z_t = φ_1 z_{t-1} + … + φ_p z_{t-p} + a_t − θ_1 a_{t-1} − … − θ_q a_{t-q}

where z_t is the current stationary observation, z_{t-i} for i = 1, …, p are the past stationary observations, a_t is the current error, and a_{t-i} for i = 1, …, q are the past errors. If the original time series y_t is non-stationary, then d differences can be taken to transform it into a stationary one, z_t. These differences can be viewed as the transformation z_t = (1 − B)^d y_t, where B is the backshift operator (B y_t = y_{t-1}). When this differencing operation is performed, it converts an ARMA model into an ARIMA (Autoregressive Integrated Moving Average) model with parameters (p, d, q).
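The differencing step that connects ARMA to ARIMA is simple enough to show directly. The sketch below applies the transformation (1 − B)^d to a series d times; it is a generic illustration, not code from the paper.

```python
# Minimal sketch of the differencing transformation z_t = (1 - B)^d y_t,
# where B is the backshift operator (B y_t = y_{t-1}). Applying it d times
# reduces ARIMA(p, d, q) fitting to ARMA(p, q) fitting on the differenced
# series.

def difference(series, d=1):
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series
```

For example, a quadratic (hence non-stationary) series becomes constant, i.e., stationary, after two differences.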
III-C2 How to Use in iSENSE
Figure 3 demonstrates how ARIMA is applied to predict the future trend of bug arrival. We treat the reports of each sample as a window, and obtain the number of unique bugs submitted in each sample from the bug arrival lookup table. We then use the former trainSize windows to fit the ARIMA model and predict the number of bugs for the subsequent predictSize windows. When a new window is formed by newly-arrived reports, we advance the window by 1 and obtain newly predicted results.
Suppose one wants to know how much extra cost is required to achieve X% of bugs. Since the predicted total number of bugs is known (Section III-B), we can determine how many more bugs should be detected to meet the test objective (i.e., X% of bugs); suppose it is Y bugs. Based on the ARIMA prediction, we then determine when these Y bugs will have been received, say after c extra reports. We then take c as the required cost for meeting the test objective.
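The cost derivation above can be sketched as follows. Function and parameter names are hypothetical; the per-window forecast is assumed to come from the fitted ARIMA model.

```python
# Hedged sketch of the required-cost computation: given the CRC-predicted
# total bugs, the bugs detected so far, a target X%, and a per-window
# forecast of newly arriving unique bugs (e.g., from ARIMA), count how many
# extra windows, and thus extra reports, are needed. All names are
# illustrative, not the paper's actual interface.

def required_cost(total_pred, detected, target_pct, forecast, smp_size):
    y = target_pct / 100.0 * total_pred - detected  # Y: bugs still needed
    if y <= 0:
        return 0  # objective already met
    cum = 0.0
    for window, bugs in enumerate(forecast, start=1):
        cum += bugs
        if cum >= y:
            return window * smp_size  # c: extra reports required
    return None  # target not reachable within the forecast horizon
```

For instance, with 100 predicted bugs, 85 detected, a 90% target, and a forecast of 2 new bugs per window of 10 reports, three more windows (30 reports) would be needed.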
III-D Apply iSENSE to Two Decision Scenarios in Crowdtesting
To demonstrate the usefulness of iSENSE, we generalize two typical decision scenarios in crowdtesting management, and illustrate its application to each scenario.
III-D1 Automating Task Closing Decision
The first scenario that can benefit from iSENSE’s prediction of total bugs (Section III-B) is the automation of dynamic task closing decisions.
As soon as a crowdtesting task begins, iSENSE can be applied to monitor the actual bug arrival, constantly update the bug arrival lookup table, and keep track of the percentage of bugs detected (i.e., the ratio of the number of bugs submitted so far to the predicted total bugs).
In this scenario, different task close criteria can be customized in iSENSE so that it automatically closes the task when the specified criterion is met. For instance, a simple criterion is to close the task when 100% of bugs have been detected in the submitted reports. Under this criterion, when iSENSE observes that 100% of bugs have been received and the prediction has remained unchanged for two successive captures, it takes the time when the last report was received as the close time, and automatically closes the crowdtesting task at run time. The restriction of two successive captures ensures the stability of the prediction: our investigation reveals that without it, iSENSE occasionally performs quite badly. (We also experimented with other restrictions, i.e., 1 to 5 successive captures; the results show that a restriction of 2 obtains relatively good and stable performance, so we only present these results due to the space limit.)
iSENSE supports flexible customization of the close criteria. For example, a task manager can choose to close his/her tasks when 80% of bugs have been detected; iSENSE will then monitor and close the task according to this customized close criterion.
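The closing rule described above can be sketched as a small predicate. The interfaces are assumptions for illustration; the prediction history would come from the CRC estimator, one value per capture.

```python
# Sketch of the automated-closing rule: close the task when the detected
# percentage reaches the customized criterion AND the CRC prediction of
# total bugs has stayed unchanged for two successive captures (the
# stability restriction discussed above). Names are illustrative.

def should_close(pred_history, detected, criterion_pct=100):
    """pred_history: CRC total-bug predictions, one per capture (latest last).
    detected: number of unique bugs received so far."""
    if len(pred_history) < 2 or pred_history[-1] != pred_history[-2]:
        return False  # prediction not yet stable across two captures
    return detected >= criterion_pct / 100.0 * pred_history[-1]
```

In a monitoring loop, this predicate would be evaluated after each capture, and the arrival time of the last report taken as the close time once it returns True.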
III-D2 Semi-Automation of Task Closing Trade-off Analysis
The second scenario, which benefits from iSENSE’s prediction of required cost (Section III-C), is decision support for task closing trade-off analysis.
For example, suppose 90% of bugs have been reported at a certain time; iSENSE can simultaneously reveal the estimated cost required to detect an additional X% of bugs (e.g., 5%) and reach a higher bug detection level. Such cost-benefit insights give managers more confidence in making informed, actionable decisions: close immediately, if the required cost is too high to be worthwhile for X% more detected bugs, or wait a little longer, if the required cost is acceptable and the additional X% of detected bugs is desired.
IV Experiment Design
Iv-a Research Questions
Four research questions are formulated to investigate the performance of the proposed iSENSE.
The first two research questions are centered around accuracy evaluation of the prediction of total bugs and required cost. Presumably, to support practical decision making, these underlying predictions should achieve high accuracy.
RQ1: To what degree can iSENSE accurately predict total bugs?
RQ2: To what degree can iSENSE accurately predict required cost to achieve certain test objectives?
The next two research questions are focused on investigating the effectiveness of applying iSENSE in the two typical scenarios (Section III-D), in which iSENSE is expected to alleviate current practices through automated and semi-automated decision support in managing crowdtesting tasks.
RQ3: To what extent can iSENSE help to increase the effectiveness of crowdtesting through decision automation?
RQ4: How can iSENSE be applied to facilitate trade-off decisions about cost-effectiveness?
IV-B Evaluation Metrics
We measure the accuracy of prediction based on relative error, which is the most commonly-used measure for accuracy [23, 30, 31]. It is applied in the prediction of total number of bugs (Section V-A) and required cost (Section V-B).
We measure the cost-effectiveness of close prediction (Section V-C) based on two metrics, i.e. bug detection level (i.e. %bug) and cost reduction (i.e. %reducedCost).
%bug is the percentage of bugs detected by the predicted close time, where we treat the number of historically detected bugs as the total number. The larger the %bug, the better.
%reducedCost is the percentage of cost saved by the predicted close time. To derive this metric, we first obtain the percentage of reports submitted by the close time, treating the number of historically submitted reports as the total number. We take this as the percentage of consumed cost, and %reducedCost is 1 minus the percentage of consumed cost. The larger the %reducedCost, the better.
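The three evaluation measures defined above can be written out directly. This is a plain restatement of the definitions with assumed variable names, not the evaluation harness itself.

```python
# Sketch of the evaluation metrics as defined above (names assumed):
# relative error of a prediction against ground truth, %bug, and
# %reducedCost at the predicted close time, with historical totals
# taken as the ground truth.

def relative_error(predicted, actual):
    return (predicted - actual) / actual

def pct_bug(bugs_at_close, total_bugs):
    return bugs_at_close / total_bugs * 100

def pct_reduced_cost(reports_at_close, total_reports):
    # 1 minus the percentage of consumed cost, expressed as a percentage
    return (1 - reports_at_close / total_reports) * 100
```

For example, closing a 100-report task after 70 reports with 90 of its 100 bugs found gives %bug = 90 and %reducedCost = 30.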
IV-C Experimental Setup
For RQ1, we set up 19 checkpoints in the range of receiving 10% to 100% of reports, with an increment of 5%. At each checkpoint, we obtain the estimated total number of bugs at that time (see Section III-B). Based on the ground truth of actual total bugs, we then compute the relative error (Section IV-B) in predicting the total bugs for each task.
For RQ2, we also set 19 checkpoints as in RQ1. Unlike RQ1, the checkpoints for RQ2 are based on the percentage of detected bugs, i.e., from 10% to 100% of bugs with an increment of 5%. At each checkpoint, we predict the test cost required (Section III-C) to achieve an additional 5% of bugs, i.e., the target corresponding to the next checkpoint. For example, at the checkpoint when 80% of bugs have been detected, we predict the cost required to achieve 85% of bugs. Based on the ground truth of actual required cost, we then compute the relative error (Section IV-B) in predicting the required cost for each task.
For RQ3, we analyze the effectiveness of task closing automation with respect to five sample close criteria, i.e., closing the task when 80%, 85%, 90%, 95%, or 100% of bugs have been detected, respectively. These five close criteria are consistent with commonly-used test completion criteria in software testing, and we believe similar principles can be adopted in crowdtesting as well.
For RQ4, we use several illustrative cases from experimental projects to show how iSENSE can help trade-off decisions.
IV-D Baselines

To further evaluate the advantages of the proposed iSENSE, we compare it with two baselines.
Rayleigh: This baseline is adopted from one of the most classical models for predicting dynamic defect arrival in software measurement. It assumes that the defect arrival data follow the Rayleigh probability distribution. In this experiment, we fit a specific Rayleigh curve (i.e., the derived Rayleigh model) to each task’s bug arrival data, and then use the derived model to predict the total bugs as well as the future bug trend (and further obtain the required cost for a certain test objective).
Naive: This baseline employs naive empirical results, i.e., medians obtained from the experimental dataset. More specifically, for the prediction of total bugs, it uses the median total bugs across the 218 experimental tasks. For the required cost, it uses the median required cost across the 218 experimental tasks at the corresponding checkpoint (Section IV-C).
IV-E Parameter Tuning
For each CRC estimator, the input parameter is smpSize, which specifies how many reports are considered in each capture. To determine its value, we randomly select 2/3 of the crowdtesting tasks for tuning, and repeat the tuning 1000 times to alleviate randomness.
In each tuning run, for every candidate parameter value (we experiment with values from 2 to 30) and each checkpoint, we obtain the median relative error of the total-bug prediction (as shown in Table IV) across the selected tasks. Then, for each candidate value, we sum the absolute values of the relative errors over all checkpoints, and treat the value yielding the smallest sum as the best one. Finally, we use the parameter value that appears most frequently across the 1000 random runs. The tuned smpSize values are 8 for M0, 8 for MtCH, 6 for MhCH, 3 for MhJK, and 8 for Mth.
For the ARIMA model, we use the same method to decide the best parameter values. The tuned values are: smpSize = 3, trainSize = 10, and p, q, and d are 5, 1, and 0, respectively.
V-A Answers to RQ1: Accuracy of Total Bugs Prediction
Table VI. Mann-Whitney U Test p-values for the relative error of predicted total bugs between each pair of methods, at checkpoints 10% through 100% (in 5% increments); * marks p < 0.05:

  iSENSE vs. Rayleigh: 0.00* at all 19 checkpoints.
  iSENSE vs. Naive:    0.02*, 0.81, 0.43, 0.01*, and 0.00* at the remaining 15 checkpoints.
  Rayleigh vs. Naive:  0.00*, 0.00*, 0.00*, 0.46, 0.97, 0.53, 0.10, and 0.00* at the remaining 12 checkpoints.
Table IV shows the median and standard deviation of the relative error of the predicted total bugs for all five CRC estimators at all checkpoints. We highlight (in italic red font) the two best-performing methods at each checkpoint. Due to the space limit, we only present the detailed performance of MhJK (the worst estimator) and Mth (the best estimator) in Figures 4 and 5.
From Table IV and Figure 5, we can see that the predicted total number of bugs becomes closer to the actual total (i.e., the relative error decreases) towards the end of the tasks. Among the five estimators, Mth and MhCH have the smallest median relative error for most checkpoints, but the variance of MhCH is much larger than that of Mth. Hence, the Mth estimator is preferable because of its relatively higher stability and more accurate prediction of the total number of bugs. In the following experiments, unless otherwise mentioned, the results refer to those generated by iSENSE with the Mth estimator, due to the space limit.
Comparison With Baselines: Table V compares the prediction accuracy of iSENSE and the two baselines in terms of the median and standard deviation of the relative error. The columns correspond to different checkpoints, and the best performer at each checkpoint is highlighted. Table VI summarizes the results of the Mann-Whitney U Test on the relative error of predicted total bugs between each pair of methods. It shows that iSENSE significantly outperforms the two baselines (p-value < 0.05), especially during the later stages (i.e., after the 40% checkpoint) of the crowdtesting tasks.
Answers to RQ1: iSENSE with the best estimator, Mth, is surprisingly accurate in predicting the total bugs in crowdtesting, and significantly outperforms the two baselines. More specifically, the median of the predicted total bugs equals the ground truth (i.e., the median relative error is 0). Better yet, the standard deviation is about 10% to 20% during the latter half of the process.
V-B Answers to RQ2: Accuracy of Required Cost Prediction
Table VIII. Mann-Whitney U Test p-values for the relative error of predicted required cost between each pair of methods, at checkpoints 10% through 100% (in 5% increments); * marks p < 0.05:

  iSENSE vs. Rayleigh: 0.00* at all 19 checkpoints.
  iSENSE vs. Naive:    1.00, 0.97, 1.00, 1.00, 1.00, 1.00, 0.94, 1.00, 0.96, 0.01*, 0.00*, 0.00*, 0.11, 0.01*, and 0.00* at the remaining 5 checkpoints.
  Rayleigh vs. Naive:  0.00* at the first 18 checkpoints, and 0.12 at the last.
Table VII summarizes the median and standard deviation of the relative error of the predicted required cost for iSENSE and the two baselines, with columns corresponding to different checkpoints. We highlight the best-performing method at each checkpoint. Table VIII additionally presents the results of the Mann-Whitney U Test between each pair.
As indicated by the decreasing median relative error in Table VII, the prediction of required cost becomes increasingly accurate at later checkpoints. For example, after the 50% checkpoint, the median relative error of the predicted cost is below 3%, with about 15% standard deviation. This implies that iSENSE can effectively predict the cost required to reach targeted test objectives.
Comparison With Baselines: The median and standard deviation of the relative error for the two baselines are worse than those of iSENSE during the second half of the task process. As observed in Table VIII, the difference between the proposed iSENSE and the two baselines is significant during the second half of the crowdtesting process (p-value < 0.05). This further signifies the advantages of the proposed iSENSE.
Answers to RQ2: iSENSE can predict the required test cost with a median relative error of about 3% in the later stages of crowdtesting (i.e., after the 50% checkpoint).
V-C Answers to RQ3: Effectiveness of Task Closing Automation
Let us first look at the last row of Table IX, which reflects a close criterion of 100% of bugs being detected (i.e., the most commonly-used setup). The results indicate that a median of 100% of bugs can be detected with a median cost reduction of 29.9%. This suggests roughly 30% additional cost-effectiveness for managers equipped with a decision automation tool such as iSENSE to monitor and close tasks automatically at run time. The reduced cost is tremendous when considering the large number of tasks delivered on a crowdtesting platform. In addition, the standard deviation is relatively low, further signifying the stability of iSENSE in close automation.
We then shift our focus to the other four customized close criteria (i.e., 80%, 85%, 90%, and 95% of detected bugs). For each close criterion, the median %bug generated by iSENSE is very close to the targeted criterion, with a small standard deviation. Across these close criteria, 36% to 52% of cost can be saved, which further signifies the effectiveness of iSENSE.
We also notice that the median %bug is slightly larger than the customized close criterion. For example, if the project manager hopes to close the task when 90% of bugs are detected, a median of 92% of bugs have been submitted by the predicted close time. This implies that, in most cases, the close prediction produced by iSENSE does not carry the risk of insufficient testing. Furthermore, the project managers we talked with considered detecting slightly more bugs (even with less cost reduction) always better than detecting fewer bugs (with more cost reduction), because %bug acts as the constraint, while %reducedCost is only a bonus.
We also analyzed the reason for this phenomenon. It is mainly because, before suggesting a close, our approach requires that the predicted total number of bugs remain unchanged for two successive captures (Section III-D1). This restriction alleviates the risk of an insufficient percentage of detected bugs. In addition, we treat a sample of reports as the unit of prediction, which can also push the suggested close time slightly later than the customized close time.
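The closing rule described above can be sketched as follows. The function name, its arguments, and the stability check over the history of total-bug estimates are our simplified illustration of the Section III-D1 rule, not the exact implementation:

```python
def should_close(pred_totals, bugs_detected, criterion=0.9):
    """Suggest closing a task when the predicted total bug count has
    stabilized (unchanged over two successive captures) and the detected
    fraction meets the close criterion.

    pred_totals: history of total-bug estimates, one per capture
                 (a sample of reports).
    bugs_detected: number of distinct bugs detected so far.
    """
    if len(pred_totals) < 2 or pred_totals[-1] != pred_totals[-2]:
        return False  # estimate has not stabilized yet
    return bugs_detected / pred_totals[-1] >= criterion

# The estimate stabilizes at 50 total bugs and 46 have been detected:
assert should_close([60, 55, 50, 50], 46, criterion=0.9)   # 92% >= 90%
assert not should_close([60, 55, 50], 46, criterion=0.9)   # not stable yet
```

Note how this rule reproduces the observed behavior: by waiting for two equal successive estimates, the task may close slightly after the criterion is first met, so the achieved %bug tends to overshoot the target a little.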
Answers to RQ3: The automation of task closing by iSENSE can make crowdtesting more cost-effective, i.e., a median of 100% of bugs can be detected with 30% of the cost saved.
V-D Answers to RQ4: Trade-off Decision Support
Considering the large number of tasks under management at the same time, a typical trade-off scenario is to strategically allocate a limited testing budget among the tasks. To reflect such a trade-off context, we randomly pick a time and slice the experimental dataset to retrieve all tasks under testing at that time, then examine the cost-effectiveness of further testing on those tasks.
Figure 7 presents four trade-off analysis examples across six tasks (i.e., P1–P6), generated by repeating the above analysis at four different time points (i.e., time1 through time4 in sequential order). The y-axis denotes the next test objective to achieve, while the x-axis shows the predicted cost required to achieve that objective.
Generally speaking, the crowdtesting tasks in the right area are less cost-effective than those in the left area. For example, at time3, P6 is estimated to require an additional 14 units of cost to achieve the 90% test objective. A manager facing budget constraints, or trying to improve cost-effectiveness, could choose to close P6 at time3 because it is the least cost-effective of all tasks. As another example, at time1, P2 is estimated to require only 3 additional units of cost to reach the next objective (i.e., 70%). This suggests that investing 3 extra units of cost is highly worthwhile to raise its quality to the next objective.
To facilitate this kind of trade-off analysis on which task to close and when to close it, we design two decision parameters as inputs from the decision maker: 1) a quality benchmark, which sets the minimal threshold for the bug detection level, e.g., the horizontal red lines in Figure 7; 2) a cost benchmark, which sets the maximal threshold for the test cost to achieve the next objective, e.g., the vertical blue lines in Figure 7.
These two benchmarks split the tasks into four regions at each slicing time (as indicated by the four boxes in each subfigure of Figure 7). Each region suggests different insights on test sufficiency as well as the cost-effectiveness of further testing, which can be used as heuristics to guide actionable decision-making at run time. More specifically:
Lower-Left (Continue): Tasks in this region are low-hanging fruit, requiring relatively little cost to achieve the next test objective while their quality level is not yet acceptable; this is the most cost-effective option, and testing should definitely continue.
Lower-Right (Drill down): Tasks here have not met the quality benchmark, so continued testing is preferred even though they require significantly more cost to achieve the quality objective. This likely suggests that the task is either difficult to test or that current crowdworker participation is insufficient. Therefore, managers may want to drill down into these tasks and see whether more testing guidelines or worker incentives are needed.
Upper-Left (Think twice): Tasks here already meet their quality benchmark and could possibly reach the next higher quality level with little additional cost investment. Managers should think twice before taking action.
Upper-Right (Close): Tasks in this region require relatively more cost to reach the next test objective, and their current bug detection level is already high. Considering cost-effectiveness, it is practical to close them.
Note that, the two benchmarks in Figure 7 can be customized according to practical needs.
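The four-region heuristic can be sketched as a simple classifier. The function and parameter names below are ours, chosen for illustration:

```python
def region(pct_bugs, next_cost, quality_benchmark, cost_benchmark):
    """Map a task onto one of the four trade-off regions of Figure 7.

    pct_bugs: current bug detection level (fraction of estimated total).
    next_cost: predicted cost to reach the next test objective.
    """
    above_quality = pct_bugs >= quality_benchmark   # above horizontal line
    above_cost = next_cost >= cost_benchmark        # right of vertical line
    if above_quality and above_cost:
        return "Close"        # quality met, further testing is costly
    if above_quality:
        return "Think twice"  # quality met, next level is cheap
    if above_cost:
        return "Drill down"   # quality unmet, but testing is costly
    return "Continue"         # low-hanging fruit

# With an 85% quality benchmark and a cost benchmark of 10 units:
assert region(0.92, 14, 0.85, 10) == "Close"     # e.g., P6 at time3
assert region(0.70, 3, 0.85, 10) == "Continue"   # e.g., P2 at time1
```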
Answers to RQ4: iSENSE provides practical insights to help managers make trade-off decisions about which task to close and when to close it, based on two benchmark parameters and a set of decision heuristics.
VI-A Best CRC Estimator for Crowdtesting
In traditional software inspection or testing activities, MhJK, MhCH, and MtCH have been recognized as the most effective estimators of total bugs [33, 34, 17, 15, 18, 35, 36]. In crowdtesting, however, the most comprehensive estimator, Mth (see Section III-B1), outperforms the other CRC estimators. This is reasonable because crowdtesting is conducted by a diverse pool of crowdworkers with different levels of capability, experience, and testing devices, and the bugs in the software under test also vary greatly in type, cause, and detection difficulty. In such cases, Mth, which assumes different detection probabilities for both bugs and crowdworkers (see Section III-B1), is expected to be the most suitable estimator for crowdtesting.
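Mth itself requires an iterative maximum-likelihood fit, but the flavor of capture-recapture estimation can be illustrated with the much simpler MhCH (Chao 1987) estimator mentioned above, which needs only the per-bug capture frequencies. This sketch is ours and is not the paper's Mth implementation:

```python
from collections import Counter

def mh_chao(capture_counts):
    """MhCH (Chao 1987) estimate of the total bug population.

    capture_counts: for each distinct observed bug, how many crowdworkers
    reported it (i.e., its capture frequency).
    """
    freq = Counter(capture_counts)
    s_obs = len(capture_counts)          # distinct bugs observed so far
    f1, f2 = freq.get(1, 0), freq.get(2, 0)  # singletons and doubletons
    if f2 == 0:
        # bias-corrected form, used when no bug was seen exactly twice
        return s_obs + f1 * (f1 - 1) / 2
    return s_obs + f1 * f1 / (2 * f2)

# 6 bugs observed: three seen once, two seen twice, one seen five times.
# Estimate: 6 + 3^2 / (2 * 2) = 8.25 total bugs.
assert mh_chao([1, 1, 1, 2, 2, 5]) == 8.25
```

The intuition is that many rarely seen bugs (large f1 relative to f2) imply many bugs not yet seen at all, so the estimate rises well above the observed count.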
VI-B Necessity for More Time-Sensitive Analytics in Crowdtesting Decision Support
As discussed in the background and motivational pilot study (Section II-C), the challenges of crowdtesting management mainly lie in two aspects: uncertainty in crowdworkers' performance and lack of visibility into crowdtesting progress. We believe there is an increasing need for more time-sensitive analytics to support better decision making and fully realize the potential benefits of crowdtesting.
Compared with the two baselines, iSENSE provides additional visibility into the testing progress and insights for effective task management. In particular, during the later stage of crowdtesting process, the performance of iSENSE is significantly better than the baselines (see Table VII).
As discussed in answering RQ4, iSENSE can generate time-based information revealing dynamic crowdtesting progress and provide practical guidelines to help managers make trade-off analysis on which task to close or when to close, based on a set of decision heuristics.
This suggests a significant portion of crowdtesting cost can be saved through employing effective decision support approaches such as iSENSE. This is extremely encouraging and we look forward to more discussion and innovative decision support techniques in this direction.
VI-C Threats to Validity
The external threats concern the generality of this study. First, our experimental data consist of 218 crowdtesting tasks collected from one of the largest Chinese crowdtesting platforms. We cannot assume that our results generalize beyond the environment in which the study was conducted; however, the diversity of tasks and the size of the dataset reduce this risk. Second, our methods depend largely on report attributes (i.e., whether a report contains a bug and whether it duplicates previous ones) assigned by the manager. This threat is addressed to some extent because we collected the data after the crowdtesting tasks were closed, and the managers had no knowledge of this study that could have led them to artificially modify their assignments.
The internal validity of this study mainly concerns the baselines. As there are no existing methods for managing crowdtesting tasks, we chose one commonly used method for managing software quality and one method based on empirical observations of crowdtesting as baselines to demonstrate the advantage of our proposed iSENSE.
The construct validity of this study mainly concerns the experimental setup for determining parameter values. We use the most frequently tuned optimal parameter values, which alleviates randomness, to examine the performance of our proposed iSENSE.
VII Related Work
VII-A Crowdtesting
Crowdtesting has been applied to facilitate many testing activities, e.g., test case generation, usability testing, software performance analysis, and software bug detection and reproduction. These studies leverage crowdtesting to solve problems in traditional testing activities, while other approaches focus on solving problems newly encountered in crowdtesting.
Feng et al. [41, 42] proposed approaches to prioritize test reports in crowdtesting. They designed strategies to dynamically select the most risky and diversified test reports for inspection in each iteration. Jiang et al. proposed a fuzzy clustering framework to reduce the number of test reports to inspect. Wang et al. [23, 44, 4] proposed approaches to automatically classify crowdtesting reports. Cui et al. [5, 45] and Xie et al. proposed crowdworker selection approaches to recommend appropriate crowdworkers for specific crowdtesting tasks.
In this work, we focus on automated decision support for crowdtesting management, which is valuable for improving the cost-effectiveness of crowdtesting and has not been explored before.
VII-B Software Quality Management
Many existing approaches employ risk-driven or value-based analysis to prioritize or select test cases [10, 11, 12, 13, 14, 47], so as to improve the cost-effectiveness of testing. However, none of these is applicable to the emerging crowdtesting paradigm, where managers typically have no control over online crowdworkers' dynamic behavior and uncertain performance.
Existing research has also focused on defect prediction and effort estimation [48, 31, 49, 50, 51]. The core of these approaches is the extraction of features from source code or software repositories. In crowdtesting, however, the platform can neither obtain the source code of the apps under test nor take part in their development process.
Many existing approaches apply over-sampling and under-sampling techniques to alleviate the data imbalance problem in prediction [52, 30, 53]. However, the problem we face is not data imbalance but rather dynamic and uncertain bug arrival data. This is why we employ incremental sampling in this study.
Several studies have investigated time series models for measuring software reliability [54, 55, 56, 57, 58, 59, 9]. Among these, ARIMA is the most promising model for mapping system failures over time. It has been applied to estimating software failures, predicting the evolution of a system project in the maintenance phase, predicting the monthly number of changes of a software project, and modeling time series changes of software. This paper uses ARIMA to model the bug arrival dynamics in crowdtesting and to estimate the future trend.
Another body of previous research aimed at optimizing software inspection by predicting the total and remaining numbers of bugs. Eick et al. reported the first work on employing capture-recapture models in software inspections to estimate the number of faults remaining in requirements and design artifacts. Subsequent studies evaluated the influence of the number of inspectors, the number of actual defects, the dependency among inspectors, and the learning styles of individual inspectors on the accuracy of capture-recapture estimators [33, 34, 17, 15, 18, 35, 36]. These approaches are based on different types of capture-recapture models, and the results showed that MhJK, MhCH, and MtCH are the most effective estimators. We reused all of these estimators and experimentally evaluated them in crowdtesting.
VIII Conclusion
The benefits of crowdtesting have largely been attributed to its potential to get testing done faster and cheaper. Motivated by empirical observations from an industrial crowdtesting platform, this study aimed at developing automated decision support to address management blindness and achieve additional cost-effectiveness.
The proposed iSENSE employs an incremental sampling technique to address the dynamic, parallel characteristics of bug arrival data and integrates two classical prediction models, i.e., CRC and ARIMA, to raise managers' awareness of testing progress through two indicators (i.e., the total number of bugs and the cost required to achieve certain test objectives). Based on these indicators, iSENSE can be used to automate task closing and semi-automate trade-off decisions. Results show that decision automation using iSENSE can largely improve the cost-effectiveness of crowdtesting; specifically, a median of 100% of bugs can be detected with a 30% cost reduction.
The presented material is only the starting point of work in progress. We are collaborating with BigCompany and have begun to deploy iSENSE online. Future work includes evaluating on a broader scope of datasets, incorporating more real-world crowdtesting application scenarios, conducting more experiments in industrial settings, and improving the usability of iSENSE based on evaluation feedback.
-  K. Mao, L. Capra, M. Harman, and Y. Jia, “A survey of the use of crowdsourcing in software engineering,” Journal of Systems and Software, vol. 126, pp. 57–84, 2017.
-  X. Zhang, Y. Feng, D. Liu, Z. Chen, and B. Xu, “Research progress of crowdsourced software testing,” Journal of Software, vol. 29(1), pp. 69–88, 2018.
-  http://www.softwaretestinghelp.com/crowdsourced-testing-companies/, 2018.
-  J. Wang, Q. Cui, S. Wang, and Q. Wang, “Domain adaptation for test report classification in crowdsourced testing,” in Proceedings of the 39th International Conference on Software Engineering: Software Engineering in Practice Track. IEEE Press, 2017, pp. 83–92.
-  Q. Cui, J. Wang, G. Yang, M. Xie, Q. Wang, and M. Li, “Who should be selected to perform a task in crowdsourced testing?” in 2017 IEEE 41st Annual Computer Software and Applications Conference, vol. 1. IEEE, 2017, pp. 75–84.
-  G. J. Myers, C. Sandler, and T. Badgett, The art of software testing. John Wiley & Sons, 2011.
-  W. E. Lewis, Software testing and continuous quality improvement. CRC press, 2016.
-  M. Garg, R. Lai, and S. J. Huang, “When to stop testing: a study from the perspective of software reliability models,” IET software, vol. 5, no. 3, pp. 263–273, 2011.
-  J. Iqbal, N. Ahmad, and S. Quadri, “A software reliability growth model with two types of learning,” in 2013 International Conference on Machine Intelligence and Research Advancement. IEEE, 2013, pp. 498–503.
-  S. Wang, J. Nam, and L. Tan, “QTEP: quality-aware test case prioritization,” in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. ACM, 2017, pp. 523–534.
-  A. Shi, T. Yung, A. Gyori, and D. Marinov, “Comparing and combining test-suite reduction and regression test selection,” in Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. ACM, 2015, pp. 237–247.
-  M. G. Epitropakis, S. Yoo, M. Harman, and E. K. Burke, “Empirical evaluation of pareto efficient multi-objective regression test case prioritisation,” in Proceedings of the 2015 International Symposium on Software Testing and Analysis. ACM, 2015, pp. 234–245.
-  R. K. Saha, L. Zhang, S. Khurshid, and D. E. Perry, “An information retrieval approach for regression test prioritization based on program changes,” in 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, vol. 1. IEEE, 2015, pp. 268–279.
-  C. Henard, M. Papadakis, M. Harman, Y. Jia, and Y. Le Traon, “Comparing white-box and black-box test prioritization,” in 2016 IEEE/ACM 38th International Conference on Software Engineering. IEEE, 2016, pp. 523–534.
-  G. Rong, B. Liu, H. Zhang, Q. Zhang, and D. Shao, “Towards confidence with capture-recapture estimation: An exploratory study of dependence within inspections,” in Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering, 2017, pp. 242–251.
-  G. Liu, G. Rong, H. Zhang, and Q. Shan, “The adoption of capture-recapture in software engineering: a systematic literature review,” in Proceedings of the 19th International Conference on Evaluation and Assessment in Software Engineering. ACM, 2015, p. 15.
-  Y. H. Chun, “Estimating the number of undetected software errors via the correlated capture-recapture model,” European Journal of Operational Research, vol. 175, no. 2, pp. 1180–1192, 2006.
-  N. R. Mandala, G. S. Walia, J. C. Carver, and N. Nagappan, “Application of kusumoto cost-metric to evaluate the cost effectiveness of software inspections,” in Proceedings of the ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, 2012, pp. 221–230.
-  A. Amin, L. Grunske, and A. Colman, “An approach to software reliability prediction based on time series modeling,” Journal of Systems and Software, vol. 86, no. 7, pp. 1923–1932, 2013.
-  C. Chong Hok Yuen, “On analyzing maintenance process data at the global and the detailed levels: A case study,” in Proceedings of the IEEE Conference on Software Maintenance, 1988, pp. 248–255.
-  C. F. Kemerer and S. Slaughter, “An empirical approach to studying software evolution,” IEEE Transactions on Software Engineering, vol. 25, no. 4, pp. 493–509, 1999.
-  I. Herraiz, J. M. Gonzalez-Barahona, and G. Robles, “Forecasting the number of changes in eclipse using time series analysis,” in Fourth International Workshop on Mining Software Repositories. IEEE, 2007, pp. 32–32.
-  J. Wang, Q. Cui, Q. Wang, and S. Wang, “Towards effectively test report classification to assist crowdsourced testing,” in Proceedings of the 10th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. ACM, 2016, p. 6.
-  L. Mora-Applegate and M. Malinowski, “Incremental sampling methodology,” Interstate Technology and Regulatory Council (ITRC), Tech. Rep., 2012.
-  S.-M. Lee, “Estimating population size for capture-recapture data when capture probabilities vary by time, behavior and individual animal,” Communications in Statistics-Simulation and Computation, vol. 25, no. 2, pp. 431–457, 1996.
-  P. S. Laplace, “Sur les naissances, les mariages et les morts,” Histaire de I’Academie Royale des Sciences, p. 693, 1783.
-  A. Chao, “Estimating the population size for capture-recapture data with unequal catchability,” Biometrics, pp. 783–791, 1987.
-  ——, “Estimating animal abundance with capture frequency data,” The Journal of Wildlife Management, pp. 295–300, 1988.
-  K. P. Burnham and W. S. Overton, “Estimation of the size of a closed population when capture probabilities vary among animals,” Biometrika, vol. 65, no. 3, pp. 625–633, 1978.
-  M. Tan, L. Tan, S. Dara, and C. Mayeux, “Online defect prediction for imbalanced data,” in Proceedings of the 37th International Conference on Software Engineering-Volume 2. IEEE Press, 2015, pp. 99–108.
-  J. Nam, W. Fu, S. Kim, T. Menzies, and L. Tan, “Heterogeneous defect prediction,” IEEE Transactions on Software Engineering, 2017.
-  S. H. Kan, Metrics and models in software quality engineering. Addison-Wesley Longman Publishing Co., Inc., 2002.
-  L. C. Briand, K. E. Emam, B. G. Freimut, and O. Laitenberger, “A comprehensive evaluation of capture-recapture models for estimating software defect content,” IEEE Transactions on Software Engineering, vol. 26, no. 6, pp. 518–540, 2000.
-  G. S. Walia and J. C. Carver, “Evaluating the effect of the number of naturally occurring faults on the estimates produced by capture-recapture models,” in 2009 International Conference on Software Testing Verification and Validation, 2009, pp. 210–219.
-  A. Goswami, G. Walia, and A. Singh, “Using learning styles of software professionals to improve their inspection team performance,” International Journal of Software Engineering and Knowledge Engineering, vol. 25, no. 09-10, pp. 1721–1726, 2015.
-  P. Vitharana, “Defect propagation at the project-level: results and a post-hoc analysis on inspection efficiency,” Empirical Software Engineering, vol. 22, no. 1, pp. 57–79, 2017.
-  N. Chen and S. Kim, “Puzzle-based automatic testing: Bringing humans into the loop by solving puzzles,” in Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering. ACM, 2012, pp. 140–149.
-  V. H. M. Gomide, P. A. Valle, J. O. Ferreira, J. R. G. Barbosa, A. F. da Rocha, and T. M. G. d. A. Barbosa, “Affective crowdsourcing applied to usability testing,” Int. J. of Computer Science and Information Technologies, vol. 5, no. 1, pp. 575–579, 2014.
-  R. Musson, J. Richards, D. Fisher, C. Bird, B. Bussone, and S. Ganguly, “Leveraging the crowd: How 48,000 users helped improve lync performance,” IEEE Software, vol. 30, no. 4, pp. 38–45, 2013.
-  M. Gómez, R. Rouvoy, B. Adams, and L. Seinturier, “Reproducing context-sensitive crashes of mobile apps using crowdsourced monitoring,” in 2016 IEEE/ACM International Conference on Mobile Software Engineering and Systems. IEEE, 2016, pp. 88–99.
-  Y. Feng, Z. Chen, J. A. Jones, C. Fang, and B. Xu, “Test report prioritization to assist crowdsourced testing.” in Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, 2015, pp. 225–236.
-  Y. Feng, J. A. Jones, Z. Chen, and C. Fang, “Multi-objective test report prioritization using image understanding,” in 2016 31st IEEE/ACM International Conference on Automated Software Engineering. IEEE, 2016, pp. 202–213.
-  H. Jiang, X. Chen, T. He, Z. Chen, and X. Li, “Fuzzy clustering of crowdsourced test reports for apps,” ACM Transactions on Internet Technology, vol. 18, no. 2, p. 18, 2018.
-  J. Wang, S. Wang, Q. Cui, and Q. Wang, “Local-based active classification of test report to assist crowdsourced testing,” in 2016 31st International Conference on Automated Software Engineering. IEEE, 2016, pp. 190–201.
-  Q. Cui, S. Wang, J. Wang, Y. Hu, Q. Wang, and M. Li, “Multi-objective crowd worker selection in crowdsourced testing,” in 29th International Conference on Software Engineering and Knowledge Engineering, 2017, pp. 218–223.
-  M. Xie, Q. Wang, G. Yang, and M. Li, “Cocoon: Crowdsourced testing quality maximization under context coverage constraint,” in 2017 IEEE 28th International Symposium on Software Reliability Engineering. IEEE, 2017, pp. 316–327.
-  A. Panichella, R. Oliveto, M. Di Penta, and A. De Lucia, “Improving multi-objective test case selection by injecting diversity in genetic algorithms,” IEEE Transactions on Software Engineering, vol. 41, no. 4, pp. 358–383, 2015.
-  T. Menzies, J. Greenwald, and A. Frank, “Data mining static code attributes to learn defect predictors,” IEEE transactions on software engineering, vol. 33, no. 1, pp. 2–13, 2007.
-  A. Agrawal and T. Menzies, ““better data” is better than “better data miners” (benefits of tuning smote for defect prediction),” in Proceedings of the 40th International Conference on Software engineering, 2018.
-  E. Kocaguneli, T. Menzies, and J. W. Keung, “On the value of ensemble effort estimation,” IEEE Transactions on Software Engineering, vol. 38, no. 6, pp. 1403–1416, 2012.
-  P. K. Singh, D. Agarwal, and A. Gupta, “A systematic review on software defect prediction,” in 2015 2nd International Conference on Computing for Sustainable Global Development. IEEE, 2015, pp. 1793–1797.
-  S. Wang, T. Liu, and L. Tan, “Automatically learning semantic features for defect prediction,” in Proceedings of the 38th International Conference on Software Engineering. ACM, 2016, pp. 297–308.
-  K. Gao, T. M. Khoshgoftaar, and R. Wald, “The use of under- and oversampling within ensemble feature selection and classification for software quality prediction,” International Journal of Reliability, Quality and Safety Engineering, vol. 21, no. 01, p. 1450004, 2014.
-  D. Zeitler, “Realistic assumptions for software reliability models,” in Proceedings of the 1991 International Symposium on Software Reliability Engineering. IEEE, 1991, pp. 67–74.
-  C. Bai, Q. Hu, M. Xie, and S. H. Ng, “Software failure prediction based on a Markov Bayesian network model,” Journal of Systems and Software, vol. 74, no. 3, pp. 275–282, 2005.
-  N. Fenton, M. Neil, and D. Marquez, “Using bayesian networks to predict software defects and reliability,” Proceedings of the Institution of Mechanical Engineers, Part O: Journal of Risk and Reliability, vol. 222, no. 4, pp. 701–712, 2008.
-  M. Wiper, A. Palacios, and J. Marín, “Bayesian software reliability prediction using software metrics information,” Quality Technology & Quantitative Management, vol. 9, no. 1, pp. 35–44, 2012.
-  B. Yang, X. Li, M. Xie, and F. Tan, “A generic data-driven software reliability model with model mining technique,” Reliability Engineering & System Safety, vol. 95, no. 6, pp. 671–678, 2010.
-  M. das Chagas Moura, E. Zio, I. D. Lins, and E. Droguett, “Failure and reliability prediction by support vector machines regression of time series data,” Reliability Engineering & System Safety, vol. 96, no. 11, pp. 1527–1534, 2011.
-  S. G. Eick, C. R. Loader, M. D. Long, L. G. Votta, and S. Vander Wiel, “Estimating software fault content before coding,” in Proceedings of the 14th international conference on Software engineering, 1992, pp. 59–65.