Scheduling Tasks for Software Crowdsourcing Platforms to Reduce Task Failure

05/29/2020 ∙ by Jordan Urbaczek, et al. ∙ NYU college ∙ Stevens Institute of Technology

Context: Highly dynamic and competitive crowdsourcing software development (CSD) marketplaces may experience task failure due to unforeseen reasons, such as increased competition over shared supplier resources or uncertainty associated with a dynamic worker supply. Existing analysis reveals an average task failure ratio of 15.7%. The objective of this study is to provide a task scheduling recommendation model for software crowdsourcing platforms in order to improve the success and efficiency of software crowdsourcing. Method: We propose a task scheduling model based on neural networks and develop an approach to predict and analyze task failure probability upon arrival. More specifically, the model takes as input the number of open tasks in the platform, the average similarity between the newly arriving task and the open tasks, the task monetary prize, and the task duration. It then predicts the probability of task failure on the planned arrival date and within a surplus of three days, and recommends posting the task on the day associated with the lowest task failure probability. The proposed model is based on the workflow and data of TopCoder, one of the primary software crowdsourcing platforms. Results: We present a model that suggests the best arrival date for any task in the project, with a surplus of three days per task. On average, the model provided 4% lower task failure probability. The proposed model empowers crowdsourcing managers to explore potential crowdsourcing outcomes with respect to different task arrival strategies.

I Introduction

Crowdsourced Software Development (CSD) has been used increasingly to develop software applications [stol2014two]. Crowdsourcing mini software development tasks leads to lower accelerated development [saremi2017leveraging]. In order for a CSD platform to function efficiently, it must address both the needs of task providers as demand and crowd workers as supply. Any skew in addressing these needs leads to task failure in the CSD platform. Planning for CSD tasks that are complex, independent, and require a significant amount of time, effort, and expertise to achieve the task requirements [stol2014two] is generally challenging. For a task provider, requesting a crowdsourcing service is even more challenging due to uncertainty about the similarity between the tasks already available in the platform and newly arriving tasks [saremi2018hybrid][difallah2016scheduling], as well as about the available crowd workers’ skill sets and performance history [karim2016decision][zaharia2010delay]. These factors raise the issue of receiving qualified submissions, since crowd workers may be interested in multiple tasks from different task providers based on their individual utility factors [faradani2011s].

It is reported that crowd workers are more interested in working on tasks with similar concepts, monetary prizes, technologies, complexities, priorities, and durations [faradani2011s][gordon1961general][difallah2016scheduling][yang2015award]. Attracting workers to a group of similar tasks may cause zero registrations, zero submissions, or unqualified submissions for some tasks because workers lack the time to work on all of them [khanfor2017failure][khazankin2011qos]. In contrast, a lower level of task similarity in the platform leads to a higher chance of task success and higher worker elasticity [saremi2020right].

For example, on Topcoder (https://www.topcoder.com/), a well-known software crowdsourcing platform, on average 13 new tasks arrive daily and are added to an average of 200 existing tasks, which means more demand. Moreover, on average 137 active workers are available to take tasks in that period, which leads to an average of 25 failed tasks. In this example, there will be a long queue of tasks waiting to be taken. Given the fixed submission dates, such a waiting line may result in failed tasks. Such challenges are traditionally addressed with task scheduling methods.

The objective of this study is to provide a task schedule recommendation framework for software crowdsourcing platforms in order to improve the success and efficiency of software crowdsourcing. In this study, we first present a motivating example to explain the current task status in a software crowdsourcing platform. Then, we propose a task scheduling architecture based on a neural network to reduce the probability of task failure in the platform.

More specifically, the framework uses the number of open tasks in the platform, the average similarity between the newly arriving task and the open tasks, the task monetary prize, and the task duration as input to the neural network. It then predicts the probability of task failure on the planned arrival date and within a surplus of three days, so that the task manager can post the task on the day associated with the lowest task failure probability. The proposed model represents a task scheduling method for competitive crowdsourcing platforms based on the workflow of Topcoder, one of the primary software crowdsourcing platforms. The evaluation results showed on average 4% lower task failure probability.

The remainder of this paper is structured as follows. Section II introduces a motivational example that inspires this study. Section III presents background and review of available works. Section IV outlines our research design and methodology. Section V presents the case study and model evaluation, and Section VI presents the conclusion and outlines a number of directions for future work.

II Motivating Example

The motivating example illustrates a real crowdsourcing software development (CSD) project on the TopCoder platform. It comprised 41 tasks with a total project duration of 207 days and an average of 8 days per task. The project experienced a 47% task failure ratio, which means 19 of the 41 tasks failed. Six tasks failed due to client requests (i.e., 14% failure), and 7 tasks failed due to failed requirements (17% failure). The remaining six tasks (i.e., 14% failure) failed due to zero submissions.

Fig. 1: Overview of Tasks’ Status and Similarity Level in the Platform

If we ignore the tasks that failed due to client requests and failed requirements, 28 tasks remain (see Figure 1). Deeper analysis reveals that most of the failed tasks entered a pool of tasks with similarity above 80%.

Also, as Figure 2 illustrates, on average each task competes with 145 similar open tasks upon arrival. The number of open tasks can directly affect the ability to attract suitable workers and can lead to task failure. It is reported that the degree of task similarity in the pool of tasks directly impacts the task competition level and task success [saremi2020right].

Fig. 2: Number of Open Tasks in Platform upon Task Arrival

It seems that task failure is a result of the level of task competition in the platform. This observation motivates us to investigate further and provide a task scheduling recommendation model that helps reduce task failure based on the level of task competition in the platform.

III Background

III-A Task Scheduling in CSD

Different characteristics of machine and human behavior create delays in product release [ruhe2005art]. This leads to a lack of systematic processes to balance the delivery of features with the available resources [ruhe2005art]. Therefore, improper scheduling would result in task starvation [faradani2011s]. Parallelism in scheduling is an effective way to utilize a greater pool of workers [ngo2008optimized, saremi2015empirical]: it encourages workers to specialize and complete tasks in a shorter period, and it promotes solutions that help the requestor clearly understand how workers decide to compete on a task and analyze crowd workers’ performance [faradani2011s]. Shorter schedule planning can be one of the most notable advantages of using CSD for managers [lakhani2010topcoder].

Batching tasks in similar groups is another effective method to reduce the complexity of tasks, and it will dramatically reduce cost[marcus2011human]. Batching crowdsourcing tasks would lead to a faster result than approaches which keep workers separate and is also quicker than the average of the fastest individual worker [bernstein2011crowds]. There is a theoretical minimum batch size for every project as one of the principles of product development flow [reinertsen2009celeritas]. To some extent, the success of software crowdsourcing is associated with reduced batch size in small tasks. Besides, the delay scheduling method [zaharia2010delay] was specially designed for crowdsourced projects to maximize the probability of a worker receiving tasks from the same batch of tasks they were performing. An extension of this idea introduced a new method called “fair sharing schedule” [ghodsi2011dominant]. In this method, various resources would be shared among all tasks with different demands, which ensures that all tasks would receive the same amount of resources to be fair. For example, this method was used in Hadoop Yarn. Later, Weighted Fair Sharing (WFS) [difallah2016scheduling] was presented as a method to schedule batches based on their priority. Tasks with higher priority are introduced first.

Another proposed crowd scheduling method is based on quality of service (QoS) [khazankin2011qos], a skill-based scheduling method with the purpose of minimizing scheduling while maximizing quality by assigning the task to the most available qualified worker. This scheme was created by extending the Web Service Level Agreement (WSLA) standards [ludwig2003web]. The third available method is HIT-Bundle [difallah2016scheduling], a batch container that schedules heterogeneous tasks from different batches into the platform. This method achieves a higher outcome by applying different scheduling strategies at the same time. The most recent method helps crowdsourcing-based service providers meet completion-time SLAs [hirth2019task]. The system works based on the waiting time of the oldest task and runs a simulation-based evaluation to recommend the best scheduling strategy in order to reduce the task failure ratio.

III-B Task Similarity in CSD

Generally, workers tend to optimize their personal utility factor when registering for a task [faradani2011s]. It is reported that workers are more interested in working on tasks that are similar in terms of monetary prize [yang2015award], context and technology [difallah2016scheduling], and complexity level. Context switching reduces workers’ efficiency [difallah2016scheduling]. However, workers usually try to register for a greater number of tasks than they can complete [yang2016should]. It is reported that a high task similarity level negatively impacts task competition level and team elasticity [saremi2020right]. The combination of these observations leads to task failure due to: 1) receiving zero registrations for a task because of a low degree of similar tasks and a lack of available skillful workers [yang2015award], and 2) receiving non-qualified submissions or zero submissions because workers lack the time to work on all the tasks they registered for [archak2010money].

III-C Challenges in CSD

Considering the highest rate for task completion and accepting submissions, software managers will be more concerned about the risks of adopting crowdsourcing. Therefore, there is a need for a better decision-making system to analyze and control the risk of insufficient competition and poor submissions due to the attraction of untrustworthy workers. A traditional method of addressing this problem in the software industry is task scheduling. Scheduling helps prioritize access to resources. It can help managers optimize task execution in the platform to attract the most reliable and trustworthy workers. Normally, in traditional methods, task requirements and phases are fixed, while cost and time are flexible. In a time-boxed system, time and cost are fixed, while task requirements and phases are flexible [cooper2016agile]. However, in CSD all three variables are flexible. This creates a huge advantage in crowdsourcing software projects.

Generally, improper scheduling could lead to task starvation [faradani2011s], since workers with high abilities tend to compete with low-skilled workers [archak2010money]. Hence, users are more likely to choose tasks with fewer competitors [yang2008crowdsourcing]. Also, workers who intentionally choose less popular tasks could potentially enhance their winning probabilities, even if workers share similar expertise. This brings some severe problems to the CSD trust system and causes many dropped and non-completed tasks. Moreover, tasks with relatively lower monetary prizes have a high probability of being chosen and solved, which results in only 30% of the problems in the platform being solved [rapoport1966game]. This may attract higher numbers of workers to compete and consequently creates a higher chance of starvation for more expensive tasks and of project failure.

The above issues indicate the importance of task scheduling in the platform in order to attract the right amount of trustworthy and expert workers as well as shorten the release time.

Fig. 3: Overview of Scheduling Architecture

IV Research Design and Methodology

To solve the scheduling problem, we designed a model to predict the probability of task failure and recommend an arrival date based on the lowest predicted failure. To do so, we used a neural network model to predict the task failure probability per day, and then added a search-based optimizer to recommend the arrival day with the lowest failure probability. This architecture can be operated on any crowdsourcing platform; however, we focused on TopCoder as the target platform. In this method, the task arrival date is suggested based on the degree of task similarity in the platform and the reliability of available workers in making a valid submission. Figure 3 presents an overview of the task scheduling architecture. Tasks from a new project are submitted by the client, and each task is uploaded to the task scheduler. The task failure predictor analyzes the task’s probability of failure in the platform based on the number of similar open tasks on the same day, the average similarity, the task duration, and the associated monetary prize, and then reports the probability of task failure for the assigned date within a surplus of three days. In the next step, the task manager selects the most suitable arrival date among the three recommended days and schedules the task to be posted. The result of the task’s performance in the platform is collected to be reported to the client and is also used as input to the task recommender.
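The selection step described above can be sketched as a small search over the candidate posting days. This is only an illustrative sketch: the function names, the feature keys, and the `toy_predictor` stand-in for the trained network are all assumptions, as is breaking ties in favor of the earliest day.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical feature record for one candidate posting day; the keys are
# illustrative and not taken from the paper.
Features = Dict[str, float]


def recommend_posting_day(
    candidates: List[Tuple[int, Features]],
    predict_failure: Callable[[Features], float],
) -> Tuple[int, float]:
    """Return (day_offset, predicted_failure) for the lowest predicted failure.

    Ties go to the earliest candidate day (an assumption of this sketch).
    """
    if not candidates:
        raise ValueError("at least one candidate day is required")
    best_day, best_prob = candidates[0][0], float("inf")
    for day_offset, features in candidates:
        prob = predict_failure(features)
        if prob < best_prob:  # strict '<' keeps the earliest day on ties
            best_day, best_prob = day_offset, prob
    return best_day, best_prob


def toy_predictor(features: Features) -> float:
    """Stand-in for the trained network: more similar open tasks, more risk."""
    risk = 0.5 + 0.002 * features["open_tasks"] * features["avg_similarity"]
    return min(1.0, risk)


# Planned arrival day (offset 0) plus the two following days.
candidates = [
    (0, {"open_tasks": 150.0, "avg_similarity": 0.90, "prize": 750.0, "duration": 8.0}),
    (1, {"open_tasks": 140.0, "avg_similarity": 0.80, "prize": 750.0, "duration": 8.0}),
    (2, {"open_tasks": 145.0, "avg_similarity": 0.85, "prize": 750.0, "duration": 8.0}),
]
best_day, best_prob = recommend_posting_day(candidates, toy_predictor)
print(best_day, round(best_prob, 3))
```

With only three candidate days, an exhaustive scan is sufficient; a more elaborate search would matter only if the surplus window grew or tasks had to be scheduled jointly.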

Type | Metric | Definition
Task attributes | Task registration start date (TR) | The first day the task is available in the platform and workers can start registering for it. Range: (0, ∞).
Task attributes | Task submission end date (TS) | Deadline by which all workers who registered for the task have to submit their final results. Range: (0, ∞).
Task attributes | Task registration end date (TRE) | The last day that a task is available for registration. Range: (0, ∞).
Task attributes | Monetary Prize (P) | Monetary prize (dollars) in the task description. Range: (0, ∞).
Task attributes | Technology (Tech) | Required programming language to perform the task. Range: (0, ∞).
Task attributes | Platforms (PLT) | Number of platforms used in the task. Range: (0, ∞).
Task performance | Task Status | Completed or failed tasks.
Task performance | # Registration (R) | Number of registrants that are willing to compete on the total number of tasks in a specific period of time. Range: (0, ∞).
Task performance | # Submissions (S) | Number of submissions that a task receives by its submission deadline in a specific period of time. Range: (0, # registrants].
Task performance | # Valid Submissions (VS) | Number of submissions that a task receives by its submission deadline and that passed the peer review in a specific period of time. Range: (0, # registrants].
TABLE I: Summary of Metrics Definition

IV-A Dataset

The gathered dataset contains 403 individual projects, including 4,908 component development tasks and 8,108 workers, from Jan 2014 to Feb 2015, extracted from the Topcoder website. Tasks are uploaded as competitions in the platform, where crowd software workers register and complete the challenges. On average, most of the tasks have a life cycle of 14 days from the first day of registration to the submission deadline. When workers submit their final files, the submissions are reviewed by experts, who check the results and label each one as a valid or invalid submission. Table I summarizes the task features available in the dataset.

IV-B Input to the Model

It is reported that task monetary prize and task duration [yang2015award][faradani2011s] are the most important factors in attracting competition for a task. In this research, we add our observations from the motivating example (i.e., the number of open tasks and the average task similarity) to the reported list of important factors as inputs to the presented model. To provide effective task scheduling in CSD, and after understanding the data, the average task similarity, task duration, actual task monetary prize, number of open tasks, and probability of task failure in the platform were defined as below as inputs to the model. The probability of task failure was used as the reward function in order to train the neural network model.

First, we need to understand the degree of task similarity among a set of simultaneously open tasks in the platform. Def. 1: Task Similarity. The similarity between two tasks i and j is defined as the weighted sum of all local similarities across the features listed in Table II.

Feature Description of distance measure
Task Monetary Prize (P) ( - ) =
Task registration start date (TR) ( - ) =
Task submission end date (TS) - ) =
Task Type ( == ) ? 1 : 0
Technology (Tech) Match(:)=
Platform (PL) ( == ) ? 1 : 0
Detailed Requirement ()
TABLE II: Features used to measure task distance
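As an illustration of the weighted-sum idea in Def. 1, the sketch below combines per-feature local similarities into a single score. The local measures are assumptions of this sketch (scaled absolute differences for numeric features, exact match for task type and platform, set overlap for technologies), as are the equal weights, the dictionary keys, and the `max_diffs` scaling constants. The detailed-requirement text feature is left out for brevity.

```python
from typing import Any, Dict, Set


def numeric_similarity(a: float, b: float, max_diff: float) -> float:
    """1 minus the absolute difference, scaled by an assumed maximum difference."""
    if max_diff <= 0:
        raise ValueError("max_diff must be positive")
    return max(0.0, 1.0 - abs(a - b) / max_diff)


def exact_match(a: str, b: str) -> float:
    """1 if both values are equal, else 0 (the '(x == y) ? 1 : 0' pattern)."""
    return 1.0 if a == b else 0.0


def overlap(a: Set[str], b: Set[str]) -> float:
    """Share of matching technologies relative to the larger set (assumed form)."""
    if not a or not b:
        return 0.0
    return len(a & b) / max(len(a), len(b))


def task_similarity(
    task_i: Dict[str, Any],
    task_j: Dict[str, Any],
    weights: Dict[str, float],
    max_diffs: Dict[str, float],
) -> float:
    """Weighted sum of local similarities between two tasks."""
    local = {
        "prize": numeric_similarity(task_i["prize"], task_j["prize"], max_diffs["prize"]),
        "reg_start": numeric_similarity(task_i["reg_start"], task_j["reg_start"], max_diffs["reg_start"]),
        "sub_end": numeric_similarity(task_i["sub_end"], task_j["sub_end"], max_diffs["sub_end"]),
        "type": exact_match(task_i["type"], task_j["type"]),
        "tech": overlap(task_i["tech"], task_j["tech"]),
        "platform": exact_match(task_i["platform"], task_j["platform"]),
    }
    return sum(weights[name] * value for name, value in local.items())


# Illustrative tasks; dates are day indices, all values are made up.
task_a = {"prize": 750.0, "reg_start": 10, "sub_end": 18, "type": "Code",
          "tech": {"Java", "SQL"}, "platform": "Web"}
task_b = {"prize": 1000.0, "reg_start": 12, "sub_end": 20, "type": "Code",
          "tech": {"Java"}, "platform": "Web"}
features = ["prize", "reg_start", "sub_end", "type", "tech", "platform"]
weights = {name: 1.0 / len(features) for name in features}  # equal weights (assumption)
max_diffs = {"prize": 1000.0, "reg_start": 30.0, "sub_end": 30.0}

score = task_similarity(task_a, task_b, weights, max_diffs)
print(round(score, 4))
```

Because the weights sum to one and every local similarity lies in [0, 1], the score stays in [0, 1], which makes thresholds such as the 80% similarity level from the motivating example directly comparable across tasks.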

Def. 2: Task Duration is the total available time from the registration start date (TR) of task i to its submission end date (TS).

Def. 3: Actual Prize is the prize that the winner and the runner-up will receive after passing peer review.

Def. 4: Number of Open Tasks is the number of tasks that are open for registration at the arrival time of the new task.

Def. 5: Average Task Similarity is the average similarity score between the arriving task and the open tasks in the platform.

Def. 6: Task Failure Rate is the probability, per day, that the arriving task does not receive a valid submission that passes the peer review.
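A minimal sketch of how the model inputs from Defs. 2 to 5 could be computed from task records follows. The `Task` fields, the day-index representation of dates, the reading of the actual prize as the winner and runner-up prizes added together, and the prize-based `toy_similarity` function are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    reg_start: int          # registration start date (TR), as a day index
    reg_end: int            # registration end date (TRE), as a day index
    sub_end: int            # submission end date (TS), as a day index
    winner_prize: float     # prize paid to the winner
    runner_up_prize: float  # prize paid to the runner-up


def duration(task: Task) -> int:
    """Def. 2: days from registration start to submission end."""
    return task.sub_end - task.reg_start


def actual_prize(task: Task) -> float:
    """Def. 3, read here as winner plus runner-up prize (an assumption)."""
    return task.winner_prize + task.runner_up_prize


def open_tasks(pool: List[Task], arrival_day: int) -> List[Task]:
    """Def. 4: tasks whose registration window contains the arrival day."""
    return [t for t in pool if t.reg_start <= arrival_day <= t.reg_end]


def average_similarity(new_task: Task, others: List[Task],
                       similarity: Callable[[Task, Task], float]) -> float:
    """Def. 5: mean similarity between the arriving task and the open tasks."""
    if not others:
        return 0.0
    return sum(similarity(new_task, t) for t in others) / len(others)


def toy_similarity(a: Task, b: Task) -> float:
    """Placeholder similarity based only on total prize (illustrative)."""
    pa, pb = actual_prize(a), actual_prize(b)
    return 1.0 - abs(pa - pb) / max(pa, pb)


pool = [
    Task(reg_start=5, reg_end=9, sub_end=15, winner_prize=500.0, runner_up_prize=250.0),
    Task(reg_start=8, reg_end=12, sub_end=20, winner_prize=1000.0, runner_up_prize=500.0),
    Task(reg_start=9, reg_end=11, sub_end=16, winner_prize=750.0, runner_up_prize=250.0),
]
new_task = Task(reg_start=10, reg_end=13, sub_end=18, winner_prize=800.0, runner_up_prize=400.0)

still_open = open_tasks(pool, new_task.reg_start)
model_input = [
    float(len(still_open)),
    average_similarity(new_task, still_open, toy_similarity),
    actual_prize(new_task),
    float(duration(new_task)),
]
print(model_input)
```

In practice these four values would usually be scaled to comparable ranges before being passed to a neural network; the source does not state which scaling was used.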

IV-C Output of the Model

The goal of the proposed model is not only to predict the probability of failure of a newly arriving task on its arrival day, but also to recommend the most suitable posting day within a surplus of three days in order to decrease the task failure probability. To do so, we run the model and evaluate the result for the arrival day, one day after, and two days after. To predict task failure on future days, we need to know the number of expected arrivals and open tasks per day as well as their task similarity scores.

Def. 7: Rate of Task Arrival per Day. Considering that at any point in time the number of tasks that will be closed tomorrow is known, the rate of task arrival per day is defined as the ratio of the number of open tasks per day to the total duration of tasks per day.

By knowing the rate of task arrival per day, the number of open tasks tomorrow is defined as:

Def. 8: Number of Open Tasks in One Day is the number of tasks that will still be open in one day plus the rate of task arrival per day.

We also need to know the average task similarity on future days. Def. 9: Average Task Similarity One Day After is defined as the number of tasks that will still be open one day after, times the average task similarity of the tasks that are still open one day after arrival, plus the rate of task arrival per day times the average task similarity of the arrival day.
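The one-day-ahead projections can be sketched as below. The function and parameter names are hypothetical. Reading "total duration of tasks" as the typical task life cycle in days is an assumption; under that reading the arrival rate mirrors Little's law (open tasks = arrival rate × time in system). The text states the projected similarity as a weighted sum, so dividing by the projected number of open tasks to turn it into an average is offered only as an optional, assumed step.

```python
def arrival_rate(open_tasks_today: float, task_duration_days: float) -> float:
    """Expected arrivals per day: open tasks divided by task duration."""
    if task_duration_days <= 0:
        raise ValueError("task_duration_days must be positive")
    return open_tasks_today / task_duration_days


def open_tasks_next_day(still_open: float, rate: float) -> float:
    """Tasks that remain open tomorrow plus the expected new arrivals."""
    return still_open + rate


def similarity_next_day(still_open: float, sim_still_open: float,
                        rate: float, sim_arrival_day: float,
                        normalize: bool = False) -> float:
    """Projected similarity one day ahead.

    normalize=False follows the weighted sum as worded in the text.
    normalize=True divides by the projected number of open tasks so the
    result stays on the same scale as a similarity score (an assumption).
    """
    weighted = still_open * sim_still_open + rate * sim_arrival_day
    if not normalize:
        return weighted
    return weighted / open_tasks_next_day(still_open, rate)


# Illustrative numbers: 140 open tasks, a 14-day life cycle, and 128 tasks
# known to remain open tomorrow.
rate = arrival_rate(open_tasks_today=140, task_duration_days=14)
projected_open = open_tasks_next_day(still_open=128, rate=rate)
projected_similarity = similarity_next_day(128, 0.8, rate, 0.6, normalize=True)
print(rate, projected_open, round(projected_similarity, 4))
```

Applying the same functions to their own output gives the projection for two days after arrival, which covers the third candidate posting day.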

IV-D Neural Network Architecture

A fully connected feed-forward neural network was trained to predict failure rates based on the four features described above. The network is configured with layers of size 32, 16, 8, 4, 2, and 1, that is, five hidden layers followed by a single output unit. Training used a batch size of 8 for 200 epochs with a mean squared error loss. The data was split into an 80% training set and a 20% validation set.
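To make the layer layout concrete, the following dependency-free sketch builds a network with the stated layer sizes and runs one forward pass. It does not reproduce the training procedure. The activation functions (ReLU in hidden layers, sigmoid on the output), the Glorot-style initialization, and the assumption that the four inputs are already scaled to [0, 1] are choices of this sketch, since the text does not specify them.

```python
import math
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight rows, biases)

# Four input features followed by the layer sizes given in the text.
LAYER_SIZES = [4, 32, 16, 8, 4, 2, 1]


def init_network(sizes: List[int], seed: int = 0) -> List[Layer]:
    """Create randomly initialized dense layers (Glorot-uniform, an assumption)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        limit = math.sqrt(6.0 / (n_in + n_out))
        weights = [[rng.uniform(-limit, limit) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers: List[Layer], x: List[float]) -> float:
    """Forward pass: ReLU on hidden layers, sigmoid on the single output."""
    activation = list(x)
    for index, (weights, biases) in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, activation)) + b
             for row, b in zip(weights, biases)]
        if index == len(layers) - 1:
            activation = [1.0 / (1.0 + math.exp(-v)) for v in z]
        else:
            activation = [max(0.0, v) for v in z]
    return activation[0]


def count_parameters(layers: List[Layer]) -> int:
    """Total number of weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


network = init_network(LAYER_SIZES)
# Hypothetical scaled inputs: open tasks, average similarity, prize, duration.
probability = forward(network, [0.70, 0.85, 0.75, 0.57])
n_params = count_parameters(network)
print(n_params, round(probability, 3))
```

Under these assumptions the network has 873 trainable parameters, which is small relative to the roughly 4,900 tasks in the dataset. A sigmoid output keeps predictions in (0, 1), which fits a probability target trained with a mean squared error loss.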

V Case Study and Model Evaluation

To test the accuracy and validity of the presented model, we used it to reschedule the project from the motivating example. The results are discussed below.

V-A Result of the Model

The presented model was run on the data from the motivating example with the goal of finding the most effective arrival day to reduce the failure probability. Figure 4 presents the initial result of the model.

Fig. 4: Comparison of initial Task Failure Prediction and Actual Failure

As shown in Figure 4, the arrival dates recommended by the presented model for the project’s tasks give an average failure prediction of 0.83, which is 0.03 lower than that of the actual schedule. The initial failure prediction by the model is closer to the mean of the actual failure, with a standard deviation of 0.09. Interestingly, the duration of the project did not change under the new scheduling recommendation. Table III summarizes the statistics of actual and predicted failure for the project.

Statistics Actual P(failure) Predicted P(failure)
Min 0.42 0.61
Max 1 0.94
Mean 0.86 0.83
Median 0.90 0.84
Std 0.15 0.09
TABLE III: summary of actual and prediction failure

In next step, the model provides prediction for one day after and two days after the predicted day. This provides more insight for a project manager to choose the most suitable arrival date for their task. Figure 5 illustrates the result of the failure prediction of all the three dates.

Fig. 5: Details of Failure Prediction per Task for all level of predictions

Tasks 8, 15, 18, 20, 25, and 28 receive the lowest failure prediction on the second day, with an average prediction of 0.81, and tasks 9, 10, 15, 22, 23, and 26 receive the lowest failure prediction on day 3, with an average prediction of 0.8. For the rest of the tasks, the lowest failure prediction occurs on the first day, with an average of 0.81. Not only is the average of all three predictions lower than the actual failure, but most of the prediction points on all three days are also lower than the average actual failure.

Fig. 6: Comparison of Task Failure Prediction for final Schedule and Actual Schedule

Having access to the predictions in Figure 5 provides the opportunity to plan the task schedule with the minimum task failure probability within a surplus of 3 days per task. Figure 6 presents the failure probability for the project when following the lowest failure prediction per task, in comparison with the actual task failure. It is clear that the recommended schedule provides a more stable probability of task failure, with an average of 0.81, while the probability of task failure for the actual task schedule is 0.86. The recommended task schedule provides a minimum failure prediction of 0.61 and a maximum of 0.94, with a standard deviation of 0.09. The accuracy of the model is 0.896.

To evaluate the model performance, we applied Mean Squared Error (MSE) to estimate the difference between the final failure ratio and the actual failure on the same day according to the available data. Figure 7 presents the MSE for failure prediction per task. The average MSE is 0.09, with a minimum of 0.001 for task 3, a maximum of 0.23 for task 1, and a standard deviation of 0.06.

Fig. 7: MSE for Probability of Task Failure per Task
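The per-task error summary reported above can be computed along the following lines. The values are made up for illustration and are not the paper's data. Treating each task's error as the squared difference between one predicted and one actual failure probability, and using the population standard deviation, are assumptions of this sketch.

```python
import statistics
from typing import Dict, List


def squared_errors(predicted: List[float], actual: List[float]) -> List[float]:
    """Squared difference between predicted and actual failure, per task."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return [(p - a) ** 2 for p, a in zip(predicted, actual)]


def summarize(errors: List[float]) -> Dict[str, float]:
    """Mean, minimum, maximum, and population standard deviation of the errors."""
    return {
        "mean": statistics.mean(errors),
        "min": min(errors),
        "max": max(errors),
        "std": statistics.pstdev(errors),
    }


# Illustrative failure probabilities for three tasks.
predicted = [0.81, 0.84, 0.90]
actual = [1.00, 0.90, 0.87]
errors = squared_errors(predicted, actual)
summary = summarize(errors)
print({name: round(value, 4) for name, value in summary.items()})
```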

V-B Model Validation

To compare the performance of the proposed model, we applied Leave-One-Out Cross-Validation on the dataset to predict the probability of task failure with four different prediction approaches. The estimated probabilities of task failure are used to compute four popular performance measures that are widely used in current prediction systems for software development: 1) Mean Magnitude of Relative Error (MMRE), 2) Median Magnitude of Relative Error (MdMRE), 3) Standard Deviation of Magnitude of Relative Error (StdMRE), and 4) the percentage of estimates with Relative Error less than or equal to N% (Pred(N)).
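The four measures listed above follow from the magnitude of relative error, MRE = |actual − predicted| / |actual|. The sketch below computes them with the standard library; the abbreviations MMRE, MdMRE, StdMRE, and Pred(N) are the conventional names in the effort-estimation literature, the sample values are illustrative, and the use of the population standard deviation is an assumption.

```python
import statistics
from typing import Dict, List


def magnitude_of_relative_error(actual: float, predicted: float) -> float:
    """MRE = |actual - predicted| / |actual|; the actual value must be non-zero."""
    if actual == 0:
        raise ValueError("MRE is undefined when the actual value is zero")
    return abs(actual - predicted) / abs(actual)


def evaluate(actuals: List[float], predictions: List[float],
             n_percent: float = 25.0) -> Dict[str, float]:
    """Compute MMRE, MdMRE, StdMRE, and Pred(N) for paired observations."""
    if len(actuals) != len(predictions):
        raise ValueError("actuals and predictions must have the same length")
    mres = [magnitude_of_relative_error(a, p)
            for a, p in zip(actuals, predictions)]
    within = sum(1 for m in mres if m <= n_percent / 100.0)
    return {
        "MMRE": statistics.mean(mres),
        "MdMRE": statistics.median(mres),
        "StdMRE": statistics.pstdev(mres),  # population std (assumption)
        "Pred": within / len(mres),
    }


# Illustrative failure probabilities, not taken from the paper.
actuals = [0.8, 1.0, 0.5, 0.9]
predictions = [0.84, 0.9, 0.6, 0.9]
metrics = evaluate(actuals, predictions, n_percent=10.0)
print({name: round(value, 4) for name, value in metrics.items()})
```

Lower MMRE, MdMRE, and StdMRE indicate smaller and more consistent errors, while a higher Pred(N) indicates that more estimates fall within N% of the actual value. The error measures and Pred(N) can therefore rank approaches differently.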

Fig. 8: Performance of Task Failure Probability by Each Approach

The primary result of this analysis is shown in Figure 8. It is clear that the neural network has better predictive performance, and it also has almost the lowest error rate, with an average of 14%, while SVR is the runner-up with an average error of 15%. Interestingly, moving average and linear regression provide the same level of predictive performance, while linear regression provides a lower average error.

V-C Threats to Validity

First, the study only focuses on competitive CSD tasks on the TopCoder platform. Many more platforms exist, and even though the results are based on a comprehensive set of about 5,000 development tasks, they cannot be claimed to be externally valid. There is no guarantee that the same results would hold on other CSD platforms.

Second, many different factors may influence task similarity, task success, and completion. Our similarity algorithm and task failure probability approach are based on known task attributes in TopCoder. Different similarity algorithms and task failure probability approaches may lead to different but largely similar results.

Third, the result is based on tasks only. Workers’ networks and communication were not considered in this research. In the future, we need to add this level of analysis to the existing one.

VI Conclusion and Future Work

CSD provides software organizations with access to a virtually infinite online supply of workers. Assigning tasks to a pool of unknown workers from all over the globe is challenging. A traditional approach to solving this challenge is task scheduling. Improper task scheduling in CSD may cause zero task registrations, zero task submissions, or low-quality submissions due to uncertain worker behavior, and consequently task failure. This research presents a new scheduling model based on a neural network to reduce the probability of task failure in CSD platforms. The experiment reduced project failure probability by up to 4% within the same project duration.

In future research, we will focus on expanding the model into a more sophisticated framework that involves the similarity of available workers and considers the impact of workers’ competition performance on task success, in order to provide a more efficient scheduling model.

References