Increasingly, software is being developed using continuous deployment methods (Paasivaara et al., 2018; Santos et al., 2016; Hohl et al., 2018; Parnin et al., 2017). In such projects, software is never “finished” in the traditional sense. Rather, it is constantly being evolved in response to an ever-changing set of requirements.
Boehm et al.’s COCOMO model can derive software effort estimates (Boehm et al., 2000), but it assumes the projects use a waterfall development style (which is incompatible with open-source development).
Musa et al.’s reliability models (Musa, 1993) can predict the mean time to the next failure in safety-critical systems. But it is hard to apply that style of analysis to open-source development, since it assumes that the code base is essentially stable (which is not true for open-source projects). Also, open-source projects rarely track, or accurately record, their mean time between failures.
To better address the needs of management, the software engineering community needs new kinds of prediction systems. Specifically, software engineering managers need project health indicators that assess the health of a project at some future point in time. This is useful for many reasons.
Commercial companies can avoid using open-source packages that are expected to grow unhealthy.
Open-source vendors can automatically monitor the health of the packages in their ecosystem. Those vendors can then decide what packages to eject from their next release of (e.g.) an open-source operating system.
Also, for packages that are very important to an ecosystem, vendors can detect and repair packages with falling health.
Lastly, for organizations that maintain large suites of open-source packages, project health indicators can help managers intelligently decide how to move staff between different projects.
In theory, predicting software project health is a complicated process. Projects that are continuously evolving are also continuously changing as they react to perpetually changing circumstances. In such a chaotic environment, our pre-experimental intuition is that it would be very difficult to predict software project health.
The good news offered in this paper is that such predictions are possible. We find that open-source projects obey the law of large numbers; that is, they offer stable long-term results for the averages across the many random events within a project. Writing in the 1940s (Asimov, 1950), Asimov conjectured that while one cannot foresee the actions of a particular individual, the laws of large numbers as applied to large groups of people could predict the general flow of future events. To make that argument, he used the analogy of a gas:
While it is difficult to predict the activity of a single molecule in a gas, kinetic theory can predict the mass action of the gas to a high level of accuracy.
70 years later, in 2020, we can now assert that for open-source software, Asimov’s conjecture is correct. We show that
While it is difficult to predict the activity of a single developer in a project, data mining can predict the mass action of the project to a high level of accuracy.
Han et al. note that popular open-source projects tend to be more active (Han et al., 2019). Also, many other researchers agree that healthy open-source projects need to be “vigorous” and “active” (Wahyudin et al., 2007; Jansen, 2014; Manikas and Hansen, 2013; Link and Germonprez, 2018; Wynn Jr, 2007; Crowston and Howison, 2006). Hence, to assess project health, we look at project activity. Specifically, using 78,455 months of data from GitHub, we make predictions for the April 2020 activity within 1,628 GitHub projects; specifically:
The number of contributors who will work on the project;
The number of commits that project will receive;
The number of open pull-requests in that project;
The number of closed pull-requests in that project;
The number of open issues in that project;
The number of closed issues in that project;
Project popularity trends (number of GitHub “stars”).
This paper is structured around the following research questions.
RQ1: Can we predict trends in project health indicators?
We apply five popular machine learning algorithms (i.e., KNN, SVR, LNR, RFT and CART) and one state-of-the-art hyperparameter-optimized predictor (DECART) to 1,628 open-source projects collected from GitHub. Once we collected T months of data for a project, we made predictions for its current status (as of April 2020) using data from months one to T − k, for k = 1, 3, 6, 12 months in the past. DECART’s median error in those experiments is under 10% (where this error is calculated using the predicted and actual values seen after training on past months and testing on the April 2020 values). Hence, we say:
Answer 1: Many project health indicators can be predicted, with good accuracy, for 1, 3, 6, 12 months into the future.
RQ2: What features matter the most in prediction? To find the most important features used in prediction, we look into the internal structure of the best-performing model and count the number of times each feature was used when predicting the monthly trends.
Answer 2: In our study, “monthly_ISSUEcomments”, “monthly_commit”, “monthly_fork” and “monthly_star” are the most important features, while “monthly_PRmerger” is the least used feature for all seven health indicators’ predictions.
RQ3: Which methods achieve the best prediction performance? We compare the performance of each method on all 1,628 open-source projects when predicting 1, 3, 6, and 12 months into the future. After a statistical comparison between the different learners, we find that:
Answer 3: DECART achieves better prediction performance than the other methods in 91% of our 1,628 projects.
Overall, the main contributions of this paper are as follows:
We demonstrate that it is possible to accurately predict the health indicators of software projects for 1, 3, 6, 12 months into the future.
For researchers wishing to reproduce/improve/refute our conclusions, we offer a collection of 78,455 health-related monthly data from 1,628 GitHub repositories.
We also show that, for this data, hyperparameter optimization is effective and fast for predicting project health indicators.
This paper is organized as follows: Section 2 explains related work on software analytics of open-source projects and the difference between our work and prior studies. Section 3 introduces the current problems of open-source software development, the background of software project health, and the techniques used in related studies. After that, Section 4 describes our open-source project data mining and the experimental setup. Section 5 presents the experimental results and answers the research questions. This is followed by Section 6 and Section 7, which discuss the findings from the experiment and the potential threats to validity. Finally, conclusions and future work are given in Section 8.
For a replication package of this work, please see https://github.com/randompeople404/health_indicator_2020.
2. Related Work
Our study stands out from prior work in several ways.
Firstly, we use more current data than prior studies. To the best of our knowledge, this is the largest study yet conducted, using the most recent data, for predicting multiple health indicators of open-source projects. Looking at prior work that studied multiple health indicators, two closely comparable studies to this paper are Healthy or not: A way to predict ecosystem health in GitHub by Liao et al. (Liao et al., 2019) and A Large Scale Study of Long-Time Contributor Prediction for GitHub Projects by Bao et al. (Bao et al., 2019). Those papers studied 52 and 917 projects, respectively, while we explore 1,628 projects. Further, much of our data is current (we predict for April 2020 values) while much prior work uses project data that is years to decades old (Sarro et al., 2016).
Secondly, we explore different kinds of predictions than prior work. For example, the goal of Bao et al.’s paper is to predict whether a programmer will become a long-term contributor to a GitHub project. While this is certainly an important question, it is a question about individuals within a project. The goal of our paper is to offer management advice at the project level.
Thirdly, we explore more kinds of predictions than prior work. Much of the prior work on open-source projects predicts just a single feature (e.g., (Borges et al., 2016a; Kikas et al., 2016; Chen et al., 2014; Weber and Luo, 2014; Bao et al., 2019)). Our work is not about applying sophisticated methods to predict one particular goal. Rather, it shows that it is effective to predict multiple goals in GitHub data without using techniques specialized for each goal. This paper reports success on nearly all the indicators we explore. Hence, we conjecture that there could be many more aspects of open-source projects that could be accurately predicted using methods like DECART (and this would be a fruitful area for future research).
Fourthly, our study has better prediction results than those reported previously in the software estimation literature. Recall that we achieve error rates under 10%. It is hard to directly compare that number against many other results (due to differences in experimental conditions). But what is true is that prior researchers were content with only semi-approximate predictions. Bao et al.’s predictions for 12 months into the future were still 25% away from the best possible value (see Table 25 of (Bao et al., 2019)). Sarro et al.’s ICSE’16 paper argues for the superiority of their preferred techniques after seeing error rates in five datasets of 25, 30, 40, 45, and 55% (see Figure 1a of (Sarro et al., 2016)). And as for Boehm et al. (Boehm et al., 2000), they had very low expectations for their COCOMO estimation system: specifically, they declared success if estimations were less than 30% wrong.
We conjecture that our error rates are so low because, fifthly, we use arguably better technology than prior work. Most prior work neglects to tune the control parameters of the learners. This is not ideal since recent research in SE reports that such tuning can significantly improve the performance of models used in software analytics (Tantithamthavorn et al., 2016; Fu et al., 2016; Menzies and Shen, 2016; Agrawal and Menzies, 2018; Agrawal et al., 2019, 2018). Here, we use a technology called “differential evolution” (DE, explained below) to automatically tune our learners. In a result that endorses our use of this kind of hyperparameter optimization, we note that with DE, we achieve very low error rates (less than 10%).
3.1. Why Study Project Health?
In 2020, open-source projects dominate the software development landscape (Paasivaara et al., 2018; Santos et al., 2016; Hohl et al., 2018; Parnin et al., 2017). Over 80% of the software in any technology product or service is now open-source (Zemlin, 2017). With so many projects now being open-source, the natural next questions are “which of these projects are any good?” and “which should I avoid?”. In other words, we now need to assess the health of open-source projects before using them.
There are many business scenarios in which the predictions of this paper would be very useful. For example, many commercial companies use open-source packages in the products they sell to customers. For that purpose, commercial companies want to use packages that are predicted to stay healthy for some time to come. If the open-source community stops maintaining those packages, then those companies will be forced into maintaining open-source packages that they did not build and, hence, may not fully understand.
Another case where commercial organizations can use project health predictions is ecosystem package management. Red Hat is very interested in project health indicators that can be automatically applied to tens of thousands of projects. When Red Hat releases a new version of its systems, the 24,000+ software packages included in that distribution are delivered to tens of millions of machines around the world. Red Hat seeks automatic project health indicators that let it:
Decide what packages should not be included in the next distribution (due to falling health);
Detect, then repair, falling health in popular packages. For example, in 2019, Red Hat’s engineers noted that a particularly popular project was falling from favor with other developers since its regression test suite was not keeping up with current changes. With just a few thousand dollars, Red Hat used crowd sourced programmers to generate the tests that made the package viable again (Stewart, 2019).
Yet another use case where project health predictions would be useful is software staff management. Thousands of IBM developers maintain dozens of large open-source toolkits. IBM needs to know the expected workload within those projects, several months in advance (Krishna et al., 2018). Predictions such as those discussed in this paper can advise when there are too many developers working on one project, and not enough working on another. Using this information, IBM management can “juggle” that staff around multiple projects in order to match expected workload to the available staff. For example,
If a spike in the number of pull requests is expected in a few months, management might move extra staff over to that project a couple of months earlier (so that staff can learn the code base).
When handling the training of newcomers, it is unwise to drop novices into some high stress scenarios where too few programmers are struggling to handle a large work load with too few personnel.
It is also useful to know when the workload for a project is predicted to be stable or decreasing. In that use case, it can be advisable to move staff to other projects in order to:
Accommodate the requests of seasoned programmers who want to either (a) learn new technologies as part of their career development; or (b) alleviate boredom;
Resolve personnel conflict issues.
3.2. Who Studies Project Health?
For all the above reasons, numerous studies and organizations are exploring the health or development features of open-source projects. For example:
Jansen et al. introduce the OSEHO (Open Source Ecosystem Health Operationalization) framework, using productivity, robustness and niche creation to measure the health of a software ecosystem (Jansen, 2014).
Manikas et al. propose a logical framework for defining and measuring the software ecosystem health consisting of the health of three main components (actors, software and orchestration) (Manikas and Hansen, 2013).
A community named “CHAOSS” (Community Health Analytics for Open Source Software) develops metrics, methodologies, and software drawn from a wide range of open-source projects to help express open-source project health and sustainability (Foundation, 2020).
Borges et al. claim that the number of stars of a repository is a direct measure of its popularity. In their study, they use a multiple linear regression model to predict the number of stars and thereby estimate the popularity of GitHub repositories (Borges et al., 2016a).
Kikas et al. build random forest models to predict the issue close time of more than 4,000 GitHub projects, with multiple static, dynamic and contextual features. They report that the dynamic and contextual features are critical in such predicting tasks (Kikas et al., 2016).
Jarczyk et al. use generalized linear models to predict issue closure rate. Based on multiple features (stars, commits, issues closed by the team, etc.), they find that larger teams with more project members have lower issue closure rates than smaller teams, while increased work centralization improves issue closure rates (Jarczyk et al., 2018).
3.3. How to Study Project Health?
In March 2020, we explored the literature looking for how prior researchers have studied project health. Starting with venues listed at Google Scholar Metrics under “software systems” (https://scholar.google.com/citations?view_op=top_venues&hl=en&vq=eng_softwaresystems), we searched for highly cited or very recent papers discussing software analytics, project health, open-source systems and prediction on GitHub. We found:
In the past five years (2014 to 2019), there were at least 30 related papers.
10 of those papers examined at least one of the seven project health indicators we listed in our introduction (Liao et al., 2019; Borges et al., 2016a; Jarczyk et al., 2018; Kikas et al., 2016; Qi et al., 2017; Aggarwal et al., 2014; Chen et al., 2014; Han et al., 2019; Weber and Luo, 2014; Bidoki et al., 2018).
None of those papers explored all the indicators explored in our study.
As to the technology used in that sample of related papers, the preferred learner was usually one of the following:
LNR: linear regression, which fits the data to a parametric equation;
CART: decision tree learner for classification and regression;
RFT: random forest, which builds multiple regression trees, then reports the average conclusion across that forest;
KNN: k-nearest neighbors, which makes conclusions by averaging across nearby examples;
SVR: support vector regression, which takes the quadratic optimizer used in support vector machines and uses it to learn a parametric equation that predicts a numeric class.
Hence, for this study, we use the above learners as baseline methods. Their implementations are obtained from Scikit-Learn (Pedregosa et al., 2011). Unless adjusted by differential evolution (discussed below), all of these are run with the default off-the-shelf Scikit-Learn settings.
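For illustration, a minimal sketch of how those five baselines might be instantiated with their off-the-shelf defaults. The toy data is hypothetical, standing in for the monthly feature rows:

```python
# Sketch: the five baseline learners with off-the-shelf Scikit-Learn defaults.
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

learners = {
    "KNN": KNeighborsRegressor(),      # average over nearby examples
    "SVR": SVR(),                      # support vector regression
    "LNR": LinearRegression(),         # parametric linear fit
    "RFT": RandomForestRegressor(),    # ensemble of regression trees
    "CART": DecisionTreeRegressor(),   # single regression tree
}

X = [[i, i * 2] for i in range(10)]    # toy monthly features
y = [i * 3.0 for i in range(10)]       # toy target (e.g., next month's commits)
preds = {}
for name, model in learners.items():
    model.fit(X, y)
    preds[name] = float(model.predict([[4, 8]])[0])
```

Each learner is fit and queried through the same `fit`/`predict` interface, which is what makes swapping baselines in and out straightforward.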
Of the above related work, a study by Bao et al. from TSE’19 seems very close to our work (Bao et al., 2019). They explored multiple learning methods for their predicting tasks. Further, while the other papers used learners with their off-the-shelf settings, Bao et al. took care to tune the control parameters of their learners. Much recent research in SE reports that such tuning can significantly improve the performance of models used in software analytics (Tantithamthavorn et al., 2016; Fu et al., 2016; Menzies and Shen, 2016; Agrawal and Menzies, 2018; Agrawal et al., 2019, 2018). The “grid-search-like” method they used was a set of nested for loops that looped over the various control parameters of the learners (so a grid search for, say, three parameters would contain three nested for loops).
We considered conducting a study similar to Bao et al.’s, but decided to explore different aspects for several reasons:
Their data was not available to other researchers.
They explored one goal (predicting if a committer will be a long term contributor) while we want to see if it is possible to predict multiple project health indicators.
Grid search is not recommended by the data mining literature. Bergstra et al. warn that grid search suffers from the curse of dimensionality (Bergstra et al., 2011). That is, for any particular dataset and learner, the space of useful hyperparameters is a tiny fraction of the total space. A grid search that explores all the tuning options, at a granularity fine enough to accommodate all learners and datasets, can be very slow. Hence, (a) most grid search algorithms take “large steps” in their parameter search; and (b) those large steps may miss the most useful settings of a particular learner/dataset (Bergstra et al., 2011).
The weaker performance of grid search is not just a theoretical possibility. Experimental results show that grid search can miss important options and performs worse than very simple alternatives (Fu et al., 2016). Also, grid search can run needlessly slow since, often, only a few of the tuning hyperparameters really matter (Bergstra and Bengio, 2012).
Accordingly, for this paper, we search control hyperparameters for our learners using another hyperparameter optimizer called Differential Evolution (DE) (Storn and Price, 1997). We use DE since prior work found it fast and comparatively more effective than grid search for other kinds of software analytics problems (e.g., defect prediction (Fu et al., 2016; Menzies and Shen, 2016)). Also, DE has a long history of successful application in the optimization research area, dating back to 1997 (Storn and Price, 1997). For example, Google Scholar reports that the original DE paper now has 22,906 citations (as of May 5, 2020) and that the algorithm is still the focus of much on-going research (Das and Suganthan, 2010; Wu et al., 2018; Das et al., 2016). Further, as part of this study, we spent months benchmarking DE against several other hyperparameter optimizers published since 1997. We found that DE worked just as well as anything else, ran much faster, and its associated code base was much simpler to build and maintain.
The pseudocode of the DE algorithm is shown in Figure 1. The premise of that code is that the best way to mutate the existing tunings is to extrapolate between current solutions (stored in the frontier list). Three solutions a, b, c are selected at random from the frontier. For each tuning parameter i, at some probability cf, DE replaces the old tuning x_i with the new value y_i = a_i + f × (b_i − c_i), where f is a parameter controlling the differential weight.
The main loop of DE runs over the frontier of size np, replacing old items with new candidates (if the new candidate is better). This means that, as the loop progresses, the frontier contains increasingly more valuable solutions (which, in turn, helps extrapolation, since the next time we pick a, b, c, we get better candidates).
DE’s loops keep repeating till it runs out of lives. The number of lives is decremented for each loop (and incremented every time we find a better solution).
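The DE loop described above can be sketched as follows. This is an illustrative minimal implementation, not the paper's code; the parameter names (np_, f, cf, lives) and the max_gen cap are assumptions:

```python
import random

def de(objective, bounds, np_=20, f=0.75, cf=0.3, lives=10, max_gen=50):
    """Minimal differential evolution sketch. Minimizes `objective`
    over the box given by `bounds` = [(lo, hi), ...]."""
    dim = len(bounds)
    frontier = [[random.uniform(lo, hi) for lo, hi in bounds]
                for _ in range(np_)]
    scores = [objective(x) for x in frontier]
    gen = 0
    while lives > 0 and gen < max_gen:
        gen += 1
        improved = False
        for i in range(np_):
            # extrapolate between three other randomly chosen solutions
            a, b, c = random.sample(
                [x for j, x in enumerate(frontier) if j != i], 3)
            new = list(frontier[i])
            for d in range(dim):
                if random.random() < cf:      # crossover probability
                    lo, hi = bounds[d]
                    new[d] = min(hi, max(lo, a[d] + f * (b[d] - c[d])))
            s = objective(new)
            if s < scores[i]:                 # replace old item if better
                frontier[i], scores[i] = new, s
                improved = True
        # lose a life on a fruitless generation, regain one on improvement
        lives = lives + 1 if improved else lives - 1
    best = min(range(np_), key=scores.__getitem__)
    return frontier[best], scores[best]

random.seed(1)
best, score = de(lambda x: sum(v * v for v in x), [(-5, 5)] * 3)
```

The demo call minimizes a simple sphere function; in DECART the objective would instead be a learner's error on held-out data.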
Our initial experiments showed that of all these off-the-shelf learners, the CART regression tree learner performed best. Hence, we combine CART with differential evolution to create the DECART hyperparameter optimizer for CART regression trees. Taking advice from Storn and Fu et al. (Storn and Price, 1997; Menzies and Shen, 2016), we set DE’s configuration parameters following their recommendations. The CART hyperparameters we control via DE are shown in Table 1.
4.1. Data Collection
Kalliamvakou et al. warn that many repositories on GitHub are not suitable for software engineering research (Kalliamvakou et al., 2016). We follow their advice and apply related criteria (with the GitHub GraphQL API) for finding useful URLs of related projects (see Table 2). After that, to remove repositories with irrelevant topics such as “books”, “class projects” or “tutorial docs”, we create a dictionary of “suspicious words of irrelevancy” and remove URLs containing words in that dictionary (see Table 3). Applying the criteria of Table 2 and Table 3 left us with 1,628 projects. From these repositories, we extract features across 78,455 months of data.
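The "suspicious words" filtering step might be sketched as below; the word list shown is a small hypothetical subset, not the paper's actual dictionary:

```python
# Sketch: dropping repository URLs that mention a "suspicious word of
# irrelevancy". The word set here is illustrative only.
SUSPICIOUS = {"book", "tutorial", "course", "sample", "doc"}

def keep(url):
    """Keep a repository URL only if no suspicious word appears in it."""
    lowered = url.lower()
    return not any(word in lowered for word in SUSPICIOUS)

urls = [
    "https://github.com/foo/awesome-book",
    "https://github.com/bar/web-framework",
    "https://github.com/baz/ml-tutorial",
]
kept = [u for u in urls if keep(u)]   # only bar/web-framework survives
```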
Currently, there is no unique and consolidated definition of software project health (Jansen, 2014; Liao et al., 2019; Link and Germonprez, 2018). However, most researchers agree that healthy open-source projects need to be “vigorous” and “active” (Wahyudin et al., 2007; Jansen, 2014; Manikas and Hansen, 2013; Link and Germonprez, 2018; Wynn Jr, 2007; Crowston and Howison, 2006). As Han et al. note, popular open-source projects tend to be more active (Han et al., 2019). In our study, we select seven features as health indicators of open-source projects on GitHub: the numbers of commits, contributors, open pull-requests, closed pull-requests, open issues, closed issues and stars. The first six are important GitHub features that indicate the activity of a project, while the last is widely used as a symbol of a GitHub project’s popularity (Borges et al., 2016b; Han et al., 2019; Aggarwal et al., 2014).
All the features collected from each project in this study are listed in Table 4. These features were carefully selected because some of them were used by other researchers in related GitHub studies (Coelho et al., 2020; Yu et al., 2016; Han et al., 2019).
To get the latest and most accurate features of our selected repositories, we use the GitHub API v3 for feature collection. For each project, the first commit date is used as the starting date of the project. Then all the features are collected and calculated monthly from that date up to the present. For example, the first commit of the kotlin-native project was on May 16, 2016, so we collected its features from May 2016 to April 2020. Due to the GitHub API rate limit, we could not get some features, like “monthly_commits”, which require a large number of direct API calls. Instead, we cloned each repository locally and then extracted those features (this technique saved us much grief with API quotas). Table 5 shows a summary of the data collected using this method.
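A minimal sketch of the local-clone counting idea, assuming the commit timestamps have already been exported from the clone (e.g., via `git log --pretty=%aI`):

```python
from collections import Counter
from datetime import datetime

def monthly_commits(commit_dates):
    """Count commits per calendar month from ISO-8601 commit timestamps."""
    months = Counter()
    for stamp in commit_dates:
        d = datetime.fromisoformat(stamp)
        months[(d.year, d.month)] += 1
    return months

# Hypothetical timestamps standing in for a repository's git log output.
counts = monthly_commits([
    "2016-05-16T10:00:00", "2016-05-20T09:30:00", "2016-06-01T12:00:00",
])
```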
4.2. Performance Metrics
To evaluate the performance of the learners, we use two performance metrics to measure the prediction results of our experiments: Magnitude of the Relative Error (MRE) and Standardized Accuracy (SA). We use these since (a) they are advocated in the literature (C and MacDonell, 2012; Sarro et al., 2016); and (b) they both offer a way to compare results against some baseline (and such comparisons against baselines are considered good practice in empirical AI (Cohen, 1995)).
Our first evaluation metric, championed by Sarro et al. (Sarro et al., 2016), is the magnitude of the relative error, or MRE. MRE expresses the absolute residual (AR) as a ratio of the actual value, where AR is the difference between the predicted and actual values:

AR = |ACTUAL − PREDICTED|;  MRE = AR / ACTUAL
For MRE, when ACTUAL equals 0 the metric suffers a “divide by zero” error. To deal with this issue, when ACTUAL is 0 in the experiment, we set MRE to 0 if PREDICT is also 0, or to a value larger than 1 otherwise.
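The MRE computation, including the zero-ACTUAL rule above, can be sketched as:

```python
def mre(actual, predicted):
    """Magnitude of the relative error, with the zero-ACTUAL rule:
    when actual == 0, return 0 if predicted is also 0, and a value
    larger than 1 otherwise."""
    if actual == 0:
        return 0.0 if predicted == 0 else 2.0   # any value > 1 works here
    return abs(actual - predicted) / actual
```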
Sarro et al. (Sarro et al., 2016) favor MRE since, they argue, it is known that human expert performance on certain SE estimation tasks has an MRE of 30% (Molokken and Jorgensen, 2003b). That is to say, if an estimator achieves less than 30% MRE then it can be said to be competitive with human-level performance.
MRE has been criticized because of its bias towards error underestimation (Foss et al., 2003; Kitchenham et al., 2001; Korte and Port, 2008; Port and Korte, 2008; Shepperd et al., 2000; Stensrud et al., 2003). Shepperd et al. champion another evaluation measure called “standardized accuracy”, or SA (C and MacDonell, 2012). SA is computed as the ratio of the observed error against some reasonable fast-but-unsophisticated measurement; that is, SA expresses a sophisticated estimate as a ratio over a much simpler method. SA (Langdon et al., 2016; C and MacDonell, 2012) is based on the Mean Absolute Error (MAE):

MAE = (1/n) × Σ_{i=1..n} |ACTUAL_i − PREDICTED_i|

where n is the number of data points used for evaluating the performance. SA uses MAE as follows:

SA = (1 − MAE / MAE_guess) × 100%

where MAE_guess is the MAE of a large number (e.g., 1000 runs) of random guesses. Shepperd et al. observe that, over many runs, MAE_guess will converge on simply using the sample mean (C and MacDonell, 2012).
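MAE and SA, with the sample-mean baseline standing in for the converged random guessing, might be computed as:

```python
def mae(actuals, predictions):
    """Mean absolute error over n paired observations."""
    n = len(actuals)
    return sum(abs(a - p) for a, p in zip(actuals, predictions)) / n

def sa(actuals, predictions):
    """Standardized accuracy: (1 - MAE / MAE_guess) * 100, where the
    baseline guesses the sample mean for every case (the limit of many
    random guesses, per Shepperd et al.)."""
    mean = sum(actuals) / len(actuals)
    mae_guess = mae(actuals, [mean] * len(actuals))
    return (1 - mae(actuals, predictions) / mae_guess) * 100
```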
We find Shepperd et al.’s arguments for SA to be compelling. But we also agree with Sarro et al. that it is useful to compare estimates against some human-level baselines. Hence, for completeness, we apply both evaluation metrics. As shown below, both evaluation metrics will offer the same conclusion (that DECART’s performance is both useful and better than other methods for predicting project health indicators).
Note that in all our results: for MRE, smaller values are better, and the best possible performance result is 0. For SA, larger values are better, and the best possible performance result is 100%.
We report the median (50th percentile) and interquartile range (IQR=75th-25th percentile) of our methods’ performance.
To decide which methods do better than any other, we could not use distribution-based statistics (Kampenes et al., 2007; Arcuri and Briand, 2011; Mittas and Angelis, 2012) since, for each project, we are making one estimate about the April 2020 status of a project. Hence, we need statistical methods that ask if two measurements (from two different learners) are in different places across the same distribution (the space of performance measurements across all our learners). For this purpose, we take the advice of Rosenthal et al. (Rosenthal et al., 1994). They recommend parametric methods, rather than non-parametric ones, since the latter have less statistical power than parametric ones. Rosenthal et al. discuss different parametric methods for asserting that one result is within some small effect of another (i.e., it is “close to”). They list dozens of effect size tests that divide into two groups: the r family, based on the Pearson correlation coefficient; and the d family, based on absolute differences normalized by (e.g.) the size of the standard deviation. Since Rosenthal et al. comment that “none is intrinsically better than the other”, we choose the most direct method: we say that one result is the same as another if their difference is less than Cohen’s delta (a small effect size). Note that we compute the delta separately for each evaluation measure (SA and MRE).
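A sketch of this "same result" test; the 0.3 "small effect" multiplier and the pooled standard deviation shown are illustrative assumptions, not necessarily the paper's exact constants:

```python
import statistics

def indistinguishable(x, y, pooled_sd, small=0.3):
    """Treat two results as 'the same' when their difference is within a
    small effect: |x - y| < small * sd, with sd taken over all performance
    scores for this metric. 0.3 is Cohen's conventional 'small' effect."""
    return abs(x - y) < small * pooled_sd

scores = [0.07, 0.08, 0.09, 0.30, 0.45, 0.50]   # hypothetical MREs
sd = statistics.stdev(scores)
```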
5.1. Can we predict trends in project health indicators? (RQ1)
We predict the value of health indicators for April 2020 by using data up until March 2020. That is, if a project is 60 months long (on April 2020), we predict for April 2020 using all data from its creation up until March 2020 (first 59 months). The median and IQR values of performance results in terms of MRE and SA are shown in Table 6, Table 7, Table 8, and Table 9, respectively.
In all these four tables, we show median and IQR of performance results across 1,628 projects, using all but the last month to make predictions for April 2020. For MRE, lower values are better. Gray cells denote better results; For SA, higher values are better. In all these tables, for each row, the best learning scheme has the darkest background.
In these results, we observe that our methods perform very differently across the seven health indicators. In Table 6, we see that some learners have errors over 130% (LNR, predicting the number of commits). For the same task, other learners have only around half that error (CART, 67%). Also in that table, the median MRE score of the untuned learners (KNN, LNR, SVR, RFT, CART) is over 50%; that is, these estimates are often wrong by a factor of two or more. Another thing to observe is that untuned CART usually has lower MRE and higher SA values than the other four untuned learners (5/7 in MRE, 4/7 in SA). Hence, we elected to use DE to tune CART. These tables also show that hyperparameter optimization is beneficial: the DECART columns of Table 6 and Table 8 show that this method has much better median SAs and MREs than the untuned methods. As shown in the last column of Table 6, the median error for DECART is under 10% (to be precise, 7%). The results of Table 7 and Table 9 also demonstrate the stability of DECART (it has the lowest IQR when measuring the performance variability of all methods).
Turning now to other prediction results, our next set of results shows what happens when we make predictions over 1-, 3-, 6-, and 12-month intervals. Note that to simulate predicting the status k = 1, 3, 6, or 12 months ahead, for a project with T months of data, we must train on data collected from month 1 to month T − k. That is to say, the further ahead our predictions, the less data we have for training. Hence, one thing to watch for is whether performance decreases as the size of the training set decreases.
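The training-window construction described above can be sketched as follows (the row labels are placeholders, each standing in for one month of features):

```python
def train_window(monthly_rows, k):
    """For a project with T months of data, predicting k months ahead
    means training only on months 1 .. T-k."""
    assert 1 <= k < len(monthly_rows)
    return monthly_rows[:len(monthly_rows) - k]

rows = [f"month-{m}" for m in range(1, 61)]   # a 60-month project
windows = {k: train_window(rows, k) for k in (1, 3, 6, 12)}
```

Note how the 12-month-ahead setting trains on the least data, which is why the text watches for performance degrading with smaller training sets.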
Table 10 presents the MRE and SA results of DECART, expressed as a ratio of the results seen after predicting one month ahead. Observing the median results (shown in gray) from left to right across the table, we see that as we try to predict further and further into the future, (a) SA degrades only slightly, by about 5%, and (b) MRE degrades by only around 33% or less. Measured in absolute terms, this change is very small: recall that the median DECART MRE results in Table 6 for one-month-ahead predictions were less than 10%. This means that when Table 10 says the median MRE for the 12-month predictions is worse by 133%, that translates to an absolute MRE of roughly 1.33 × 10% ≈ 13% (which is still very low).
In any case, summarizing all the above, we say that:
Answer 1: Many project health indicators can be predicted, with good accuracy, for 1, 3, 6, and 12 months into the future.
The only counter-result to Answer 1 arises when trying to predict the number of open issues. Table 6 and Table 8 show that DECART’s worst MRE and SA predictions are for the “openISSUE” health indicator. Additionally, in Table 8, all the SA predictions for openISSUE are negative; i.e., we perform very badly indeed when trying to predict how many issues will remain open next month. In retrospect, of course, we should have expected that predicting how many new challenges will arise next month (in the form of new issues) is an inherently hard task.
5.2. What features matter the most in prediction? (RQ2)
In our experimental data, we have 12 numeric features for prediction. We use them because prior work suggests they are features with high importance (see Section 4.1). That said, having done all these experiments, it is appropriate to ask which features, in practice, turned out to be most useful when predicting health indicators. This information could help us focus on useful features and remove irrelevancies when extending this research in future work. To work that out, we examine the trees generated by DECART (our best learner) in the above experiments and count the number of times each feature was used to predict each health indicator.
Those counts are summarized in Table 11. In this table, “n/a” denotes the dependent variable, which is not counted in the experiment. From this table, first of all, we find that some features are highly related to specific health indicators. For example, “fork”, “ISSUEcomment” and “commit” were selected frequently when we built trees to predict the “star” indicator. Secondly, some features are bellwethers that are used when predicting multiple indicators: “commit” appears often as a feature when predicting the “contributor”, “star” and “closeISSUE” indicators, and “ISSUEcomment” shows a similar pattern for “star”, “openISSUE” and “closeISSUE”. Thirdly, some features are rarely picked up by the learners even when they belong to the same group as the predicted indicator (e.g. “openISSUE” vs. “closeISSUE”): in our experiment, “openISSUE” was selected far less often than “ISSUEcomment”, “star” and “commit” when predicting the “closeISSUE” indicator. Last but not least, some features were used less than others. According to our experiment, “PRmerger” is the least used feature across all predictions (its median use-percentage is only 49%).
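The per-feature counts above could be gathered from fitted CART trees as in the sketch below. This assumes scikit-learn's `DecisionTreeRegressor`, whose `tree_.feature` array records the splitting feature of each node (leaves are marked with a negative index); the feature names and toy data are illustrative, not the paper's:

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeRegressor

def features_used(tree, names):
    """Count how often each named feature appears as a split in a fitted CART."""
    idx = tree.tree_.feature                 # negative entries mark leaf nodes
    return Counter(names[i] for i in idx if i >= 0)

# toy data: 60 months of 3 features; the target is driven by feature 0
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=60)

cart = DecisionTreeRegressor(max_depth=3, random_state=1).fit(X, y)
counts = features_used(cart, ["fork", "commit", "star"])
```

Repeating this over every tree built for every (project, indicator) pair yields a usage table like Table 11.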
Answer 2: In our study, “monthly_ISSUEcomments”, “monthly_commit”, “monthly_fork” and “monthly_star” are the most important features, while “monthly_PRmerger” is the least used feature for all seven health indicators’ predictions.
Note that none of these 12 features should be abandoned, not even “PRmerger”, the least used feature overall (when predicting “star”, this feature is still used in 60% of cases).
That said, we would be hard pressed to say that Table 11 indicates that only a small subset of the Table 4 features is outstandingly important. While Table 11 suggests that some feature pruning might be useful, overall we suggest that using all of these features is likely the best policy in most cases.
5.3. Which methods achieve the best prediction performance? (RQ3)
To answer this question, we compared the performance results of each method on all 1,628 open-source projects, predicting 1, 3, 6, and 12 months into the future.
Across 1,628 projects, we report the percentage of times each learner generated the best or nearly best predictions (the darker the cell, the higher that percentage). To compute “nearly best”, we used the Cohen’s d measure introduced in Section 4.3 to compare the different learning schemes in terms of MRE and SA, in Table 12 and Table 13, respectively.
The comparisons in these tables are intra-row, where darker cells indicate the learning methods with higher win rates. For example, in the first data row of Table 12, when predicting the number of commits in the next month, DECART has the best MRE performance in 86% of all 1,628 cases.
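This win-rate bookkeeping could be sketched as follows. The 0.35 small-effect cutoff and the toy error values are assumptions for illustration; the paper's exact Cohen-style threshold is defined in Section 4.3:

```python
import numpy as np

def win_rates(errors):
    """errors: dict of learner name -> array of per-project MREs.
    A learner 'wins' on a project when its error is within a small
    effect size (here, 0.35 * pooled std, a Cohen-style cutoff) of
    the best error seen on that project."""
    names = list(errors)
    mat = np.array([errors[n] for n in names])   # (learners, projects)
    eps = 0.35 * mat.std()                        # assumed small-effect margin
    best = mat.min(axis=0)                        # best error per project
    wins = (mat <= best + eps).mean(axis=1)       # fraction of projects won
    return dict(zip(names, (wins * 100).round(1)))

rates = win_rates({
    "CART":   np.array([0.60, 0.70, 0.65, 0.80]),
    "DECART": np.array([0.05, 0.08, 0.06, 0.90]),
})
```

In this toy example DECART is best or nearly best on every project, while CART only ties on the one project where all errors are high.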
As shown in Table 12, in terms of MRE, DECART achieves the best performance on all predictions (the median win rate is 91%). The winning rates of the other learners (KNN, LNR, SVR, RFT, and CART) are mostly far lower, with the exception of openISSUE, where the other methods have win rates close to DECART’s since no method can predict it well. That said, our proposed method, DECART, outperforms the other methods on almost all the predictions across the 1,628 projects.
For the SA results, as we see in Table 13, although the median win rate of DECART (72%) decreased somewhat compared to MRE (91%), it still outperforms all the other methods (the closest runner-up, CART, only reaches 44%). Most of the time, the win rates of the other methods (KNN, LNR, SVR, RFT, and CART) are far lower across the four prediction horizons on the 1,628 projects. Taking a further look, SVR performs relatively worst, with a median win rate well below DECART’s.
Based on the results from our experiments, we conclude that:
Answer 3: DECART generates better prediction performance than the other methods in 91% of our 1,628 projects (median, measured in MRE).
(Table omitted; columns: Predicting Month, Health Indicator, KNN, LNR, SVR, RF, CART, DECART.)
6.1. The efficiency of DECART
DECART is not only effective (as shown in Table 12 and Table 13) but also very fast. In our study, it took 11,530 seconds to run DECART on all 1,628 projects (on a dual-core 4.67 GHz laptop), i.e., about 7 seconds per dataset. This time includes optimizing CART for each specific dataset and then making predictions. Note that, for these experiments, we made no use of any special hardware (neither GPUs nor cloud services that interleave multiple cores in some clever manner).
The speed of DECART is an important finding. In our experience, the complexity of hyperparameter optimization is a major concern that limits its widespread use. For example, Fu et al. report that hyperparameter optimization for code defect reduction requires nearly three days of CPU per dataset (Menzies and Shen, 2016). If all of our 1,600+ datasets needed the same amount of CPU, then that would be a major deterrent to the use of the methods of this paper.
But why is DECART so fast and effective? Firstly, DECART runs fast because it works on very small datasets. This paper studies three to five years of project data; for each month, we extract the 12 features shown in Table 4. That is to say, DECART’s optimizations only have to explore datasets with a few dozen data points per project. Fu et al., on the other hand, worked on more than 100,000 data points.
Secondly, as to why DECART is so effective, we note that many data mining algorithms rely on statistical properties that only emerge in large samples of data (Witten et al., 2011). Hence they have problems reasoning about datasets with so few data points. Accordingly, to enable effective data mining, it is important to adjust the learners to the idiosyncrasies of the dataset (via hyperparameter optimization).
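The DE-tunes-CART idea could be sketched with SciPy's stock differential evolution, as below. This sketch tunes only two CART hyperparameters over synthetic data; the paper's DECART (Storn's DE variant, its exact parameter space and budget) may differ:

```python
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-in for one project: ~4 years of monthly data, 12 features
rng = np.random.default_rng(2)
X = rng.normal(size=(48, 12))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(scale=0.2, size=48)

def loss(params):
    """Cross-validated error of a CART with the candidate hyperparameters."""
    depth, leaf = int(round(params[0])), int(round(params[1]))
    cart = DecisionTreeRegressor(max_depth=depth, min_samples_leaf=leaf,
                                 random_state=1)
    # negate the (negative) MAE score so lower loss means a better fit
    return -cross_val_score(cart, X, y, cv=3,
                            scoring="neg_mean_absolute_error").mean()

result = differential_evolution(loss, bounds=[(1, 12), (1, 10)],
                                maxiter=10, seed=1, tol=0.01)
best_depth, best_leaf = (int(round(v)) for v in result.x)
```

Because each candidate evaluation fits a tree on only a few dozen rows, the whole search stays within seconds per dataset.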
(Table omitted; columns: Predicting Month, Health Indicator, KNN, LNR, SVR, RF, CART, DECART.)
6.2. DECART on other time predictions
We observe in the performance results of Table 6 that, when predicting the number of closed pull requests, CART and DECART achieve a 0% error. Such zero error is a red flag that needs to be investigated, since it might be due to a programming error (such as using the test value as both the predicted and the actual value in the MRE calculation). What we found instead was that the older the project, the less the programmer activity. Hence, it is hardly surprising that good learners could correctly predict (e.g.) zero closed pull requests.
But that raised another red flag: suppose all our projects had reached some steady state prior to April 2020. In that case, predicting (say) next month’s health would be a simple matter of repeating last month’s value. We have three reasons for believing that this is not the case. Firstly, prediction in this domain is difficult: if such a steady state had been achieved, then all our learners would report very low errors, and as seen in Table 6, this is not the case.
Secondly, we inspected the columns of our raw data, looking for long sequences of stable or zero values. Such sequences are rare: our data contains much variation across the entire lifecycle of our projects.
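That column inspection amounts to scanning each monthly series for its longest run of unchanging values; a minimal sketch (the run definition and tolerance are our assumptions, not the paper's exact procedure):

```python
def longest_stable_run(col, tol=0):
    """Length of the longest run of consecutive months whose value
    changes by at most `tol` from the previous month."""
    run = best = 1
    for prev, cur in zip(col, col[1:]):
        run = run + 1 if abs(cur - prev) <= tol else 1
        best = max(best, run)
    return best

steady = [5, 5, 5, 5, 5, 5]   # a column from a steady-state project
active = [3, 9, 1, 7, 2, 8]   # a column from a project still varying
```

A project whose columns all score near their full length would be in steady state; most of our columns do not.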
Thirdly, just to be sure, we conducted another round of experiments. Instead of predicting for April 2020, we predicted for April 2019 using data collected prior to April 2018. Table 14 shows the results: for each project, we trained on the data up to April 2018 and used DECART to predict 12 months into the future (April 2019). The columns of Table 14 should be compared to the right-hand-side columns of Table 6, Table 7, Table 8, and Table 9. In that comparison, we see that predicting for April 2019 from the older data generates results comparable to predicting for April 2020 using all the available data.
In summary, our results are not unduly biased by predicting just for April 2020: as evidence, we still obtain accurate results when predicting for April 2019 using data from before April 2018.
7. Threats to validity
The design of this study may be subject to several validity threats (Feldt and Magazinius, 2010). The following issues should be considered to avoid jeopardizing the conclusions drawn from this work.
Parameter Bias: The settings of the control hyperparameters of the prediction methods can have a significant effect on the efficacy of the prediction. By using a hyperparameter-optimized method in our experiment, we explore the space of possible hyperparameters for the predictor; hence we assert that this study suffers less parameter bias than some other studies.
Metric Bias: We use Magnitude of the Relative Error (MRE) as one of the performance metrics in the experiment. However, MRE is criticized for its bias towards error underestimation (Foss et al., 2003; Kitchenham et al., 2001; Korte and Port, 2008; Port and Korte, 2008; Shepperd et al., 2000; Stensrud et al., 2003). Specifically, when the benchmark error is small or equal to zero, the relative error can become extremely large or infinite, which may lead to an undefined mean or at least a distortion of the result (Chen et al., 2017). In our study, we do not abandon MRE, since there exist known baselines for human performance in effort estimation expressed in terms of MRE (Molokken and Jorgensen, 2003a). To mitigate this limitation, we apply a customized MRE treatment to handle the “divide by zero” issue and also report Standardized Accuracy (SA) as a second performance measure.
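For concreteness, the two metrics could be sketched as below. The zero-denominator substitute (`eps`) is one assumed form of the "customized treatment", and the SA baseline here is the common random-guessing baseline; the paper's exact definitions are given in Section 4.2:

```python
import numpy as np

def mre(actual, predicted, eps=1.0):
    """Magnitude of relative error; zero denominators are replaced by
    `eps` (an assumed stand-in for the paper's divide-by-zero treatment)."""
    actual = np.asarray(actual, float)
    predicted = np.asarray(predicted, float)
    denom = np.where(actual == 0, eps, np.abs(actual))
    return np.abs(predicted - actual) / denom

def sa(actual, predicted, n_guesses=1000, seed=0):
    """Standardized Accuracy: 1 - MAE / MAE of a random-guessing baseline
    that predicts values sampled from the actuals."""
    rng = np.random.default_rng(seed)
    actual = np.asarray(actual, float)
    mae = np.abs(np.asarray(predicted, float) - actual).mean()
    guesses = rng.choice(actual, size=(n_guesses, len(actual)))
    mae_baseline = np.abs(guesses - actual).mean()
    return 1 - mae / mae_baseline
```

Under these definitions, lower MRE is better, while SA approaches 1 for a perfect predictor and goes negative for a predictor worse than random guessing.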
Sampling Bias: In our study, we collect 78,455 months of data (12 features per month) from 1,628 GitHub projects, and we use 7 GitHub development features as health indicators of open-source projects. While we achieve good prediction performance on these data, it would be inappropriate to conclude that our technique always achieves positive results on open-source projects, or that the health indicators we use completely determine a project’s health status. To mitigate this problem, we release a replication package of our entire experiment to help the research community reproduce, improve, or refute our results on broader data and indicators.
8. Conclusion and Future Work
Our results make a compelling case for open-source development. Companies that only build in-house proprietary products may be cutting themselves off from the information needed to reason about those projects. Software developed on public platforms is a source of data that can be used to make accurate predictions about those projects. While the activity of a single developer may be random and hard to predict, when large groups of developers work together on software projects, the resulting behavior can be predicted with good accuracy. For example, after building predictors for seven project health indicators, we can assert that usually (for 6/7 indicators) we can make predictions with less than 10% error (median values).
Our results come with some caveats. Some human activity is too random for the law of large numbers to apply; we know this because we cannot predict everything with high accuracy. For example, while we can predict how many issues will be closed, we were unsuccessful in building good predictors for how many will remain open. Also, to make predictions, we must take care to tune the data mining algorithms to the idiosyncrasies of the datasets. Some data mining algorithms rely on statistical properties that only emerge in large samples of data; hence, such algorithms may have problems reasoning about very small datasets, such as those studied here. Before making predictions, it is therefore vitally important to adjust the learners to the idiosyncrasies of the dataset via hyperparameter optimization. Unlike prior hyperparameter optimization work (Menzies and Shen, 2016), our optimization process is very fast (seven seconds per dataset). Accordingly, we assert that for predicting software project health, hyperparameter optimization is the preferred technology.
As to future work, there is still much to do. Firstly, we know many organizations, such as IBM, run large in-house ecosystems where, behind firewalls, thousands of programmers build software using a private GitHub system. It would be insightful to see if our techniques work for such “private” GitHub networks. Secondly, our results are good but not perfect. Table 6 shows that while our median results are good, some prediction tasks are harder than others (e.g. open issues, commits, and stars). Also, Table 8 shows that further improvements are possible. The DE algorithm used in this paper is essentially Storn’s 1997 version, and there are many more recent variants of that algorithm that could be useful (Wu et al., 2018; Das et al., 2016). Another thing to try here might be deep learning. Normally we would not recommend slow algorithms like deep neural networks for reasoning over 1,600+ projects. But since our datasets are relatively small, there might be ways to shortcut the usual learning cycle. For example, suppose we found that our 1,600+ projects cluster into (say) just a handful of different project types. In that case, the targets for the deep learning models could be very small and fast to process.
Lastly, the GitHub project health literature offers many more targets for this kind of reasoning (e.g. the programmer assessment metrics used by Bao et al. (2019)). Our results indicate that the law of large numbers may apply to GitHub. If so, then there should be many more things we can readily predict about open-source projects (not just the targets listed in Table 4).
This study was partially funded by a National Science Foundation Grant #1703487.
- Co-evolution of project documentation and popularity within GitHub. In Proceedings of the 11th Working Conference on Mining Software Repositories, pp. 360–363.
- How to “dodge” complex software analytics. IEEE Transactions on Software Engineering.
- Better software analytics via “DUO”: data mining algorithms using/used-by optimizers. arXiv preprint arXiv:1812.01550.
- Is “better data” better than “better data miners”? In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE), pp. 1050–1061.
- Working for free? Motivations for participating in open-source projects. International Journal of Electronic Commerce 6 (3), pp. 25–39.
- A practical guide for using statistical tests to assess randomized algorithms in software engineering. In 2011 33rd International Conference on Software Engineering (ICSE), pp. 1–10.
- Foundation.
- A large scale study of long-time contributor prediction for GitHub projects. IEEE Transactions on Software Engineering.
- Random search for hyper-parameter optimization. Journal of Machine Learning Research 13 (1), pp. 281–305.
- Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pp. 2546–2554.
- A cross-repository model for predicting popularity in GitHub. In 2018 International Conference on Computational Science and Computational Intelligence (CSCI), pp. 1248–1253.
- Cost estimation with COCOMO II. Upper Saddle River, NJ: Prentice-Hall.
- Predicting the popularity of GitHub repositories. In Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, pp. 1–10.
- Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 334–344.
- Evaluating prediction systems in software project estimation. Information and Software Technology 54 (8), pp. 820–827.
- A new accuracy measure based on bounded relative error for time series forecasting. PLoS ONE 12 (3).
- Predicting the number of forks for open source software project. In Proceedings of the 2014 3rd International Workshop on Evidential Assessment of Software Technologies, pp. 40–47.
- Is this GitHub project maintained? Measuring the level of maintenance activity of open-source projects. Information and Software Technology 122, p. 106274.
- Empirical Methods for Artificial Intelligence. MIT Press, Cambridge, MA, USA.
- Assessing the health of open source communities. Computer 39 (5), pp. 89–91.
- Recent advances in differential evolution – an updated survey. Swarm and Evolutionary Computation 27, pp. 1–30.
- Differential evolution: a survey of the state-of-the-art. IEEE Transactions on Evolutionary Computation 15 (1), pp. 4–31.
- Validity threats in empirical software engineering research – an initial survey. In SEKE, pp. 374–379.
- A simulation study of the model evaluation criterion MMRE. IEEE Transactions on Software Engineering 29 (11), pp. 985–995.
- Community Health Analytics Open Source Software (CHAOSS). https://chaoss.community/.
- Why is differential evolution better than grid search for tuning defect predictors? arXiv preprint arXiv:1609.02613.
- Characterization and prediction of popular projects on GitHub. In 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC), Vol. 1, pp. 21–26.
- An assessment model to foster the adoption of agile software product lines in the automotive domain. In 2018 IEEE International Conference on Engineering, Technology and Innovation (ICE/ITMC), pp. 1–9.
- Measuring the health of open source software ecosystems: beyond the scope of project health. Information and Software Technology 56 (11), pp. 1508–1519.
- Surgical teams on GitHub: modeling performance of GitHub project development processes. Information and Software Technology 100, pp. 32–46.
- An in-depth study of the promises and perils of mining GitHub. Empirical Software Engineering 21 (5), pp. 2035–2071.
- A systematic review of effect size in software engineering experiments. Information and Software Technology 49 (11-12), pp. 1073–1086.
- Using dynamic and contextual features to predict issue lifetime in GitHub projects. In 2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR), pp. 291–302.
- What accuracy statistics really measure. IEE Proceedings – Software 148 (3), pp. 81–85.
- Confidence in software cost estimation results based on MMRE and PRED. In PROMISE’08, pp. 63–70.
- What is the connection between issues, bugs, and enhancements? In 2018 IEEE/ACM 40th International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP), pp. 306–315.
- Exact mean absolute error of baseline predictor, MARP0. Information and Software Technology 73, pp. 16–18.
- Healthy or not: a way to predict ecosystem health in GitHub. Symmetry 11 (2), p. 144.
- Assessing open source project health.
- Differences between traditional and open source development activities. In International Conference on Product Focused Software Process Improvement, pp. 131–144.
- Reviewing the health of software ecosystems – a conceptual framework proposal. In Proceedings of the 5th International Workshop on Software Ecosystems (IWSECO), pp. 33–44.
- Tuning for software analytics: is it really necessary? Information and Software Technology 76, pp. 135–146.
- Ranking and clustering software cost estimation models through a multiple comparisons algorithm. IEEE Transactions on Software Engineering 39 (4), pp. 537–551.
- A review of software surveys on software effort estimation. In 2003 International Symposium on Empirical Software Engineering (ISESE 2003), pp. 223–230.
- A review of software surveys on software effort estimation. In Empirical Software Engineering, 2003. ISESE 2003. Proceedings. 2003 International Symposium on, pp. 223–230.
- Operational profiles in software-reliability engineering. IEEE Software 10 (2), pp. 14–32.
- Large-scale agile transformation at Ericsson: a case study. Empirical Software Engineering 23 (5), pp. 2550–2596.
- The top 10 adages in continuous deployment. IEEE Software 34 (3), pp. 86–95.
- Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
- Comparative studies of the model evaluation criterion MMRE and PRED in software cost estimation research. In ESEM’08, pp. 51–60.
- Software effort estimation based on open source projects: case study of GitHub. Information and Software Technology 92, pp. 145–157.
- Parametric measures of effect size. The Handbook of Research Synthesis 621 (2), pp. 231–244.
- Investigating the adoption of agile practices in mobile application development. In ICEIS (1), pp. 490–497.
- Multi-objective software effort estimation. In ICSE, pp. 619–630.
- On building prediction systems for software engineers. Empirical Software Engineering 5 (3), pp. 175–182.
- Can traditional fault prediction models be used for vulnerability prediction? Empirical Software Engineering 18 (1), pp. 25–59.
- A further empirical investigation of the relationship of MRE and project size. Empirical Software Engineering 8 (2), pp. 139–161.
- Personal communication.
- Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization 11 (4), pp. 341–359.
- Automated parameter optimization of classification techniques for defect prediction models. In Proceedings of the 38th International Conference on Software Engineering, pp. 321–332.
- Monitoring the “health” status of open source web-engineering projects. International Journal of Web Information Systems.
- Who will become a long-term contributor? A prediction model based on the early phase behaviors. In Proceedings of the Tenth Asia-Pacific Symposium on Internetware, pp. 1–10.
- What makes an open source code popular on GitHub? In 2014 IEEE International Conference on Data Mining Workshop, pp. 851–855.
- Data Mining: Practical Machine Learning Tools and Techniques. 3rd edition, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
- Ensemble of differential evolution variants. Information Sciences 423, pp. 172–186.
- Assessing the health of an open source ecosystem. In Emerging Free and Open Source Software Practices, pp. 238–258.
- Reviewer recommendation for pull-requests in GitHub: what can we learn from code review and bug assignment? Information and Software Technology 74, pp. 204–218.
- If you can’t measure it, you can’t improve it. https://www.linux.com/news/if-you-cant-measure-it-you-cant-improve-it-chaoss-project-creates-tools-analyze-software/.