The Early Bird Catches the Worm: Better Early Life Cycle Defect Predictors

by N. C. Shrikanth, et al.

Before researchers rush to reason across all available data, they should first check whether the information is densest within some small region. We say this because, in 240 GitHub projects, we find that the information in the data "clumps" towards the earliest parts of the project. In fact, a defect prediction model learned from just the first 150 commits works as well as, or better than, state-of-the-art alternatives. Using just this early life cycle data, we can build models very quickly (using weeks, not months, of CPU time). We can also find simple models (with just two features) that generalize to hundreds of software projects. Based on this experience, we warn that prior work on generalizing software engineering defect prediction models may have needlessly complicated an inherently simple process. Further, prior work that focused on later life cycle data now needs to be revisited, since its conclusions were drawn from relatively uninformative regions. Replication note: all our data and scripts are online at
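To make the early-data idea concrete, the following is a minimal sketch of training only on a project's first 150 commits and evaluating on everything that comes later. It is not the authors' pipeline: the `Commit` record, its two unnamed numeric features, and the nearest-centroid learner are all assumptions chosen to keep the example self-contained, and the abstract does not say which two features or which learner the paper uses.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


@dataclass
class Commit:
    """Hypothetical commit record; field names are illustrative only."""
    timestamp: int      # commit time, e.g. seconds since epoch
    features: Point     # two numeric features (which ones is an assumption)
    defective: bool     # label: was this commit later linked to a defect?


def early_split(commits: Sequence[Commit], n_early: int = 150):
    """Order commits by time; return (first n_early, all later commits)."""
    ordered = sorted(commits, key=lambda c: c.timestamp)
    return ordered[:n_early], ordered[n_early:]


def _centroid(rows: List[Point]) -> Point:
    n = len(rows)
    return (sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n)


def fit(train: Sequence[Commit]) -> Tuple[Point, Point]:
    """Stand-in learner: one centroid per class (clean, defective)."""
    clean = [c.features for c in train if not c.defective]
    buggy = [c.features for c in train if c.defective]
    if not clean or not buggy:
        raise ValueError("early sample must contain both classes")
    return _centroid(clean), _centroid(buggy)


def predict(model: Tuple[Point, Point], features: Point) -> bool:
    """Predict 'defective' when the defective centroid is strictly closer."""
    clean_c, buggy_c = model
    d_clean = sum((a - b) ** 2 for a, b in zip(features, clean_c))
    d_buggy = sum((a - b) ** 2 for a, b in zip(features, buggy_c))
    return d_buggy < d_clean


def accuracy(model: Tuple[Point, Point], commits: Sequence[Commit]) -> float:
    """Fraction of commits whose predicted label matches the true label."""
    hits = sum(predict(model, c.features) == c.defective for c in commits)
    return hits / len(commits)
```

A typical use would be `early, later = early_split(commits)`, then `model = fit(early)` and `accuracy(model, later)`. The point of the split is that nothing after the 150th commit ever reaches the learner, so any score on `later` reflects what early life cycle data alone can predict.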




