Automated builds are integral to the Continuous Integration (CI) software development practice. As developers check in code to the shared repository, an automated system picks up the changes and triggers a build. The automated build will compile the code, and ideally, run a test suite. Build results notify developers about integration problems like compilation errors or missing dependencies. When combined with unit tests, build results can reveal broken or changed functionality in a software project.
In CI, developers are encouraged to commit their changes early and often. Changes that are smaller and more regularly integrated are easier to debug when something breaks. Build times therefore matter a great deal when integrations are frequent: long build times can become a bottleneck in the CI process.
Long build times are problematic for developers. Developers can lose focus and productivity while waiting for a build to finish. For example, developers may work on specific tasks in separate branches using a version control system. Should a build fail, it would be cumbersome to switch back to the original branch due to factors such as caching and configuration files. Additionally, developers incur a context-switching cost when changing between tasks. This cost can be low when the tasks are simple. Conversely, when the tasks are complex, context switching can be a costly expenditure of mental energy on the part of the developer. Therefore, it may be easier to stay in the current branch and wait for the build to finish before moving on to a future task or continuing with the current one.
We are interested in finding a balance between integrating often and keeping developers productive. Our research goal is simple: to build a model that predicts the build time of a job. Our project takes advantage of TravisTorrent, a freely available dataset combining features from GitHub and Travis CI for builds of more than 1,000 projects.
II Related Work
Mokhov et al. noted that build systems face issues when projects have a large number of components, multiple languages, and complex interdependencies. For example, build systems may have executables that depend on libraries or running tools that are generated by the same build system. As well, build systems can be hard to maintain when build environments are not managed. Employing tools such as the Nix package manager can help describe package build actions and their dependencies, allowing the build environments to be produced automatically.
The effects of long build times can be negative. Brooks noted that long build times can affect the following variables: commit size, commit frequency, build down time, development flow, and developer satisfaction. Negative perceptions about waiting times can be lowered by providing feedback, controlling perceived waiting time, and having different waiting times for different tasks.
As a solution to slow builds, Ammons suggested breaking large, all-or-nothing builds into many smaller builds. To demonstrate the technique, Ammons implemented a tool suite called Grexmk. Grexmk contains tools for dividing large builds into mini-builds and tools for executing mini-builds in parallel and incrementally. A mini-build is an all-or-nothing build that explicitly lists its output files, source files, dependencies on other mini-builds, and build script. Overall, incremental builds were sped up by a factor of 1.2 when using Grexmk.
Brooks suggested that a build time of 2 minutes was optimal; build times under 10 minutes were considered acceptable. However, the suggested build times were based on experience reports from the same company. In summary, there is a lack of empirical quantitative research to address optimal build times in a CI environment.
We selected TravisTorrent as the dataset for our prediction task. TravisTorrent is a synthesis of pull-request commits from software projects hosted on GitHub that use Travis CI as their continuous integration mechanism. For each build job in a project, the dataset draws features from GitHub as well as the corresponding build features from Travis CI. The data contains over 2 million records spanning more than 1,000 projects.
Our paper uses this information to investigate the factors that affect the build time of a build job in a pull-request CI development ecosystem. As well, we will build a predictive model to estimate the build time of a particular build job given that a specified set of build job features is known.
III-B Response Feature
TravisTorrent has a feature called tr_duration, which records the overall duration of the build in seconds. This includes the time taken for the Travis build to start, the time taken to run the tests, and the time taken to run the build itself. We selected tr_duration as the response variable for prediction because it estimates the total time the developer is idle while waiting for the build to complete.
III-C Initial Data Preparation
TravisTorrent contains 2,640,825 build records with 56 features. We first carried out some rudimentary data preparation to get the data ready for analysis. The records were shuffled to eliminate any ordering in the data. Next, we removed the records with missing values (i.e., NAs) for the overall build duration (tr_duration) feature. We then split the data 70% to 30% into a training/CV set and a test set; the test set is later used to evaluate how well the model generalizes to unseen sample points. The training/CV set contains 1,846,396 records, while the test set contains 791,310 records.
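The preparation steps above can be sketched in a few lines. This is only an illustration in Python (the study itself used R); the function name `prepare_split`, the `id` field, the seed, and the toy records are assumptions made for the example and are not taken from the original code.

```python
import random

def prepare_split(records, response="tr_duration", train_fraction=0.7, seed=42):
    """Shuffle records, drop rows with a missing response, then split them."""
    rng = random.Random(seed)      # fixed seed keeps the split reproducible
    shuffled = list(records)       # copy so the caller's list is not reordered
    rng.shuffle(shuffled)          # remove any ordering present in the data
    # None stands in for R's NA in this sketch.
    complete = [r for r in shuffled if r.get(response) is not None]
    cut = round(len(complete) * train_fraction)
    return complete[:cut], complete[cut:]

# Toy data (hypothetical): every tenth build has no recorded duration.
builds = [{"id": i, "tr_duration": None if i % 10 == 0 else 60 + i}
          for i in range(100)]
train_cv, test_set = prepare_split(builds)
print(len(train_cv), len(test_set))
```

Shuffling before the split means both partitions are drawn from the same mix of projects and dates, which is what makes the held-out error a fair estimate.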
III-D Initial Feature Selection
The dataset contains 56 features, of which 34 are integer/numeric, 16 are strings, 4 are booleans, and 2 are in the ISO date format. To begin building our predictive model, we selected all but 3 of the integer/numeric features. The excluded features were the unique identifier of each build job, the pull request number on GitHub, and the build number on Travis. We did not deem any of the string or date variables useful enough to include in our predictive model. Finally, we considered all the boolean variables (coded as factors) in this initial feature selection phase. In total, 35 of the original 56 features were selected for building the predictive model. Table I lists the features considered for the prediction task.
Number of developers that committed directly or merged pull requests from the moment the build was triggered and 3 months back
|gh_num_issue_comments||If git_commit is linked to a PR on GitHub, the number of discussion comments on that PR|
|gh_num_commit_comments||The number of comments on git_commits on GitHub|
|gh_num_pr_comments||The number of comments (code review) on this pull request on GitHub|
|gh_src_churn||How much (lines) production code changed in the commits built by this build|
|gh_test_churn||How much (lines) test code changed in the commits built by this build|
|gh_files_added||Number of files added by the commits built by this build|
|gh_files_deleted||Number of files deleted by the commits built by this build|
|gh_files_modified||Number of files modified by the commits built by this build|
|gh_tests_added||Lines of testing code added by the commits built by this build|
|gh_tests_deleted||Lines of testing code deleted by the commits built by this build|
|gh_src_files||Number of src files changed by the commits that were built|
|gh_doc_files||Number of documentation files changed by the commits that were built|
|gh_other_files||Number of files which are neither production code nor documentation that were changed by the commits that were built|
|tr_tests_ok||Number of tests passed|
|tr_tests_fail||Number of tests failed|
|tr_tests_run||Number of tests that were run as part of this build|
|tr_tests_skipped||Number of tests that were skipped or ignored in the build|
|tr_testduration||Time it took to run the tests|
|gh_test_lines_per_kloc||Test density. Number of lines in test cases per 1000 gh_sloc|
|gh_test_cases_per_kloc||Test density. Number of test cases per 1000 gh_sloc|
|gh_asserts_cases_per_kloc||Assert density. Number of assertions per 1000 gh_sloc|
|gh_description_complexity||If gh_is_pr, the total number of words in the pull request title and description|
|tr_num_jobs||How many jobs this build has (length of tr_jobs)|
|gh_commits_on_files_touched||Unique commits on the files included in the build from the moment the build was triggered and 3 months back|
|gh_sloc||Number of executable production source lines of code, in the entire repository|
|tr_setup_time||Setup time for the Travis build to start|
|tr_purebuildduration||Time it took to run the build (without Travis scheduling and provisioning the build)|
|git_num_committers||Number of people who committed to this project|
|tr_ci_latency||Latency included by Travis (scheduling, build pick-up, …)|
|gh_is_pr||Whether this build was triggered as part of a pull request on GitHub|
|tr_tests_ran||Whether tests ran in this build|
|tr_tests_failed||Whether tests failed in this build|
|gh_by_core_team_member||Whether this commit was authored by a core team member|
|tr_duration||Overall duration of the build|
III-E Evaluation Metrics
We next select evaluation criteria to assess how well our model predicts build times from the learned dataset. These criteria also let us compare algorithms to see which performs better, and the metric scores guide what we pursue next to improve prediction accuracy. The metrics used for the task are Root Mean Square Error (RMSE) and $R^2$ (R-squared).
RMSE is the square root of the mean squared deviation of the predicted values from the observed values. It indicates how far, on average, our predictions fall from the actual values. The lower the RMSE, the better the prediction accuracy. RMSE has the same unit as the response variable, which in this case is seconds, the unit of measurement for the total duration of a build job (tr_duration).
$R^2$ measures how much of the variation in the response is explained by the model. The values of $R^2$ range between 0 and 1. Values close to zero signify that a large proportion of variability in the result is unexplained by the model, while values close to one indicate that most of the variance in the result is accounted for by the model. We look for values closer to one to ensure the robustness of our model when it encounters new, unseen data.
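Both metrics are short to compute by hand. The following Python sketch assumes that R-squared is taken as the squared Pearson correlation between observed and predicted values, one common definition that always lies between 0 and 1; the text does not state which definition was used, and the function names and toy numbers are illustrative only.

```python
import math

def rmse(observed, predicted):
    """Root mean square error, in the units of the response (seconds here)."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values.

    Assumption: this is one of several definitions of R-squared.
    """
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)

observed = [100.0, 200.0, 300.0, 400.0]    # toy build durations in seconds
predicted = [110.0, 190.0, 310.0, 390.0]   # toy model output
print(rmse(observed, predicted))           # every prediction is off by 10 s
print(r_squared(observed, predicted))
```

Because RMSE squares each error before averaging, a few badly mispredicted long builds can dominate the score, which is worth keeping in mind when reading RMSE values of several thousand seconds.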
III-F Initial Feature Scaling - Data Standardization
The selected features for our prediction model are standardized to ensure that data values are on the same scale. Standardization benefits some regression algorithms and instance-based methods that evaluate the distance between points. Thus, we apply center-and-scale standardization as a data pre-processing step.
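As an illustration of center-and-scale standardization, the Python sketch below subtracts the mean of a feature and divides by its sample standard deviation. The helper names `fit_scaler` and `apply_scaler` and the toy values are assumptions for this example; the statistics are estimated on training values and then reused for new values, which is the usual practice.

```python
import statistics

def fit_scaler(values):
    """Return the mean and sample standard deviation of one feature column."""
    return statistics.fmean(values), statistics.stdev(values)

def apply_scaler(values, mean, sd):
    """Center on the mean and divide by the standard deviation."""
    if sd == 0:
        return [0.0 for _ in values]   # constant column: centering only
    return [(v - mean) / sd for v in values]

train_sloc = [1000.0, 2000.0, 3000.0]        # toy gh_sloc values
mean, sd = fit_scaler(train_sloc)
print(apply_scaler(train_sloc, mean, sd))    # training values, standardized
print(apply_scaler([4000.0], mean, sd))      # unseen value, same mean and sd
```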
III-G Rationale for Algorithm Selection
We sample a set of linear and non-linear algorithms that work on regression problems to get a baseline for model accuracy. In doing this, we use 10-fold cross-validation (CV) with 3 repeats. This procedure splits the training set into 10 folds; each selected algorithm is trained 10 times on 90% of the data, with the remaining 10% used to assess model performance. The process is repeated 3 times to produce a more stable estimate of algorithm performance. It also guards against selecting a model that overfits by capturing noise in the data, which leads to poor generalizability when predicting the build time of an out-of-sample job.
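The resampling scheme can be made concrete with a small index generator. This Python sketch only produces the train and held-out row indices for repeated k-fold cross-validation; the function name, the seed, and the row count are assumptions for the example, and fitting the models themselves is left out.

```python
import random

def repeated_kfold_indices(n_rows, k=10, repeats=3, seed=1):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for rep in range(repeats):
        order = list(range(n_rows))
        rng.shuffle(order)                        # new random partition per repeat
        folds = [order[i::k] for i in range(k)]   # k roughly equal folds
        for held_out in range(k):
            test_idx = folds[held_out]
            train_idx = [i for f, fold in enumerate(folds)
                         if f != held_out for i in fold]
            yield rep, held_out, train_idx, test_idx

splits = list(repeated_kfold_indices(n_rows=100))
print(len(splits))                                # 10 folds x 3 repeats
print(len(splits[0][2]), len(splits[0][3]))       # train and held-out sizes
```

Each algorithm would be fitted once per yielded split, so 10 folds with 3 repeats means 30 fits per algorithm, which helps explain the run times reported later.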
For this problem, we sample the following linear and non-linear supervised machine learning algorithms to spot-check initial baseline results. For linear models, we sample Linear Regression (LR), Partial Least Squares (PLS), Penalized Linear Regression (GLMNET), and Least Angle Regression (LARS). For non-linear models, we sample Classification and Regression Trees (CART), Support Vector Machines (SVM) with a radial basis function kernel, k-Nearest Neighbors (KNN), and Neural Network (Nnet). We then spot-check a set of ensemble methods: Bagged Classification and Regression Trees (BCART), Random Forest (RF), Stochastic Gradient Boosting (SGB), and Cubist (CB).
The linear models were selected because they can provide surprisingly good predictions even when the inherent structure of the data is non-linear. With enough data, linear models sometimes equal or even outperform their non-linear counterparts on non-linear datasets. The non-linear models were selected because we understand the underlying relationships between features in the data to be non-linear; with such a structure, a non-linear algorithm is expected to predict more accurately. Finally, ensemble methods combine the outputs of multiple models to get a better prediction score on unseen data. We sample a few of them here because ensemble methods are known to give good accuracy in various prediction tasks.
III-H Computational Tools
Due to the computationally expensive nature of the project, we ran our algorithms on OpenStack, an infrastructure-as-a-service cloud computing platform, to take advantage of multicore parallelization. Our OpenStack configuration consisted of 20 CPUs hosting a Linux distribution with our computing tools.
We used the R statistical programming environment as the main tool for our analysis. R was selected because of its robust, open-source machine learning packages. R is widely recognized in the area of predictive analytics, and many data scientists and machine learning engineers use it as their preferred tool. We primarily used the caret package, among other packages. caret is short for “Classification and Regression Training”; it contains a plethora of functions that simplify the process of training and testing machine learning models for regression and classification problems.
To speed up processing, we ran our algorithms on a subset of 10,000 records from the training set to get initial baseline results on how the learning algorithms perform. A seed was set to ensure reproducibility. The results are shown in Table II.
Table II: Baseline results. Summary statistics (Min., 1st Qu., Median, Mean, 3rd Qu., Max., NA's) of RMSE and Rsquared for each algorithm.
From the results provided in Figure 1, the Cubist (CB) model has the lowest RMSE at 4,052 seconds, followed by Random Forest (RF) at 4,145 seconds. The Cubist (CB) model and Stochastic Gradient Boosting (SGB) have the highest and second-highest $R^2$ values, 0.7808 and 0.7742 respectively. A baseline RMSE of 4,052 seconds is therefore the figure to beat in order to improve the model. From experience, this CV RMSE value is likely to increase when the model is tested on unseen data. We explored other predictive analytics techniques to see whether we could obtain a better prediction model with a lower RMSE and a higher $R^2$.
IV-B Drop Highly Correlated Features
In machine learning practice, highly correlated features are observed to have an adverse effect on predictive model accuracy. Hence, we prune features that are highly correlated. To do this, we set a correlation cut-off of 0.70 (i.e., values above 0.70 or below -0.70) to indicate highly correlated variables. We use the findCorrelation() function of the caret package to find and remove highly correlated features, computing the correlation matrix on the subset of 10,000 records from the training set. The features gh_src_files, tr_tests_ok, gh_test_cases_per_kloc, and gh_test_lines_per_kloc had a correlation greater than 0.70 or less than -0.70 and were therefore removed from the model. This reduced the number of features to 31.
However, removing these attributes did not noticeably improve prediction accuracy. Table III lists the mean RMSE values.
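The idea behind the correlation filter can be sketched as follows. This Python code is a simplified stand-in and not caret's findCorrelation() algorithm: for each pair of features whose absolute Pearson correlation exceeds the cut-off, it drops the one with the larger mean absolute correlation to the other features. The feature names x1, x2, x3 and their values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def drop_correlated(columns, cutoff=0.70):
    """Return the feature names kept after pruning correlated pairs."""
    names = list(columns)
    corr = {a: {b: abs(pearson(columns[a], columns[b]))
                for b in names if b != a} for a in names}
    # Visit pairs from the strongest absolute correlation downwards.
    pairs = sorted(((corr[a][b], a, b)
                    for i, a in enumerate(names) for b in names[i + 1:]),
                   reverse=True)
    dropped = set()
    for r, a, b in pairs:
        if r <= cutoff:
            break                      # remaining pairs are below the cut-off
        if a in dropped or b in dropped:
            continue                   # this pair is already resolved
        mean_a = sum(corr[a].values()) / len(corr[a])
        mean_b = sum(corr[b].values()) / len(corr[b])
        dropped.add(a if mean_a >= mean_b else b)
    return [n for n in names if n not in dropped]

features = {                            # hypothetical toy columns
    "x1": [10, 20, 30, 40, 50],
    "x2": [9, 19, 30, 38, 50],          # nearly a copy of x1
    "x3": [500, 100, 400, 200, 300],    # weakly related to both
}
kept = drop_correlated(features)
print(kept)
```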
IV-C Recursive Feature Elimination (RFE)
RFE is an automatic method of selecting features for a predictive model based on their relative importance. RFE is a wrapper method: it evaluates subsets of features of varying size to determine the relative impact of each feature in the presence of the others. It can also be viewed as a brute-force method and is computationally expensive. We used the random forest implementation of RFE to search for the feature subset that best improves our model accuracy. The results of this automatic selection method are shown in Figure 1.
Figure 1 suggests that reducing the feature space to 28 variables could yield a slightly better model. However, the resulting difference in mean prediction error was not statistically significant.
IV-D Boruta Feature Selection Method
We also employed another wrapper technique, Boruta, to see whether we could squeeze out an improvement in prediction accuracy on the CV dataset. Boruta is an automatic wrapper method that uses random forest to determine the relative importance of features with respect to the response variable. After running this algorithm and testing the resulting dataset of 29 features across our selected learning algorithms, we achieved results similar to the benchmark.
IV-E Applying Box-Cox Power Transforms
We applied Box-Cox power transforms, which further normalize our features so that they approximate a Gaussian distribution. This data transformation is heuristically known to improve prediction accuracy for various linear machine learning models that perform better on normalized data. Indeed, the transform yielded its biggest improvements across our linear models. It had no effect on the best overall accuracy, however, because linear models generally perform very poorly on this problem, as previous results have shown.
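For reference, the Box-Cox transform of a strictly positive value x is (x^λ - 1)/λ for λ ≠ 0 and log(x) for λ = 0. The Python sketch below applies it with a λ supplied by hand, which is an assumption for illustration: the text does not say how λ was chosen, and tools typically estimate it from the data.

```python
import math

def box_cox(values, lam):
    """Apply the Box-Cox power transform to strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]       # limiting case of the formula
    return [(v ** lam - 1.0) / lam for v in values]

skewed = [1.0, 4.0, 9.0]               # toy right-skewed feature values
print(box_cox(skewed, 0.5))            # square-root-like compression
print(box_cox(skewed, 0))              # natural log
```

Features that contain zeros, such as counts of files added, would need a shift or a different transform before this can be applied.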
IV-F Using Principal Component Analysis
Principal Component Analysis is another data pre-processing technique that we implemented to extract important features from our dataset. This method is relevant when you have a high dimensional dataset, and you need to scale down the feature set to the ones that contain just the information you need to optimize prediction accuracy. Before applying principal component analysis, we first normalized our features (i.e. applying center and scale operations) to have them on the same scale. Unfortunately, the results of applying PCA did not yield any significant improvement on our CV accuracy. The results are shown in Figure 2 below.
IV-G Test Set Accuracy on Selected Models
Finally, we applied our baseline models to unseen data (i.e., the test set) to see how they perform on out-of-sample data points. We usually expect test set errors to be slightly worse than the CV error estimates, because CV can sometimes give overly optimistic estimates.
The test set was constructed by sampling 10,000 unseen records from the original 791,310 test records. We used each of the final models to predict the build duration of the test data. The model estimates were compared to the original tr_duration values of the test set using RMSE and $R^2$.
Table IV summarizes the test-set prediction accuracy, that is, the result of applying the different models to unseen data. As expected from the CV results, Cubist (CB) and Random Forest (RF) outperform the others, with lower RMSE and higher $R^2$. The $R^2$ values are particularly encouraging because they show that a high percentage of the variance in build duration is accounted for by the model. Figures 3 and 4 present a graphical view of the test-set RMSE and $R^2$ in a dotplot.
V-A Implications of Study
Wallace et al. suggest that as the number of developers increases, the project size and complexity also grow. This in turn increases the wait time of a build job. Also, frequent integration implies that a small number of changes in software artifacts (e.g., lines of code changed, files added, files deleted, files modified) is constantly integrated into the main code base.
Developers tend to lose focus and productivity while waiting for code to build. This is detrimental to the idea of continuous integration, which advocates frequent builds so that changes are committed to the central code base fast and early. However, as a project expands, frequent integration can become a huge impediment to productivity. For example, if a build takes approximately 1 hour or more, integrating more than once a day can become a setback to developer speed and efficiency.
The related work discusses numerous techniques for balancing the trade-off between build waiting time and the need to continuously integrate code. Our research project addresses this trade-off directly: we developed a predictive model to approximate the build time of a build job in a CI environment.
Software Developers. When the approximate time for a build is known, a developer can be more intentional about how they spend their time, which can improve their efficiency and productivity. For example, if a build is expected to take a long time, the developer may move on to other tasks such as responding to e-mails or reviewing code. As a result, the overall quality of the project may improve.
Project Managers. Knowing the approximate build time beforehand can be advantageous to project managers (and management in general). Project managers are able to strategize on how best to manage the CI process to balance the anticipated build wait time against the need to continuously commit changes to the central code base. This foreknowledge will be particularly crucial for large projects. Project managers can explore different variables to ascertain the best setup for a build job that keeps the build wait time within an acceptable level, given the particular circumstances of the project. Teams can be aware of the minimum number of changes that must be implemented before a build job is triggered.
Software Organizations. Software organizations looking to adopt CI can make use of the prediction model. As organizations transition to CI practices, they may begin to approximate their build times. Making process changes that favourably reduce build times may be easier when CI practices are not fully implemented within an organization. The lack of constraints from a CI environment may allow organizations to quickly make changes to reduce build times before CI is eventually implemented.
Researchers. Finally, other researchers can use and modify the prediction model to further our understanding of build times. Researchers can further study the variables and relationships affecting build times. The prediction model could be expanded to include a dataset other than TravisTorrent.
Our research has direct and immediate relevance to industry and further strengthens the concept of continuous integration, which in turn gives rise to continuous delivery in the agile development model widely embraced by industry.
V-B Limitations of Study
Our study had several limitations that restricted further enquiry into improving our predictive model. The main obstacle was the sheer computational power required to perform our analysis. The TravisTorrent dataset contains approximately 1.76 GB of data with more than 2 million observations.
Additionally, we considered a variety of learning algorithms, which comprised a combination of linear, non-linear, and ensemble methods. Some of the learning algorithms are computationally expensive; this is especially true of kernel methods such as support vector machines and ensemble methods such as random forest and stochastic gradient boosting. These algorithms took over 8 hours each for a single run, even though our training set was sub-sampled from 1,846,396 records to 10,000 records.
We also ran into issues with R, the computational tool we used for our analysis. We encountered many technical issues when running R on our dedicated 20-core OpenStack cluster, and our tasks were prone to frequent crashes. We were forced to restart our R sessions multiple times, sometimes after an algorithm had been running for 8 to 10 hours. In our experience on this project, R does not have a very mature parallelization framework. We believe that, at this point in time, R is unsuitable for multi-core, high-performance data analytics.
The data and code used in this research have been made publicly available to enhance reproducibility and to encourage further mining enquiry that improves on our results, in the spirit of open research and collaboration. The authors are not perfect; we hope that constructive criticism, revisions, and remodeling can be made to our work.
V-D Further Recommendations
Base R does not scale well as a tool for parallel computing with big data. Hadoop would do a better job in this area, as this is what Hadoop was primarily built for. Furthermore, R operates on data held in RAM, and it can run out of memory quickly when carrying out expensive computations. Hadoop, on the other hand, works with data stored on disk, which is usually in greater supply. Due to time constraints, we could not fully explore working with Hadoop and its MapReduce operations to process our data.
Typically, the predictive accuracy of a model improves when more data is available. Because of the computational constraints described above, we could not take advantage of the copious amounts of data available to train our model. The authors strongly recommend Hadoop for further exploration.
V-E Demo App to Motivate Integration as an Industry Tool
In conclusion, we hope the feature space can be further simplified to a set of variables that can be implemented as a tool in industry to predict the build time of GitHub projects running Travis as a CI platform. To motivate that goal, we have created a sample application. The application can be viewed at https://dvdbisong.shinyapps.io/BuildTimePredictor/. This tool could be implemented as an IDE plugin or as an add-on to planning/scheduling software for developers. Figure 5 shows a screenshot of the application page.
In building the sample application, we used a trained Cubist model. The application receives as input the team size, the lines of production code changed, the lines of test code changed, the number of files added, deleted, or changed, and the number of jobs contained in the build. The remaining 30 variables were estimated using the means of their values in the test data sample. Hopefully, a future study can reduce the feature space to a small set of relevant features that minimizes RMSE and maximizes $R^2$.
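The way the application completes a feature vector can be illustrated with a short Python sketch. It takes the user-supplied values and fills every other feature with its mean over a reference sample. The function name `build_feature_row` and the toy reference rows are assumptions for this example, and the trained model itself is not reproduced here.

```python
def build_feature_row(user_inputs, reference_rows):
    """Combine user-supplied values with column means for all other features."""
    row = {}
    for name in reference_rows[0]:
        if name in user_inputs:
            row[name] = user_inputs[name]          # value entered by the user
        else:
            column = [r[name] for r in reference_rows]
            row[name] = sum(column) / len(column)  # mean imputation
    return row

reference = [                                      # hypothetical sample rows
    {"gh_src_churn": 10, "tr_num_jobs": 2, "gh_sloc": 1000},
    {"gh_src_churn": 30, "tr_num_jobs": 4, "gh_sloc": 3000},
]
row = build_feature_row({"gh_src_churn": 120, "tr_num_jobs": 3}, reference)
print(row)
```

Mean imputation keeps the interface small, but predictions then reflect an average project on the imputed features, which is one reason a smaller feature set would make such a tool more faithful.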
We would like to thank Professor Olga Baysal for teaching COMP 5900: Mining Software Repositories, which we were fortunate to take. Through weekly presentations as well as back-and-forth discussions, we gained a lot of insight into the software development industry. We believe that there is much more to be discovered!
We would also like to thank Andrew Pullin for helping us set up our computing environment. Andrew laid out some best practices and helped us set up our environment on OpenStack.
- G. Ammons. Grexmk: Speeding up scripted builds. In Proceedings of the 2006 International Workshop on Dynamic Systems Analysis, WODA ’06, pages 81–87, New York, NY, USA, 2006. ACM.
- M. Beller, G. Gousios, and A. Zaidman. TravisTorrent: Synthesizing Travis CI and GitHub for full-stack research on continuous integration. In Proceedings of the 14th Working Conference on Mining Software Repositories, 2017.
- G. Brooks. Team pace – keeping build times down. In Agile, 2008. AGILE ’08. Conference, pages 294–297. IEEE, 2008.
- E. Dolstra and E. Visser. The Nix build farm: A declarative approach to continuous integration. 2008.
- P. Domingos. A few useful things to know about machine learning. volume 55, pages 78–87. ACM, 2012.
- P. Kainulainen. The cost of context switching. https://www.petrikainulainen.net/software-development/processes/the-cost-of-context-switching.
- M. B. Kursa, W. R. Rudnicki, et al. Feature selection with the Boruta package, 2010.
- E. Laukkanen and M. V. Mantyla. Build waiting time in continuous integration: an initial interdisciplinary literature review. In RCoSE ’15 Proceedings of the Second International Workshop on Rapid Continuous Software Engineering, pages 1–4. IEEE Press Piscataway, NJ, USA, 2015.
- M. Meyer. Continuous integration and its tools. In IEEE Software, volume 31, pages 14–16, 2014.
- A. Mokhov, N. Mitchell, S. Peyton Jones, and S. Marlow. Non-recursive make considered harmful: Build systems at scale. In Proceedings of the 9th International Symposium on Haskell, Haskell 2016, pages 170–181, New York, NY, USA, 2016. ACM.
- T. Proietti and H. Lütkepohl. Does the Box–Cox transformation help in forecasting macroeconomic time series? International Journal of Forecasting, 29(1):88–99, 2013.
- A. V. C. team. Practical guide to principal component analysis (PCA) in R & Python. https://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/.
- A. Vance. The New York Times data analysts captivated by R’s power. http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html.
- L. Wallace and M. Keil. Software project risks and their effect on outcomes. Communications of the ACM, 47(4):68–73, 2004.