It is accepted wisdom that maintenance dominates software development costs (LewisModernizingLegacy2003, ), with bug handling being a major contributor (Sutherland:1995:BOC:210376.210394, ; Jorgensen:2007:SRS:1248721.1248736, ). The effort required for handling bugs (including locating and fixing the faulty code, and updating the test suite as a result) is likely to be impacted by the programming languages the software is built with (pajankarpython, ). However, which languages, or which category of languages, perform better with respect to bug handling has long been debated in industry and academia alike. For example, believers in static typing argue that static languages tend to result in better software quality and lower bug-handling cost, because type checking is an effective way of tracking bugs (dynbadone, ; dynbadtwo, ). On the other hand, advocates of dynamic typing hold that the ease of reading, writing, and understanding dynamic languages makes bugs easier to find and fix (nierstrasz2005revival, ).
There have been a number of previous attempts to resolve this confusion. For example, Bhattacharya et al. (bhattacharya2011assessing, ) analyzed four open-source projects developed in C and C++; Kleinschmager and Hanenberg et al. (kleinschmager2012static, ; hanenberg2014empirical, ) compared bug-handling time for Java and Groovy. However, these works examine only a small number of subjects and mostly focus on pair-wise language comparisons, which threatens the reliability of their results.
This paper presents a systematic large-scale comparison of bug-handling effort among different programming languages. Our work differs from previous work in the following aspects. First, we perform a comprehensive study of popular languages using a large number of projects: we choose 10 popular languages according to various rankings as our target languages, and 600 projects (summing to 70,816,938 SLOC and 3,096,009 commits). Second, we adopt a variety of measurement metrics (instead of the single one used in previous work): the (absolute and relative) amount of line modification, bug-handling time, and amount of file modification. Third, we take special care in removing or reducing threats to result validity: 1) we adopt a range of statistical analysis approaches and treat influential factors as control variables; 2) we use median values over a large number of projects and commits, which are considered “less affected by outliers and skewed data”, to remove the bias caused by extreme circumstances (bissyande2013got, ; wonnacott1972introductory, ; meanandmedian, ); 3) we manually check the analyzed data to make sure that our experimental setup is reasonable.
It is worth pointing out that the relationship between programming languages and bug-handling effort can be extremely complicated, potentially affected by many factors. For this reason, we perform correlation analysis rather than causal analysis in this work. When we say a language, we refer to the whole ecosystem of the language, including tool support, developer experience, homogeneity of the code base and programming styles, adherence to or violation of best practices, the maturity of the community, and so forth. When we say bug-handling effort, we refer to measurable criteria including the (absolute and relative) amount of line modification, bug-handling time, and file modification.
The results may impact current software engineering practice in multiple ways. For example, developers who care about bug-handling effort now have a more objective reference for choosing languages; the same goes for managers who plan and schedule projects. On a more technical note, automatic program repair has been an area of growing popularity (arcuri2008novel, ; hansson2015automatic, ; arcuri2008automation, ). Our results may provide hints on whether some languages typically require larger patches or more file modification, and thus a larger search space for finding proper patches. Moreover, languages requiring high bug-handling effort may benefit more from automatic debugging, and thus could be better targets for such research.
These conclusions may not fully generalize to imply underlying causality. Indeed, such a limitation also exists in previous studies analyzing the relationships between programming languages and the characteristics of the software built with them (bhattacharya2011assessing, ; kleinschmager2012static, ; hanenberg2014empirical, ; steinberg2011impact, ; nierstrasz2005revival, ; tratt2009dynamically, ; sanner1999python, ; oliphant2007python, ). The derived guidelines may not be thoroughly interpretable or actionable, but can still provide suggestions to developers and researchers.
Specifically, the main contributions of this paper are threefold.
(1) A systematic and extensive study of bug-handling effort among different programming languages. We perform a comprehensive study of 10 popular languages, and adopt a variety of measurement metrics to measure bug-handling effort. We analyze the threats to result validity and take actions to remove or reduce them.
(2) Empirical evidence that Java requires more line/file modification and less bug-handling time, while Ruby requires less line/file modification and more bug-handling time. Static and strong languages tend to require less bug-handling time.
The remaining parts of this paper are organized as follows. Section 2 motivates our work by introducing the current status of online debates over the bug-handling effort among different programming languages. Section 3 presents the details of our experiment design; Section 4 introduces the findings as well as the corresponding analysis; Section 5 discusses the implications of our findings on developers, managers, and researchers. Section 6 discusses the threats to validity and our efforts in reducing them. Section 7 introduces the related work. Section 8 concludes the paper.
In this section, we highlight the current status of confusion by presenting the contrasting views of practitioners as well as researchers to further motivate our work. The aim is to highlight the existence of the debate (which motivated us), rather than presenting a comprehensive survey. Therefore, we select the most representative online discussions and most related academic studies.
We surveyed the online discussions (here, “online” refers to the views of practitioners published through blogs, forums, homepages, Q&A websites, and so on) on this topic by googling the following query: “programming languages” + “maintainability”/“bug-handling effort”/“bug-fixing effort”. We collected all the views on the first three pages returned (on 05/01/2017). Due to space limits, we only give a snapshot of these results, shown in Table 1. The full results are on our homepage (link omitted to preserve anonymity). Column “Link” refers to different online sources, and Column “Effort” indicates the views expressed (whether the bug-handling effort is high or low for the (category of) language). For example, from the first two rows, three websites (webbackendlang, ; csharpgood, ; dyngoodthree, ) contain the view that dynamic languages have lower bug-handling effort, while some others (dynbadtwo, ; dynbadone, ; dynbadthree, ) contain the opposite view. Similarly, from the following two rows, three websites (pajankarpython, ; pythongood, ; dyngoodthree, ) contain the view that Python has lower bug-handling effort, whereas some others (pythonbadtwo, ; manylang, ) contain the opposite view.
| Type | Language | Link | Effort |
|---|---|---|---|
| Category | dynamic languages | (webbackendlang, ; csharpgood, ; dyngoodthree, ) | low |
| Category | dynamic languages | (dynbadtwo, ; dynbadone, ; dynbadthree, ) | high |
| Language | Python | (pajankarpython, ; pythongood, ; dyngoodthree, ) | low |
| Language | Python | (pythonbadtwo, ; manylang, ) | high |
| Language | C# | (manylang, ; csharpgood, ) | low |
The inconsistency of opinions is also shared in academia. Bhattacharya et al. (bhattacharya2011assessing, ) statistically analyzed four open-source projects developed in C and C++. They measured maintainability by the number of lines modified during bug handling, and found that the move from C to C++ results in improved software quality and reduced maintenance effort. Kleinschmager and Hanenberg et al. (kleinschmager2012static, ; hanenberg2014empirical, ) compared the bug-handling time for Java and Groovy. Their results indicate that Groovy, a dynamic language, requires more time for bug handling, and they concluded that static types are indeed beneficial in reducing bug-handling time. Steinberg (steinberg2011impact, ) found that static typing has a positive impact on debugging time if only non-type errors are considered.
On the other hand, some researchers are against the use of static languages. Nierstrasz et al. (nierstrasz2005revival, ) described static languages as “the enemy of change”, claiming that dynamic languages are easier to maintain. Tratt et al. (tratt2009dynamically, ) also mentioned that compared to dynamic languages, static languages have higher development cost and require more complex changes. Sanner et al. (sanner1999python, ) described Python as a “smaller, simpler, easy to maintain, and platform independent” language due to its dynamic typing features. Oliphant et al. (oliphant2007python, ) gave a similar verdict.
A common characteristic of these existing academic studies is that they are all of a small scale, mostly focusing on pair-wise language comparisons aiming at isolating the effect of certain language features. These studies use only a small number of subjects, and mostly one measurement criterion of bug-handling effort. In contrast to these studies, in this paper we aim to look at the big picture, by considering a range of programming languages, bug-handling-effort measurement criteria, and analysis approaches. We also use a large number of projects to reduce bias and derive more reliable results.
3. Experimental Setup
This study is designed to answer the following research questions.
RQ1: What is the bug-handling effort of different languages?
RQ2: What is the bug-handling effort of different language categories?
RQ3: Do application domains impact the comparison results of different languages?
RQ4: Does considering programming languages improve the accuracy of bug-handling-effort prediction?
Note that RQ1, RQ2, and RQ3 focus on comparing bug-fixing effort across programming languages. RQ4 explores the feasibility of using our results in a specific context. The presented model in RQ4 is a proof-of-concept, not intended to be practical.
3.1. Target Programming Languages
As in previous work (ray2014large, ), we categorize the languages according to two well-known classifications, namely compilation and typing, as shown in Table 2. The compilation classification divides a target language into the dynamic or static category depending on whether types are checked dynamically during run time or statically during compilation. The typing classification divides a target language into strong typing or weak typing depending on how strictly types are distinguished (ray2014large, ). We call statically and dynamically checked languages “static languages” and “dynamic languages”, and strong and weak typing languages “strong languages” and “weak languages”. Note that as the main aim of this work is to study popular languages, we do not seek comprehensive coverage of the classification space; for example, the language list does not include any predominantly functional language.
3.2. Subjects and Control Variables
All our subjects are open-source projects from GitHub (githubhomepage, ). For each target language, we retrieve the project repositories that are primarily written in that language, and select 60 most popular projects based on their number of stars, as in prior work (starpopular, ; ray2014large, ; bhattacharya2011assessing, ).
Figure 1 presents the density distribution of basic information about all the projects. Four types of information are presented: 1) SLOC: the physical executable lines of code, calculated by the tool CLOC (clochomepage, ). For multi-language projects, we only adopt the SLOC of the primary language reported by CLOC. 2) #Commit: the total number of commits, downloaded via the GitHub API (githubapihomepage, ). 3) Age: the age of each project; we subtract each project's creation time (stored in the GitHub API) from 12:00:00Z, May 4, 2017 to obtain its age in hours. 4) #Contributor: the number of contributors, also collected through the API.
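As a small sketch of the age computation described above (the helper name `project_age_hours` is ours, not the study's), one can subtract a GitHub-style `created_at` timestamp from the study's reference time:

```python
from datetime import datetime, timezone

# Reference time used in the study: 12:00:00Z, May 4, 2017.
REFERENCE_TIME = datetime(2017, 5, 4, 12, 0, 0, tzinfo=timezone.utc)

def project_age_hours(created_at: str) -> float:
    """Project age in hours, given a GitHub-style ISO 8601 creation timestamp."""
    created = datetime.strptime(created_at, "%Y-%m-%dT%H:%M:%SZ")
    created = created.replace(tzinfo=timezone.utc)
    return (REFERENCE_TIME - created).total_seconds() / 3600.0

# A project created exactly one day before the reference time is 24 hours old.
print(project_age_hours("2017-05-03T12:00:00Z"))  # 24.0
```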
From the figure, the projects of different languages tend to have different sizes, ages, and so on, which may influence the bug-handling effort. As in previous work (ray2014large, ), we consider them as control variables in regression analysis (see Section 3.4).
Density distribution of subjects. Each language has a violin plot, which shows the probability density of the data at different values using the plot width. For example, in the last figure, showing the number of contributors, most Java and Objective-C projects have a small number of contributors.
3.3. Measurements for Bug-Handling Effort
To measure bug-handling effort, prior work used the amount of line modification (bhattacharya2011assessing, ) or bug-handling time (kleinschmager2012static, ) as the criterion. But since bug-handling effort is complex to measure, using only a single metric is likely to introduce bias. To mitigate this problem, in this paper we measure the bug-handling effort of a language in terms of three aspects: the amount of line modification, bug-handling time, and the amount of file modification. Since the size of a project may impact these measurements, we consider both the absolute and the relative number for each criterion. Thus, in all, we use the six measurement criteria below, where SLOC and #File refer to the total number of lines of code and the total number of files (of the primary language) for a project. As mentioned above, we do not expect any single one of the measurements alone to be sufficient to reflect bug-handling effort accurately. Instead, we aim to achieve a higher level of confidence by having multiple measurements complement each other.
LM_a: the absolute number of modified lines of code.
LM_r: the relative number of modified lines of code, i.e., LM_a / SLOC.
T_a: the absolute time for handling a bug.
T_r: the relative time for handling a bug, i.e., T_a / SLOC.
FM_a: the absolute number of modified files.
FM_r: the relative number of modified files, i.e., FM_a / #File.
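The six criteria can be sketched as a small helper; the metric names (`LM_a`, `LM_r`, `T_a`, `T_r`, `FM_a`, `FM_r`) are illustrative stand-ins for the paper's notation:

```python
def bug_handling_metrics(lines_modified, files_modified,
                         handling_time_hours, sloc, n_files):
    """Compute the six (absolute and relative) bug-handling-effort metrics.

    Relative values normalize by project size: lines and time are divided
    by SLOC, file modifications by the total file count, as in the text.
    """
    return {
        "LM_a": lines_modified,
        "LM_r": lines_modified / sloc,
        "T_a": handling_time_hours,
        "T_r": handling_time_hours / sloc,
        "FM_a": files_modified,
        "FM_r": files_modified / n_files,
    }

# e.g. a 50-line, 2-file fix taking 10 hours in a 1000-SLOC, 100-file project
metrics = bug_handling_metrics(50, 2, 10.0, sloc=1000, n_files=100)
```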
For each project, we collect the amount of line/file modification during bug handling by analyzing its commits, as in prior work (ray2014large, ; kamei2013large, ). In particular, we search the commits with messages containing both “fix” and “bug” (case insensitive), and treat them as bug-handling commits. (We do not use other error-related keywords such as “issue”, “mistake”, or “fault”, because from our observation these keywords are also widely used to describe problems unrelated to source code, e.g., problems in documents, and may pollute the screening results.) We then count the number of modified program files as well as the number of modified lines belonging to the project's primary language, so as to calculate LM_a, LM_r, FM_a, and FM_r.
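A minimal sketch of this commit-screening step, assuming commit messages are available as plain strings (the function name is ours):

```python
def is_bug_handling_commit(message: str) -> bool:
    """A commit is treated as bug-handling when its message contains
    both "fix" and "bug" (case insensitive), as described above."""
    msg = message.lower()
    return "fix" in msg and "bug" in msg

commit_messages = [
    "Fix bug in parser when input is empty",
    "Add new feature",
    "BUG: fixed off-by-one in loop",
    "Refactor issue handling",
]
bug_commits = [m for m in commit_messages if is_bug_handling_commit(m)]
# bug_commits keeps the first and third messages only
```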
When a bug gets fixed, a range of files may be modified or updated. In our measurement, we exclude non-code modifications such as documentation, but count all code changes in both source and test programs. This choice is deliberate, as we believe testing is an integral part of development, and the effort involved in updating test code is naturally part of bug handling and is language dependent. One obvious threat is that some bug-handling commits may contain code modification unrelated to the bug, particularly refactorings, which are likely to affect a disproportionately large amount of code. To check the severity of this bias, we manually analyzed 585 randomly chosen bug-handling commits from all our projects, and found that only 10.6% of the commits involve more than a single bug or other forms of code modification, indicating a high level of data integrity. To further reduce this bias, for each project we use the median value over all bug-handling commits to represent the project's general level of line/file modification, which is “less affected by outliers and skewed data” (bissyande2013got, ).
For each project, we acquire the time spent on bug handling by analyzing issue reports, as prior work did (bissyande2013got, ). Note that we do not use commit information here, as it only gives us the end time, not the corresponding start time of bug handling. Instead, we search the issue tracking system for closed issues with labels containing “bug” (case insensitive), and extract information from them. Inspired by the work of Zheng et al. (Zheng:2015:MIC:2786805.2786866, ), we define the handling time of each bug as the interval between the issue creation time and the time of the last comment, which has been shown to be more accurate than the interval between creation and closing time that most previous work adopted (Zheng:2015:MIC:2786805.2786866, ). Again, we use the median of all these times to represent the typical level of a project's bug-handling time, so as to remove the impact of extreme values.
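The bug-handling-time definition above can be sketched as follows, assuming issue data shaped like GitHub's JSON (the field names `created_at` and `comment_times` are illustrative):

```python
from datetime import datetime

def handling_time_hours(issue: dict) -> float:
    """Bug-handling time: the interval between issue creation and the
    LAST comment (not the closing time), per the definition above."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    created = datetime.strptime(issue["created_at"], fmt)
    last_comment = max(datetime.strptime(t, fmt)
                       for t in issue["comment_times"])
    return (last_comment - created).total_seconds() / 3600.0

issue = {
    "created_at": "2017-01-01T00:00:00Z",
    "comment_times": ["2017-01-01T06:00:00Z", "2017-01-02T00:00:00Z"],
}
print(handling_time_hours(issue))  # 24.0
```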
3.4. Statistical Analysis Used in the Study
We statistically analyze the experimental results from different aspects to improve the reliability of the analysis. If the conclusions from the different analyses are consistent, the results are highly likely to be reliable.
First, since we collect a sufficient number of projects for each programming language, we directly present the density distribution of bug-handling effort for each language (kampstra2008beanplot, ) and make comparisons. For example, if most projects of a language have a lower T_a, then that language is likely to need less bug-handling time.
Second, we use the median value to represent a language’s central tendency of bug-handling effort (srinivasan2007new, ; bissyande2013got, ; wonnacott1972introductory, ; meanandmedian, ), and rank the languages with it, as it is known that median values are better than average values in avoiding the bias of outliers (i.e., extreme values that differ greatly from other values) (bissyande2013got, ; wonnacott1972introductory, ; meanandmedian, ).
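A small illustration of why the median is preferred here: a single outlier commit dominates the mean but barely moves the median (the numbers are invented):

```python
from statistics import mean, median

# Per-commit modified lines for a hypothetical project: one giant
# refactoring-style commit (5000 lines) skews the mean badly.
modifications = [4, 6, 8, 10, 12, 5000]

print(mean(modifications))    # 840 -- dominated by the outlier
print(median(modifications))  # 9.0 -- the typical fix size
```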
Third, we use multiple linear regression to indicate the contribution of different languages to bug-handling effort (ray2014large, ). The comparison of bug-handling effort among different languages can be regarded as an importance-determination problem over categorical variables, so we use multiple regression to identify which languages contribute more to the effort values. Through multiple regression, each language receives a coefficient, with higher coefficients indicating more bug-handling effort. Besides coefficients, we also report: 1) p-value: a low p-value (< 0.05) indicates rejection of the null hypothesis (westfall1993resampling, ); 2) t-value: a statistic that measures the ratio between the coefficient and its standard error (winer1971statistical, ); 3) standard error: the average distance that the observed values fall from the regression line; 4) R-squared value: how well the data fit the regression model.
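As a rough sketch of this setup (not the study's actual model), one can one-hot encode the language as a categorical predictor alongside a control variable and fit ordinary least squares; the data below are invented for illustration:

```python
import numpy as np

# Toy data: per-project bug-handling effort with language as a
# categorical predictor and log(SLOC) as a control variable.
languages = ["Java", "Ruby", "Java", "Python", "Ruby", "Python"]
log_sloc  = np.array([10.0, 8.0, 11.0, 9.0, 8.5, 9.5])
effort    = np.array([3.0, 7.0, 3.5, 2.0, 7.5, 2.2])

# One-hot encode the languages (a real model would drop one level
# to avoid collinearity with an intercept; we fit without intercept).
levels = sorted(set(languages))
dummies = np.array([[1.0 if lang == lev else 0.0 for lev in levels]
                    for lang in languages])

# Design matrix: language dummies + control variable.
X = np.column_stack([dummies, log_sloc])
coef, *_ = np.linalg.lstsq(X, effort, rcond=None)

# Higher language coefficients indicate more bug-handling effort.
for lev, c in zip(levels, coef[:len(levels)]):
    print(lev, round(float(c), 2))
```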
3.5. Experimental Procedure
The experimental procedure of this study can be divided into data collection and data analysis.
3.5.1. Data Collection
First, we collect the bug-handling-effort data of projects in various programming languages, which are used for further analysis.
Step 1. Information retrieval from the GitHub API. The GitHub API provides comprehensive information on commits, issues, and project history. For commits, we download all the JSON files of commits, which contain commit messages, the numbers of line additions and deletions, file changes, and so on. To compute bug-handling time, we download the JSON files of issues, which contain the issue title, labels, state, creation time, close time, and the timestamps of all comments. Due to the restriction on GitHub API access (5,000 requests per hour), we skip the projects with very large commit histories (which cannot be downloaded within 24 hours); 16 projects are skipped this way.
Step 2. Extraction of related information. As described in Section 3.3, we identify bug-handling commits and bug-handling issues through keyword searching. Some projects contain multiple languages, for which we only extract changed code belonging to their primary languages. Specifically, we use file extensions (e.g., “.java” for Java language) to identify relevant changes.
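A minimal sketch of the extension-based filtering, with an illustrative (incomplete) extension map:

```python
# Map a few primary languages to their file extensions
# (illustrative subset of the 10 studied languages).
PRIMARY_EXTENSIONS = {
    "Java": (".java",),
    "Python": (".py",),
    "Ruby": (".rb",),
    "C": (".c", ".h"),
}

def is_primary_language_file(path: str, language: str) -> bool:
    """Keep only changed files belonging to the project's primary language."""
    return path.lower().endswith(PRIMARY_EXTENSIONS[language])

changed = ["src/Main.java", "README.md", "util/Helper.java", "build.xml"]
java_changes = [f for f in changed if is_primary_language_file(f, "Java")]
# java_changes keeps only the two .java files
```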
Step 3. Sanity check. We observed that the “most-popular” criterion implies good general metrics such as #issues, #developers, and #commits (1 project has fewer than 10 issues; 6 have fewer than 20 commits). Therefore, we focused on sanity metrics specific to our measurements: when checking bug-fixing line/file modification, we removed projects with no bug-fixing commits (65 removed) and chose 50 projects per language from the remainder; when checking bug-fixing time, we removed projects with no bug-fixing issues (137 removed) and chose 35 projects per language.
3.5.2. Data Analysis
After collecting the data, to answer RQ1, we use violin plots (hintze1998violin, ) to present the distribution of bug-handling effort across projects, and then rank the languages based on the median values over all projects of a language. We also calculate the multiple regression results as discussed in Section 3.4. Finally, we combine the median-value and multiple-regression analysis results by adding up each language's rankings from the two analysis approaches. For example, a language ranking 3rd in the median-value analysis and 5th in the multiple-regression analysis has a combined rank of 8.
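The rank combination can be sketched as follows (the language names and ranks below are invented for illustration):

```python
def combined_ranking(median_ranks: dict, regression_ranks: dict) -> list:
    """Combine the two analyses by summing each language's ranks;
    a smaller total means less bug-handling effort overall."""
    total = {lang: median_ranks[lang] + regression_ranks[lang]
             for lang in median_ranks}
    return sorted(total, key=total.get)

median_ranks = {"C": 4, "Java": 1, "Ruby": 6}
regression_ranks = {"C": 2, "Java": 3, "Ruby": 5}
# totals: Java 4, C 6, Ruby 11 -> Java ranks first overall
print(combined_ranking(median_ranks, regression_ranks))
```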
To answer RQ2, we conduct regression analysis using language categories instead of languages and compare their coefficients.
To answer RQ3, we follow previous work (ray2014large, )
in manually classifying projects into seven domains (as shown in Table 3). To reduce the bias of manual classification, two authors classified all the projects separately, and a third author then re-classified the projects with conflicting labels. For each domain, we delete the languages with no more than five projects and re-perform multiple regression with the remaining projects. We then compare the rankings within each domain with the ranking across all domains, and check whether some languages perform better or worse in specific domains.
Finally, to answer RQ4, we build a toy classification model to predict whether a project has high, medium, or low bug-handling time. We compare the effectiveness of this predictive model with and without using programming languages as a feature, and check if the prediction accuracy is impacted.
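One simple way such a toy model could label projects (the cut points below are invented; e.g. tertiles of the observed median times could serve as cuts):

```python
def effort_label(time_hours: float, low_cut: float, high_cut: float) -> str:
    """Bucket a project's median bug-handling time into low/medium/high
    using two cut points, e.g. the tertiles over all projects."""
    if time_hours <= low_cut:
        return "low"
    if time_hours <= high_cut:
        return "medium"
    return "high"

times = [5.0, 20.0, 120.0]
labels = [effort_label(t, low_cut=10.0, high_cut=50.0) for t in times]
print(labels)  # ['low', 'medium', 'high']
```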
More details of the data analysis procedure can be found in Section 4.
| Domain | Description | Examples | #Projects |
|---|---|---|---|
| Application | end user programs | bitcoin, macvim | 112 |
| Database | SQL and NoSQL databases | mongodb, influxdb | 25 |
| CodeAnalyzer | compiler, parser, interpreter | ruby, php-src | 44 |
| Middleware | operating systems, virtual machine | mitmproxy, codis | 32 |
4. Results and Analysis
For each research question, we first present the direct observations through three types of analysis approaches (i.e., density distribution, median-value ranking, and multiple regression), and then summarize the conclusions, followed by reasoning and analysis.
4.1. RQ1: Bug-Handling Effort among Programming Languages
4.1.1. Direct observations
Next, we perform multiple regression analysis by treating the variables introduced in Figure 1 as control variables, which are also regarded as inputs to the regression model. (We perform log transformation to stabilize the variance and improve the model fit (ray2014large, ).) When doing the regression for the relative values (such as LM_r and T_r), we remove SLOC from the control variables, because the relative values are already calculated by dividing by SLOC, which is itself an approach to variable controlling.
Significance codes: p-value < 0.001: ‘***’; p-value < 0.01: ‘**’; p-value < 0.05: ‘*’; p-value < 0.1: ‘.’
The R-squared values for line-modification and time prediction are above 0.90; those for file-modification prediction are above 0.50.
To combine the results of median-value (in Figure 2) and multiple regression (in Table 4) analysis, for each language, we pick out its ranking results of both analysis approaches and present them in Figure 4. The blue and white bars are for rankings in median-value analysis and multiple-regression analysis respectively. For example, in the first sub-figure, when analyzing , the language C ranks No.4 in median-value analysis and No.2 in multiple-regression analysis, and thus its combined ranking number is 6.
From Figure 4, we have the following observations. First, the blue and white parts inside each bar mostly have similar lengths, indicating that the median-value and multiple-regression analysis results are highly coherent. This observation jointly supports the reliability of each analysis approach. Second, we can now conclude with confidence that Java requires a high level of line and file modification, but less so with regard to time. Ruby has high absolute and relative bug-handling time, whereas PHP, Python, and C have low bug-handling time as well as less code modification.
4.1.2. Conclusions and Analysis
Combining these observations, we have the following findings.
Finding 1: Different programming languages require different bug-handling effort. For example, Java tends to require more (absolute and relative) line modification but less handling time than other languages, and Python requires less bug-handling effort in terms of both line modification and time.
Findings on Java. Java tends to require more line/file modification, but less bug-handling time. This finding matches the widely recognized understanding that Java is a verbose language (broussard2006method, ). Our result shows that this verbosity carries over to bug handling. Another language known for its verbosity is C#, which also has a high level of line modification. However, C# projects tend to be very large (see Figure 1), which moderates its relative value LM_r. Despite requiring large line modifications, Java is one of the languages with short bug-handling time, which is particularly observable through its T_r value. This result suggests that bug handling in Java requires a relatively small and uniform amount of time, irrespective of the overall project size. One reason may be the large number of declarations required in Java, including type declarations, method parameter types, return types, access levels of classes, exception handling (broussard2006method, ), and so on, which make the language verbose but at the same time provide additional documentation for readers, making the code easier to understand and debug (arnold2000java, ). Additionally, Java has a history of over 20 years. Its long commercial life and wide adoption have created a robust ecosystem of documentation, libraries, and frameworks. This may also contribute to Java's good performance in bug-handling efficiency.
Findings on Go. Similar to Java, Go tends to require more line/file modification and less bug-handling time in terms of absolute values, but its relative values are small across the board. This reinforces our understanding that the elaborate requirements for declaring variable types, method parameter types, return types, and so on, which Java and Go share, may cause a large number of line modifications while making debugging relatively quick. The difference between the two languages is that Go projects are much larger than Java's (see Figure 1), resulting in the lower relative values.
Findings on Python and PHP. Python and PHP need less absolute line/file modification as well as time. Python is widely recognized to have a large set of scientific libraries and a very active community, which make it easier for developers to find support during bug handling. It is also reported that there has been a trend in the Python community to improve code quality by dictating “one right way” (startup, ). This maturity of community and the effort of adhering to best practices is likely to facilitate bug handling.
PHP is also a mature language which has a vast ecosystem of developers, frameworks and libraries. The quality of the projects using PHP has the reputation of being more polarised, ranging “from horrible to awesome” (startup, ). In our study, PHP performs very well. This might be due to the fact that we select the most popular projects as the analysis target, which are likely to be in the “awesome” bucket. Further discussion of this potential bias can be found in Section 6.
Findings on Ruby. In contrast to Go, Ruby tends to require less absolute line/file modification but more bug-handling time, and its relative measurements are large across the board. As a dynamic language, Ruby is designed to make programming a “pleasant” and “productive” experience (pythongood, ); it does not impose hard rules on writing code and is very close to spoken language (flanagan2008ruby, ). Such features make Ruby code short and expressive, but they also make debugging more difficult. One example of Ruby's flexible features is “monkey patching”, which refers to extending or modifying existing code by changing classes at run time. It is a powerful technique that has become popular in the Ruby community: any class can be re-opened at any time and amended in any way. However, such flexible monkey patching may lead to hard-to-diagnose clashes (monkeypatching, ).
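Since Python (also studied here) supports the same idiom, a short Python sketch can illustrate how monkey patching silently changes behaviour at run time (the class and function names are invented):

```python
# Python analogue of Ruby-style monkey patching: any class can be
# re-opened and amended at run time, which is flexible but makes the
# origin of behaviour harder to trace during debugging.
class Greeter:
    def greet(self):
        return "hello"

def shouty_greet(self):
    return "HELLO"

# The patch may live far from the class definition, e.g. in another module.
Greeter.greet = shouty_greet

g = Greeter()
print(g.greet())  # "HELLO" -- the original behaviour silently changed
```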
Ruby's compiler does not expose many bugs, and allows some problematic programs to compile and execute. This results in a certain form of technical debt (a “concept in programming that reflects the extra development work that arises when code that is easy to implement in the short run is used instead of applying the best overall solution” (kruchten2012technical, )), and in complex bugs that are hard to diagnose. Moreover, as Ruby programs are usually not large (as shown in Figure 1), its relative measurements are usually high. Additionally, Ruby is community driven, which means quality documentation and support can be more difficult to find.
Comparison between similar languages. When comparing C++ and Objective-C, we observe that the former requires less bug-handling effort (in both line modification and time) than the latter. We suspect that this is because Objective-C mixes static and dynamic typing, whereas plain C++ objects are always statically typed, which simplifies understanding. Regarding Python and Ruby, the former requires less bug-handling effort than the latter. As discussed above, we suspect that this is partly due to Ruby's relentless pursuit of flexibility, which may result in hard-to-track bugs (rubyvspython, ). On the other hand, Python takes a more direct approach to programming, with light and uncluttered syntax. This sacrifices some of the “coolness” Ruby has, but gives Python a big advantage when it comes to debugging. Regarding Java and C#, the former requires slightly less bug-handling effort than the latter. One reason may be that C# is more flexible than Java, allowing anonymous objects to be created, returned, and stored at run time.
Another pattern we can observe is that Go, Java, and C# all require more line/file modification but much less bug-handling time, indicating the inconsistency between the two criteria, which leads to the following finding.
Finding 2: Languages requiring more line/file modification do not necessarily need more bug-handling time. We think this finding may partially explain the contradictory views on the impact of programming languages on bug-handling effort found in online discussions (Section 2) and previous work (Section 7). That is, programmers or researchers may have used different measurement criteria, e.g., the amount of line modification or the amount of time spent on bug handling, and consequently drawn very different conclusions. For example, Kleinschmager et al. (kleinschmager2012static, ) and Hanenberg et al. (hanenberg2014empirical, ) showed that static languages have lower bug-handling effort because their empirical studies used bug-handling time as the only measurement criterion, while Tratt et al. (tratt2009dynamically, ) criticized static languages because they require more complex code modification.
4.2. RQ2: Bug-Handling Effort among Language Categories
To answer the second research question, we check the multiple-regression results for the different language categories, shown in Table 5. From the table, dynamic languages require less absolute code modification, whereas static languages, as well as strong languages, tend to have less bug-handling time.
Significance codes: p-value < 0.001: ‘***’; p-value < 0.01: ‘**’; p-value < 0.05: ‘*’; p-value < 0.1: ‘.’
 The R-squared values for line-modification and time prediction are above 0.90. The R-squared values for file-modification and time prediction are above 0.50.
These observations can be summarized into the following finding. Finding 3: Static languages tend to require more absolute line and file modification. Weak/dynamic languages tend to require more bug-handling time. The former observation is likely because dynamic languages are typically less verbose than static ones, avoiding type declarations on variables, parameters, and return values; the latter may be because the compilers of strong languages detect bugs earlier, eliminating some tough bugs and reducing technical debt.
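The multiple-regression setup behind these findings can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's actual dataset: the variable names, coefficients, and noise level are all made up, and we solve the normal equations directly rather than using a statistics package.

```python
import random

random.seed(0)
n = 200
rows, y = [], []
for _ in range(n):
    sloc = random.uniform(1e4, 1e6)         # control: project size
    commits = random.uniform(100, 10000)    # control: #commits
    dyn = random.randint(0, 1)              # 1 = dynamic language (illustrative)
    noise = random.gauss(0, 5)
    rows.append([1.0, sloc, commits, dyn])  # intercept + controls + indicator
    # Synthetic response: dynamic languages modify fewer lines (true effect -8).
    y.append(50 + 0.0001 * sloc + 0.002 * commits - 8 * dyn + noise)

# Solve the normal equations (X^T X) beta = X^T y by Gauss-Jordan elimination.
k = 4
A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
     + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
for c in range(k):
    p = max(range(c, k), key=lambda r: abs(A[r][c]))
    A[c], A[p] = A[p], A[c]
    for r in range(k):
        if r != c:
            f = A[r][c] / A[c][c]
            A[r] = [a - f * b for a, b in zip(A[r], A[c])]
beta = [A[i][k] / A[i][i] for i in range(k)]

# R-squared, as reported alongside the regressions in the paper.
yhat = [sum(b * x for b, x in zip(beta, r)) for r in rows]
ybar = sum(y) / n
r2 = 1 - sum((a - b) ** 2 for a, b in zip(y, yhat)) / sum((a - ybar) ** 2 for a in y)
print(round(beta[3], 2), round(r2, 3))  # language coefficient near -8
```

Because the controls (size, #commits) are in the design matrix, the language indicator's coefficient estimates the language effect net of those factors, which is the sense in which the paper treats them as control variables.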
Note that although language categories do impact bug-handling effort, our results indicate that no absolute conclusion can be drawn. In other words, it is unreliable to judge a language's bug-handling-effort level based solely on its category; for example, Ruby is strongly typed, yet has high bug-handling time.
4.3. RQ3: Impact of Domains
To investigate the impact of domains on bug-handling effort, similar to previous work (ray2014large, ) we divide the target projects into different domains (i.e., application, database, code-analyzer, middleware, library, framework, and others; see Table 3). For each domain, we consider only languages that have more than 5 projects, and use multiple regression as the analysis technique. Based on the coefficients derived in this new setting, we rank the languages, and compare the new rankings with the previous one (without considering domains) in Table 4. The comparison results (in Table 6) demonstrate the difference in bug-handling effort for a programming language between its overall usage (i.e., including all domains) and its usage in a specific domain. Due to the space limit, we present only the results for bug-handling time. The full results are available on our homepage (omitted for double-blind review).
Only three domains have enough projects for an interesting number of languages. The first column shows the languages; the remaining columns are the coefficients of each language in the new multiple regression within each domain, where “-” represents omitted languages that do not have more than 5 projects in the domain. The values inside the brackets are the changes in ranking. (When there are fewer than 10 languages in a domain, the original ranking is updated by removing the absent languages.) For example, in the “Application” domain, C++ has the smallest coefficient and thus ranks first, while in Table 4 C++ ranks fifth among the seven languages. Thus, C++ projects in the “Application” domain tend to have less bug-handling time than C++ projects in other domains.
From the table, we have the following finding.
Finding 4: The impacts of programming languages on bug-handling effort are different among different domains.
In the future, we will use more projects in each domain to further investigate the impact of domains on bug-handling effort.
|Language||Application||(domain 2)||(domain 3)|
|C||3.051 (1)||–||2.994 (-1)|
|C#||3.151 (0)||1.548 (-2)||3.428 (-5)|
|C++||2.889 (4)||1.678 (-2)||3.623 (-5)|
|Go||3.375 (-3)||1.331 (-2)||–|
|Java||3.438 (-5)||–||2.551 (0)|
|Python||3.121 (3)||1.3108 (3)||3.065 (1)|
|Ruby||–||1.720 (2)||3.344 (2)|
|Objective-C||–||1.931 (-1)||3.146 (3)|
|PHP||–||1.330 (3)||2.980 (4)|
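The per-domain re-ranking described above can be sketched as follows. The `application` coefficients come from the table's first column; the `overall` values are made-up stand-ins for Table 4 (the real values are not reproduced here), so the computed rank changes are illustrative only.

```python
# Hypothetical overall coefficients (Table 4 stand-ins; illustrative values).
overall = {"C": 3.0, "C#": 3.2, "C++": 3.3, "Go": 2.9, "Java": 2.8, "Python": 3.1}
# "Application" domain coefficients, as in the table above.
application = {"C": 3.051, "C#": 3.151, "C++": 2.889,
               "Go": 3.375, "Java": 3.438, "Python": 3.121}

def rank(coeffs):
    # Smaller coefficient -> less bug-handling time -> better (lower) rank.
    ordered = sorted(coeffs, key=coeffs.get)
    return {lang: i + 1 for i, lang in enumerate(ordered)}

# Restrict the overall ranking to languages present in the domain, then
# report the change in rank, as in the bracketed values of the table.
base = rank({l: c for l, c in overall.items() if l in application})
dom = rank(application)
changes = {l: base[l] - dom[l] for l in application}
print(changes)  # positive = the language moved up within this domain
```

With these illustrative overall values, C++ jumps from last overall to first within the domain, mirroring the kind of shift the table reports.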
4.4. RQ4: Contribution to Bug-Handling-Effort Prediction
In the preceding analyses, we have concluded that programming languages may affect bug-handling effort, including line modification and bug-handling time. We now investigate whether this newly gained knowledge may help with the bug-handling-effort prediction problem, which is well-recognized as an important but difficult problem (weiss2007long, ). One category of prediction estimates the handling time of a specific bug in a project (weiss2007long, ; zhang2013predicting, ). For multi-language projects, bugs in different languages may have different handling times, but no previous work has considered the impact of programming languages. The other category predicts the general level of bug-handling effort of a project, rather than of a specific bug (hayes2005maintainability, ; wohl1982maintainability, ). As far as we are aware, no such work has considered the impact of programming languages either.
In this section, we empirically investigate whether considering programming languages can contribute to bug-handling-effort prediction. In particular, we build a toy classification model to predict the general level of bug-handling time of a project, i.e., whether a project has high, medium, or low bug-handling effort, based on SLOC, #commit, age, and contributor (the number of developers). (In this study we use 60 and 180 hours as thresholds to distinguish high/medium/low bug-handling effort, so as to place almost the same number of subjects in each class: 110 high, 100 medium, and 140 low. We do not use the complex existing prediction approaches because this paper is not about delivering a realistic prediction model.) We compare the effectiveness of this predictive model with and without programming language as a feature. We use the Naive Bayes algorithm for classification, and 10-fold cross validation for evaluation.
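The shape of this experiment can be sketched with a from-scratch Gaussian Naive Bayes on synthetic data. Everything here is illustrative: the features, the language ids, and the planted relationship between language and effort level are made up, and the paper's real dataset and feature encoding are not reproduced. The sketch only demonstrates the mechanics of comparing cross-validated accuracy with and without a language feature.

```python
import math
import random
from collections import defaultdict

random.seed(1)
n = 600
data = []
for _ in range(n):
    lang = random.randrange(3)            # hypothetical language id
    control = random.gauss(0, 1)          # uninformative control feature
    # Synthetic label (effort level), driven mainly by the language.
    label = (lang + (1 if random.random() < 0.15 else 0)) % 3
    data.append((lang, control, label))

def nb_cv_accuracy(features, folds=10):
    # Minimal Gaussian Naive Bayes with 10-fold cross validation.
    idx = list(range(n))
    random.Random(2).shuffle(idx)
    correct = 0
    for f in range(folds):
        test = set(idx[f::folds])
        groups = defaultdict(list)
        for i in idx:
            if i not in test:
                groups[data[i][2]].append([data[i][j] for j in features])
        total = sum(len(g) for g in groups.values())
        stats = {}
        for c, rows in groups.items():
            cols = list(zip(*rows))
            mu = [sum(col) / len(col) for col in cols]
            sd = [max((sum((x - m) ** 2 for x in col) / len(col)) ** 0.5, 1e-6)
                  for col, m in zip(cols, mu)]
            stats[c] = (mu, sd, len(rows) / total)
        for i in test:
            x = [data[i][j] for j in features]
            best = max(stats, key=lambda c: math.log(stats[c][2]) - sum(
                math.log(s) + 0.5 * ((v - m) / s) ** 2
                for v, m, s in zip(x, stats[c][0], stats[c][1])))
            correct += best == data[i][2]
    return correct / n

without_lang = nb_cv_accuracy([1])     # control feature only
with_lang = nb_cv_accuracy([0, 1])     # plus the language feature
print(without_lang, with_lang)         # language feature raises accuracy
```

Since the synthetic label is planted to depend on the language, the accuracy gap here is by construction; in the paper the analogous gap is an empirical result (Table 7).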
The results are shown in Table 7. From the table, when programming languages are considered, the effectiveness of prediction improves notably: for example, the prediction precision improves by 18.8% and the AUC improves by 5.5%.
Finding 5: The inclusion of programming languages as a factor will improve the effectiveness of bug-handling-effort prediction.
5. Discussion
In this section, we discuss several implications of our results. As explained in the introduction, although the findings of our correlation analysis may not fully generalize to the underlying causality, nor be thoroughly interpretable, they nevertheless provide suggestions and guidance for developers and researchers.
5.1. Implications for Developers and Managers
Our results provide support for developers choosing languages, particularly when bug-handling effort is a concern. Of course, choosing a programming language for a project is a complex process, involving a variety of factors that may or may not be technical. We do not claim that the results in this paper are in any way sufficient to settle this question, but the findings clearly indicate that the choice of programming language has a noticeable impact on bug-handling effort, and could be used by programmers as part of the consideration.
5.2. Implications for Researchers
Our results could provide the following guidelines for researchers.
First, languages could be considered in research on automatic bug-handling-effort prediction, a problem that has long been recognized as difficult, but with broad practical benefits (kaur2014software, ). Many researchers (kaur2014software, ; riaz2009systematic, ; wohl1982maintainability, ; hayes2005maintainability, ) have made dedicated efforts to improve the precision of such predictions. However, none of the existing work has considered the impact of programming languages, which we think is a missed opportunity. In Section 4.4, we conducted an experiment with a very simple model and demonstrated that predictive accuracy can indeed be improved using data collected for different programming languages. The model we used is obviously too simple for serious prediction, but the positive result nevertheless reaffirms our findings and suggests a path toward more accurate automatic prediction.
Second, research on automatic bug fixing may need to consider different patch sizes for different languages. Judging by the amount of line and file modification required, our results suggest that larger patches may be needed when automatically fixing bugs in C#, Java, and Go. These languages also need a larger search space (across more lines and files) for finding proper patches, and thus may be more challenging to handle than others.
Moreover, languages requiring more bug-handling time, such as Ruby and Objective-C, are more costly to maintain, and therefore should be a focus of automatic debugging and fixing research, as there is more to be gained.
6. Threats, and Efforts to Reduce Them
The threat to internal validity lies in the implementation of the study. To reduce this threat, the first three authors independently reviewed the experimental scripts of the empirical study to ensure correctness.
The threats to external validity mainly lie with the subjects. We decided to pick the most popular projects in each language, which by definition are not representative. However, we believe it is more interesting to study the best efforts of the communities; the alternative of randomly selecting projects is likely to pollute the data with non-serious projects, which this study aims to avoid. The large number of projects used in our experiment also helps reduce this threat.
The threats to construct validity lie in how we accurately reflect the impact of languages. To reduce this threat, we have made a range of efforts.
Large dataset and multiple measurement metrics. Our experiment is of a large scale, and we employ a variety of metrics to measure the impact of languages. The thinking is that while each metric alone may not be a sufficient proxy for bug-handling effort, together they complement each other. In particular, as we have seen, the use of two categories of metrics (modification and time) resulted in a comprehensive set of findings, which more accurately reflect the complex nature of bug handling.
Data validation. We pay special attention to the validity of our dataset. We took a random sample of the collected data, involving 585 commits from all selected projects, and manually checked them. We found that 90% of the sample is clean (i.e., involving only the fixing of a single bug, with all code modification related to the bug fix), indicating a high degree of data validity. Moreover, we use the interval between the opening time and the time of the last comment as bug-handling time, which has been shown to be a more accurate measurement of bug-handling time than the seemingly more obvious interval between the opening and closing times (Zheng:2015:MIC:2786805.2786866, ).
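The bug-handling-time measurement described here can be sketched as follows. The issue record and its field names are illustrative, not a real issue-tracker schema; the point is that the last comment, not the closing time, bounds the interval.

```python
from datetime import datetime

# Hypothetical issue record with ISO-8601 timestamps; field names are
# illustrative, not a real issue-tracker API schema.
issue = {
    "opened_at": "2017-03-01T09:00:00",
    "comments": ["2017-03-01T10:30:00", "2017-03-03T17:00:00"],
    "closed_at": "2017-04-20T08:00:00",  # often long after the fix is done
}

opened = datetime.fromisoformat(issue["opened_at"])
last_comment = max(datetime.fromisoformat(t) for t in issue["comments"])

# Bug-handling time: opening time to last comment, in hours.
handling_hours = (last_comment - opened).total_seconds() / 3600
print(handling_hours)  # 56.0, vs. ~1200 hours if closing time were used
```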
Multiple analysis approaches. To reduce the risk of bias caused by any single analysis approach, we adopt three different ones: direct observation, median-value analysis, and multiple-regression analysis. The consistency among the results of the different analyses affirms the reliability of each. Moreover, we control variables by considering both absolute and relative measurements, and by treating four well-known influential factors as control variables in the multiple regression.
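The rationale for the median-value analysis (also noted in the introduction) can be seen in a tiny example: with skewed data, one extreme project distorts the mean but barely moves the median. The numbers below are made up for illustration.

```python
import statistics

# Hypothetical per-project bug-handling times in hours; one extreme outlier.
times = [2, 3, 4, 5, 6, 400]

# The mean (70) is pulled far from the bulk of the data; the median (4.5)
# is not, which is why median values are used across projects and commits.
print(statistics.mean(times), statistics.median(times))
```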
7. Related Work
Besides the studies on bug-handling effort discussed in Section 2, other work compares programming languages in other respects, particularly software quality (i.e., the number of bugs introduced rather than the effort of handling them). Phipps (phipps1999comparing, ) conducted an experiment comparing programmer productivity and defect rate for Java and C++, and concluded that Java is superior. Daly et al. (daly2009work, ) empirically compared programmer behavior under the standard Ruby interpreter and DRuby, which adds static type checking to Ruby. They found that “DRuby’s warnings rarely provided information about potential errors”. Hanenberg et al. (hanenberg2010experiment, ) conducted an empirical study on the impact of a static type system on the development of a parser. The results show that “the static type system has neither a positive nor a negative impact on an application’s development time”. Harrison et al. (harrison1996comparing, ) conducted a quantitative evaluation of the functional and object-oriented paradigms to investigate the code-quality difference between them. They found no significant difference in direct measures of development metrics, such as the number of known errors, but found significant differences in indirect measures, such as the number of known errors per thousand non-comment source lines. Kochhar et al. (kochhar2016large, ) studied the effect of multi-language settings on code quality, and found that projects using multiple languages are more error-prone. Ray et al. (ray2014large, ) investigated the effects of different programming languages on code quality. Their results indicate that strong languages have better code quality than weak languages.
-  Grace Lewis, Daniel Plakosh, and Robert Seacord. Modernizing Legacy Systems: Software Technologies, Engineering Processes, and Business Practices. Addison-Wesley Professional, 2003.
-  Jeff Sutherland. Business objects in corporate information systems. ACM Comput. Surv., 27(2):274–276, June 1995.
-  Magne Jorgensen and Martin Shepperd. A systematic review of software development cost estimation studies. IEEE Transactions on software engineering, 33(1):33–53, January 2007.
-  Ashwin Pajankar. Python unit test automation. 2017.
-  Why do dynamic languages make it more difficult to maintain large codebases? http://softwareengineering.stackexchange.com/questions/221615/why-do-dynamic-languages-make-it-more-difficult-to-maintain-large-codebases.
-  Which programming language is the best for maintaining software? https://www.quora.com/Which-programming-language-is-the-best-for-maintaining-software.
-  Oscar Nierstrasz, Alexandre Bergel, Marcus Denker, Stéphane Ducasse, Markus Gälli, and Roel Wuyts. On the revival of dynamic languages. In Proc. ICSC, pages 1–13. Springer, 2005.
-  Pamela Bhattacharya and Iulian Neamtiu. Assessing programming language impact on development and maintenance: A study on c and c++. In Proc. ICSE, pages 171–180. IEEE, 2011.
-  Sebastian Kleinschmager, Romain Robbes, Andreas Stefik, Stefan Hanenberg, and Eric Tanter. Do static type systems improve the maintainability of software systems? an empirical study. In Proc. ICPC, pages 153–162. IEEE, 2012.
-  Stefan Hanenberg, Sebastian Kleinschmager, Romain Robbes, Éric Tanter, and Andreas Stefik. An empirical study on the impact of static typing on software maintainability. Empirical Software Engineering, 19(5):1335–1382, 2014.
-  Tegawendé F Bissyandé, David Lo, Lingxiao Jiang, Laurent Reveillere, Jacques Klein, and Yves Le Traon. Got issues? who cares about it? a large scale investigation of issue trackers from github. In Proc. ISSRE, pages 188–197. IEEE, 2013.
-  Thomas H Wonnacott and Ronald J Wonnacott. Introductory statistics, volume 19690. Wiley New York, 1972.
-  The Mean and Median: Measures of Central Tendency. http://stattrek.com/descriptive-statistics/central-tendency.aspx?Tutorial=AP.
-  Andrea Arcuri and Xin Yao. A novel co-evolutionary approach to automatic software bug fixing. In Proc. CEC, pages 162–168. IEEE, 2008.
-  Daniel Hansson. Automatic bug fixing. In Proc. MTV, pages 26–31. IEEE, 2015.
-  Andrea Arcuri. On the automation of fixing software bugs. In Proc. ICSE, pages 1003–1006. ACM, 2008.
-  Marvin Steinberg. What is the impact of static type systems on maintenance tasks? An empirical study of differences in debugging time using statically and dynamically typed languages. PhD thesis, Master Thesis, University of Duisburg-Essen, 2011.
-  Laurence Tratt. Dynamically typed languages. Advances in Computers, 77:149–184, 2009.
-  Michel F Sanner et al. Python: a programming language for software integration and development. J Mol Graph Model, 17(1):57–61, 1999.
-  Travis E Oliphant. Python for scientific computing. Computing in Science & Engineering, 9(3), 2007.
-  Which programming language you should use for a web backend. http://rz.scale-it.pl/2013/03/08/which_programming_language_should_you_use_for_a_web_backend.html.
-  Eight reasons c sharp is the best language for mobile development. https://blog.xamarin.com/eight-reasons-c-sharp-is-the-best-language-for-mobile-development/.
-  How does Python’s lack of static typing affect maintainability and extensibility in larger projects? http://stackoverflow.com/questions/3671827/how-does-pythons-lack-of-static-typing-affect-maintainability-and-extensibility.
-  Which language has the most maintainable/reusable code? https://www.quora.com/Which-language-has-the-most-maintainable-reusable-code.
-  Python Overview. https://www.tutorialspoint.com/python/python_overview.htm.
-  Static vs. dynamic typing of programming languages. https://pythonconquerstheuniverse.wordpress.com/2009/10/03/static-vs-dynamic-typing-of-programming-languages/.
-  WHAT IS THE BEST PROGRAMMING LANGUAGE FOR ME? http://www.bestprogramminglanguagefor.me.
-  Thoughts on programming language and maintainability. http://geekswithblogs.net/FrostRed/archive/2008/09/10/125055.aspx.
-  10 most popular programming languages. http://opensourceforu.com/2017/03/most-popular-programming-languages/.
-  The Most Popular Programming Languages for 2017. https://blog.appdynamics.com/engineering/the-most-popular-programming-languages-for-2017/.
-  10 Most Popular Programming Languages Today. https://www.inc.com/larry-kim/10-most-popular-programming-languages-today.html.
-  10 Best Programming Languages That You Need To Learn In 2017. https://fossbytes.com/best-popular-programming-languages-2017/.
-  Developer Survey Results 2017. https://insights.stackoverflow.com/survey/2017.
-  Baishakhi Ray, Daryl Posnett, Vladimir Filkov, and Premkumar Devanbu. A large scale study of programming languages and code quality in github. In Proc. FSE, pages 155–165. ACM, 2014.
-  GitHub. https://GitHub.com.
-  Github developer document. https://developer.github.com/v3/search/#search-repositories.
-  CLOC. http://cloc.sourceforge.net/.
-  GitHub API. https://developer.github.com/v3//.
-  Yasutaka Kamei, Emad Shihab, Bram Adams, Ahmed E Hassan, Audris Mockus, Anand Sinha, and Naoyasu Ubayashi. A large-scale empirical study of just-in-time quality assurance. IEEE Transactions on Software Engineering, 39(6):757–773, 2013.
-  Qimu Zheng, Audris Mockus, and Minghui Zhou. A method to identify and correct problematic software activity data: Exploiting capacity constraints and data redundancies. In Proc. FSE, pages 637–648, 2015.
-  Peter Kampstra et al. Beanplot: A boxplot alternative for visual comparison of distributions. Journal of statistical software, 28(1):1–9, 2008.
-  KS Srinivasan and David Ebenezer. A new fast and efficient decision-based algorithm for removal of high-density impulse noises. IEEE signal processing letters, 14(3):189–192, 2007.
-  Peter H Westfall and S Stanley Young. Resampling-based multiple testing: Examples and methods for p-value adjustment, volume 279. John Wiley & Sons, 1993.
-  Ben James Winer, Donald R Brown, and Kenneth M Michels. Statistical principles in experimental design, volume 2. McGraw-Hill New York, 1971.
-  Jerry L Hintze and Ray D Nelson. Violin plots: a box plot-density trace synergism. The American Statistician, 52(2):181–184, 1998.
-  Scott J Broussard and Eduardo N Spring. Method and system for isolating exception related errors in java jvm, April 25 2006. US Patent 7,036,045.
-  Ken Arnold, James Gosling, David Holmes, and David Holmes. The Java programming language, volume 2. Addison-wesley Reading, 2000.
-  Choosing the Right Programming Language for Your Startup. https://medium.com/aws-activate-startup-blog/choosing-the-right-programming-language-for-your-startup-b454be3ed5e2.
-  David Flanagan and Yukihiro Matsumoto. The Ruby Programming Language: Everything You Need to Know. O’Reilly Media, Inc., 2008.
-  Monkey-patching. http://stackoverflow.com/questions/5741877/is-monkey-patching-really-that-bad.
-  Philippe Kruchten, Robert L Nord, and Ipek Ozkaya. Technical debt: From metaphor to theory and practice. Ieee software, 29(6):18–21, 2012.
-  Ruby vs. Python. http://learn.onemonth.com/ruby-vs-python.
-  Cathrin Weiss, Rahul Premraj, Thomas Zimmermann, and Andreas Zeller. How long will it take to fix this bug? In Proc. MSR, page 1. IEEE Computer Society, 2007.
-  Hongyu Zhang, Liang Gong, and Steve Versteeg. Predicting bug-fixing time: an empirical study of commercial software projects. In Proc. ICSE, pages 1042–1051. IEEE Press, 2013.
-  Jane Huffman Hayes and Liming Zhao. Maintainability prediction: a regression analysis of measures of evolving systems. In Proc. ICSM, pages 601–604. IEEE, 2005.
-  Joseph G Wohl. Maintainability prediction revisited: diagnostic behavior, system complexity, and repair time. IEEE Transactions on Systems, Man, and Cybernetics, 12(3):241–250, 1982.
-  Arvinder Kaur, Kamaldeep Kaur, and Kaushal Pathak. Software maintainability prediction by data mining of software code metrics. In Proc. ICDMIC, pages 1–6. IEEE, 2014.
-  Mehwish Riaz, Emilia Mendes, and Ewan Tempero. A systematic review of software maintainability prediction and metrics. In Proc. ESME, pages 367–377. IEEE Computer Society, 2009.
-  Geoffrey Phipps. Comparing observed bug and productivity rates for java and c++. Softw., Pract. Exper., 29(4):345–358, 1999.
-  Mark T Daly, Vibha Sazawal, and Jeffrey S Foster. Work in progress: an empirical study of static typing in ruby. 2009.
-  Stefan Hanenberg. An experiment about static and dynamic type systems: Doubts about the positive impact of static type systems on development time. In ACM Sigplan Notices, volume 45, pages 22–35. ACM, 2010.
-  R Harrison, LG Samaraweera, Mark R Dobie, and Paul H Lewis. Comparing programming paradigms: an evaluation of functional and object-oriented programs. Software Engineering Journal, 11(4):247–254, 1996.
-  Pavneet Singh Kochhar, Dinusha Wijedasa, and David Lo. A large scale study of multiple programming languages and code quality. In Proc. SANER, volume 1, pages 563–573. IEEE, 2016.