The rising size and complexity of software multiply the demands put on adequate software testing. Consequently, software development projects increasingly adopt unit testing or even test-driven development as a way to identify and correct program faults early in the construction process. However, the development of testing code does not come for free. Researchers have identified the increased development time as one of the key reasons for the limited adoption of test-driven development. It is therefore natural to wonder whether the investment in testing a program’s code pays back through fewer faults or failures.
One can investigate the relationship between the software’s production code and its tests by utilizing code coverage analysis. This is a process that provides metrics indicating to what extent code has been executed under various control flow measures. The corresponding metrics can be efficiently obtained through diverse tools [16, 7]. The process involves running the software’s test suite and obtaining test code coverage metrics, which in this case indicate which code is tested and which is not.
Then, to examine how test code coverage relates to software quality, numerous methods can be employed. One can look at corrected faults and see whether the corresponding code was tested or not [13, 11]. In addition, faults can be deliberately introduced by mutating the code in order to look at how test code coverage relates to test suite effectiveness [6, 8]. Alternatively, one can artificially vary test coverage to see its effect on exposing known faults. Finally, one could look at software failures rather than faults and correlate these with code coverage.
In this study we investigate the relationship between unit testing and failures by examining the usage of unit testing on code that is associated with failures in the field. We do this in three conceptual steps. First, we run software tests under code coverage analysis to determine which methods have been unit tested and to what extent. Then, we analyze the stack traces associated with software failure reports to determine which methods were associated with a specific failure. Finally, we combine the two result sets to see whether unit-tested methods are more reliable than those that are not.
We frame our investigation in this context through the following research question.
RQ: How does test coverage at the method and line level relate to observed failures?
A finding of fewer failures associated with tested code would support the theory that unit testing is effective in improving software reliability. Failing to see such a relationship would mean that further research is required in the areas of unit test effectiveness (why were specific faults not caught by unit tests) and test coverage analysis (how can coverage criteria be improved to expose untested faults).
The main contributions of our study are a method for investigating the effectiveness of unit testing, an empirical evaluation of the relationship between unit test coverage and failure reports, and an open science data set providing empirical backing for our findings.
In the following sections we describe the methods we used (Section II), present our results (Section III), discuss their implications (Section IV), examine the threats to the validity of our findings (Section V), outline related work in this area (Section VI), and conclude with a summary of our findings and their implications (Section VII).
To answer our research question we collected data regarding failures, determined the most popular software version associated with the failures, built this specific software version, ran the provided tests under a code coverage analysis tool, and joined the analyzed software failures with the corresponding code coverage analysis results. The data associated with this endeavor (code coverage analysis, stack traces analyzed, and combined results) are openly available online (https://doi.org/10.5281/zenodo.2546790).
II-A Analysis of Stack Trace Data
We conducted our research on a stack trace dataset from the Eclipse project (https://software-data.org/datasets/aeri-stacktraces/resources/incidents_analysis.pdf), and more specifically on the subset titled “All Incidents”. This consists of 2,083,979 crash reports provided in the form of JSON files (http://software-data.org/datasets/aeri-stacktraces/downloads/incidents_full.tar.bz2).
The data attributes that are interesting for the purpose of our research are the following:
EclipseProduct, the product associated with the Eclipse project,
BuildId, the version of the Eclipse source code, and
Stacktrace, the incident’s stack trace. Each stack trace consists of successive stack frames and their details (class, method, line).
Through an initial analysis of the stack traces we located the release appearing in them most often, in order to determine which product version to analyze in detail. (Given that production releases are widely distributed, numerous failures associated with a release are a sign of the release’s popularity rather than its inherent instability.) The data corresponding to the selected version consist of 126,026 incident files that have EclipseProduct equal to org.eclipse.epp.package.java.product and BuildId equal to 4.5.2.M20160212-1500.
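The release selection described above can be sketched as a simple frequency count. The following is a minimal illustration, assuming each incident has already been loaded into a dictionary keyed by the attribute names listed earlier; the exact key spelling in the JSON files, the function name, and the sample records are assumptions made for the example.

```python
from collections import Counter


def most_common_release(incidents):
    """Return ((EclipseProduct, BuildId), count) for the most frequent pair."""
    counts = Counter(
        (inc["EclipseProduct"], inc["BuildId"])
        for inc in incidents
        # Skip incidents that lack either attribute.
        if inc.get("EclipseProduct") and inc.get("BuildId")
    )
    return counts.most_common(1)[0]


# Hypothetical sample records; real incidents also carry a Stacktrace field.
sample_incidents = [
    {"EclipseProduct": "org.eclipse.epp.package.java.product",
     "BuildId": "4.5.2.M20160212-1500"},
    {"EclipseProduct": "org.eclipse.epp.package.java.product",
     "BuildId": "4.5.2.M20160212-1500"},
    {"EclipseProduct": "org.eclipse.epp.package.jee.product",
     "BuildId": "4.5.1.M20150904-0015"},
    {"EclipseProduct": None, "BuildId": "4.5.2.M20160212-1500"},
]

release, count = most_common_release(sample_incidents)
```

On the full data set the same count would be taken over all incident files rather than over an in-memory list.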
II-B Generation of Code Coverage Data
We accessed the Eclipse source code through the Eclipse Platform Releng project (https://wiki.eclipse.org/Platform-releng/Platform_Build), which provides instructions for building the Eclipse Platform using preferred technologies identified as part of the Eclipse Common Build Infrastructure (CBI) initiative. This combines infrastructure, technologies, and practices for building Eclipse software. To ensure that test coverage results would be coeval with the corresponding incident reports, we retrieved the source code version of Eclipse corresponding to the one whose stack traces we chose to analyze.
In order to obtain data regarding Eclipse’s test coverage, we used the JaCoCo Code Coverage system [7], which is an open-source toolkit for measuring and reporting Java code coverage. It offers instruction, line, and branch coverage.
Eclipse is a multi-module project, which hinders the derivation of code coverage reports, because the JaCoCo Maven goals used to work on single modules only. For that reason, we used the new “Maven Multi-Module Build” feature (https://github.com/jacoco/jacoco/wiki/MavenMultiModule), which implements a new Maven goal called “jacoco:report-aggregate”. This aggregates coverage data across Maven modules.
In order to apply this feature, we first added the JaCoCo plugin and profile in the Maven parent pom.xml file, and then we created a separate project where we:
configured the report-aggregate goal,
added as dependencies with scope compile the projects containing the actual code and with scope test the projects containing the tests and the .exec-suffixed data.
II-C Matching Stack Traces with Code Coverage
To match the stack trace methods with their code coverage we followed a simple three-step algorithm.
Process the incident files of the dataset, extracting all methods from the stack traces together with their order of appearance.
Process the XML file generated by the JaCoCo coverage report, extracting all methods together with their code coverage data.
Join the common methods of the two preceding lists into a new list containing the combined fields.
The resulting output has the following elements: class name; method name; attributes; test-covered lines; total lines; and whether the method appears within the top-10 stack frames, the top-6 stack frames, or in the very first stack frame.
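The three steps above can be sketched as follows. This is a minimal illustration rather than the scripts used in the study: the helper names, the in-memory representation of stack traces as lists of (class, method) pairs, and the sample data are assumptions. The XML layout follows the JaCoCo report format, in which each method element carries counter elements with missed and covered attributes. Methods are keyed by class and name only, so overloaded methods collapse into one entry in this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed JaCoCo XML report.
JACOCO_XML = """<report name="demo">
  <package name="org/example">
    <class name="org/example/Parser">
      <method name="parse" desc="(Ljava/lang/String;)V" line="10">
        <counter type="LINE" missed="2" covered="6"/>
      </method>
      <method name="reset" desc="()V" line="30">
        <counter type="LINE" missed="4" covered="0"/>
      </method>
    </class>
  </package>
</report>"""


def coverage_by_method(xml_text):
    """Map (class, method) -> (covered_lines, total_lines)."""
    coverage = {}
    for cls in ET.fromstring(xml_text).iter("class"):
        class_name = cls.get("name").replace("/", ".")
        for method in cls.findall("method"):
            for counter in method.findall("counter"):
                if counter.get("type") == "LINE":
                    covered = int(counter.get("covered"))
                    missed = int(counter.get("missed"))
                    key = (class_name, method.get("name"))
                    coverage[key] = (covered, covered + missed)
    return coverage


def frame_positions(stack_traces):
    """Map (class, method) -> smallest 0-based frame index at which it appears."""
    best = {}
    for frames in stack_traces:
        for index, key in enumerate(frames):
            if key not in best or index < best[key]:
                best[key] = index
    return best


def join_methods(coverage, positions):
    """Keep methods present in both inputs and combine their fields."""
    rows = []
    for key in sorted(coverage.keys() & positions.keys()):
        covered, total = coverage[key]
        depth = positions[key]
        rows.append({
            "class": key[0], "method": key[1],
            "covered_lines": covered, "total_lines": total,
            "in_first": depth == 0, "in_top6": depth < 6, "in_top10": depth < 10,
        })
    return rows


# Two hypothetical stack traces, topmost frame first.
traces = [
    [("org.example.Parser", "parse"), ("org.example.Main", "run")],
    [("org.example.Main", "run")]
    + [("org.example.Util", f"helper{i}") for i in range(6)]
    + [("org.example.Parser", "reset")],
]

rows = join_methods(coverage_by_method(JACOCO_XML), frame_positions(traces))
```

Recording the smallest frame index per method is one way to express "appeared at least once" in the first, top-6, or top-10 frames; methods that occur only in stack traces or only in the coverage report drop out of the join.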
Following the methods we described in Section II, we matched 14,902 crash methods with their test coverage details. In all, about 43% of the methods were unit tested. As can be seen in Figure 2 (left), the number of unit-tested methods (8,553) involved in crashes was larger than the number of those that were not unit tested (6,349). In terms of ratios, 12% of the unit-tested methods and 4% of the non-unit-tested methods were associated with crashes.
We also examined test coverage of crashing methods in terms of lines. In the preceding paragraph, the term unit-tested methods refers to those that had a nonzero number of test-covered lines. However, the level of test coverage can also matter. Overall, in terms of line coverage, we found that about 34% of the system’s lines were covered by unit tests. Focusing on the lines of methods involved in crashes, we see in Figure 2 (right) that, again, the lines that were unit tested (113,737) outnumber those that were not (96,111).
Drilling further into the association between code coverage and crashes, we examined the relationship between the methods’ test code coverage and the percentage of them involved in crashes. As can be seen in Figure 2, among methods that were covered by tests, the percentage involved in crashes rises as test coverage increases.
Finally, we examined how our results are affected by the method we chose to identify which methods are involved in crashes. By associating too many (potentially untested) methods with a crash, we might be falsely implicating them in such crashes. The position of faulty methods in a stack trace was examined by Schröter and his colleagues [15]. They studied 2,321 bugs from the Eclipse project, and examined where defects were located in stack traces as defined by the corresponding fix. Their research showed that 40% of bugs were fixed in the very first frame, 80% of bugs were fixed within the top-6 stack frames, and 90% of bugs were fixed within the top-10 stack frames. We correspondingly matched the methods appearing in stack traces into three groups: those that have appeared at least once in the very first, in the top-6, and in the top-10 stack frames. As can be seen in Figure 3, the higher number of unit-tested methods associated with crashes, compared to those that are not unit tested, persists even when narrowing down the method’s identification to the stack’s topmost method.
In isolation and at first glance, the results we obtained are startling. It seems that unit tested code is not significantly less likely to be involved in crashes. If anything, more crashes appear in unit tested methods than in methods that are not unit tested, and, furthermore, as the code coverage of a method’s lines increases so does the likelihood of the method being associated with a crash.
In an attempt to understand these results we need to appreciate that not all methods are unit tested and not all methods are unit tested with the same thoroughness. Figure 1 shows that fewer than half of the methods and lines are unit tested. Furthermore, Figure 4 shows that code coverage within a method’s body also varies a lot. This may mean that developers selectively apply unit testing mostly in areas of the code where they believe it is required.
Consequently, an explanation for our results can be that unit tests are preferentially added in complex and fault-prone code in order to weed out implementation bugs. Due to its complexity, such code is likely to contain further undetected faults, which are in turn likely to be involved in field failures manifesting themselves as reported crashes.
This ostensible paradox is also apparent if we further analyze the code coverage of methods involved in crashes. As code coverage increases so does the percentage of methods involved in crashes (Figure 4). Again, one could argue that code for which developers invest in a high test coverage is complex and therefore fault-prone.
One may still wonder how unit-tested methods with 100% code coverage can be involved in crashes. The answer to this is that test coverage is a complex and elusive concept. Test coverage metrics involve statements, decision-to-decision paths (predicate outcomes), predicate-to-predicate outcomes, loops, dependent path pairs, multiple conditions, loop repetitions, and execution paths [10, pp. 142–145]. In contrast, JaCoCo analyses coverage at the level of instructions, lines, and branches. While this functionality is impressive by industry standards, predicate outcome coverage can catch only about 85% of revealed faults [10, p. 143]. It is therefore not surprising that failures still occur in unit-tested code.
An important factor associated with our results is that failures manifested themselves exclusively through exceptions. Given that we examined failure incidents through Java stack traces, the fault reporting mechanism is unhandled Java exceptions. By the definition of an unhandled exception stack trace, all methods appearing in our data set passed an exception through them without handling it internally. This is important, for two reasons. First, unit tests rarely examine a method’s exception processing; they typically do so only when the method under test is explicitly raising or handling exceptions. Second, most test coverage analysis tools fail to report coverage of exception handling, which offers an additional, inconspicuous, branching path.
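The gap between line coverage and exceptional paths can be illustrated with a small, hypothetical example. It is written in Python as an analogy to the Java setting of the study; the function and test names are invented for the illustration. A single passing test executes every line of the function, yet a different input still raises an unhandled exception.

```python
def mean_length(words):
    """Average word length; both lines below run for any non-empty input."""
    total = sum(len(word) for word in words)
    return total / len(words)


def test_mean_length():
    # This one test executes every line of mean_length (full line coverage).
    assert mean_length(["ab", "abcd"]) == 3.0


test_mean_length()

# The exceptional path is an implicit branch that line coverage does not show:
# an empty input makes the division fail although all lines were "covered".
try:
    mean_length([])
    crashed = False
except ZeroDivisionError:
    crashed = True
```

A coverage report for this test would mark the function as fully covered by lines, which mirrors how a fully covered method can still appear in a crash stack trace.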
It would be imprudent to use our findings as an excuse to avoid unit testing. Instead, practitioners should note that unit testing on its own is not enough to guarantee a high level of software reliability. In addition, tool builders can improve test coverage analysis systems to examine and report exception handling. Finally, researchers can further build on our results to recommend efficient testing methods that can catch the faults that appeared in unit tested code and test coverage analysis processes to pinpoint corresponding risks.
V Threats to Validity
Regarding external validity, the generalizability of our findings is threatened by our choice of the analyzed project. Although Eclipse is a very large and sophisticated project, serving many different application areas, we cannot claim that our choice represents adequately all software development. For example, our findings may not be applicable to small software projects, projects in other application domains, software written in other programming languages, or multi-language projects. Finally, we cannot exclude the possibility that the selection of a specific Eclipse product and release may have biased our results.
Regarding internal validity we see four potential problems. First, by looking at the unit testing of all methods appearing in a stack trace, we may be classifying too many methods as potentially faulty. (On the other hand, not handling an exception and allowing it to crash the system is by itself a fault.) Second, employing JaCoCo on an old release, which may have some deprecated code and archived repositories, caused some unit test failures, resulting in a lower code coverage. Third, we excluded from the JaCoCo report non-Java code that is processor architecture specific (e.g., the org.eclipse.core.filesystem.linux.x86 bundle). Fourth, noise in the form of meaningless stack frames appearing in our stack trace dataset may have biased the results.
VI Related Work
Among past studies researching the relationship between unit test coverage and software defects, the most related to our work are the ones that examine actual software faults. Surprisingly, these studies do not reach a widespread agreement when it comes to the relationship between the two. More specifically, existing findings diverge regarding the hypothesis that a high test coverage leads to fewer defects. Mockus et al. [13], who studied two different industrial software products, agreed with the hypothesis and concluded that code coverage has a negative correlation with the number of defects. On the other hand, Antinyan et al. [1], who investigated an industrial software product, found a negligible decrease in defects when coverage increases and concluded that unit test coverage is not a useful metric for test effectiveness. Furthermore, in a study of seven Java open source projects, Petrić et al. found that the majority of methods with defects had not been covered by unit tests [14], deducing that the absence of unit tests is risky and can lead to failures. On the other hand, Kochhar et al., in another study of one hundred Java projects, did not find a significant correlation between code coverage and defects [11].
The above-mentioned studies cover only fixed faults. In our research, we work with stack traces, which enable us to analyze field-reported failures associated with crashes. The associated faults include those that have not been fixed, but exclude other faults that are not associated with crashes, such as divergence from the expected functionality or program freezes. Furthermore, the crash reports do not reveal which method was the faulty one associated with the crash. However, by placing our matched crash methods in three groups according to their respective position in the stack trace (in the very first stack frame, within the top-6, and within the top-10 stack frames) we could obtain useful bounds backed by empirical evidence [15] regarding the coverage of methods that were likely to be defective.
Software testing contributes to code quality assurance and helps developers detect and correct program defects and prevent failures. Being an important and expensive software process activity, it has to be efficient. In our empirical study on the Eclipse project we used the JaCoCo tool to measure the test coverage and we analyzed field failure stack traces to assess the effectiveness of testing. Our results indicate that unit testing on its own may not be a sufficient method for preventing program failures. Many methods that were fully covered were involved in crashes, which may mean that the corresponding unit tests were not sufficient for uncovering the corresponding faults. However, it is worth keeping in mind that failures manifested themselves through exceptions whose branch coverage JaCoCo does not report. Research building on ours can profitably study the faults that led to the failures we examined in order to propose how unit testing can be improved to uncover them, and how test coverage analysis can be extended to suggest these tests.
We thank Philippe Krief and Boris Baldassari for their invaluable help regarding the Eclipse incident data set. Panos Louridas provided insightful comments on an earlier version of this manuscript. This work has been partially funded by: the FASTEN project, which has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 82532; the GSRT 2016–2017 Research Support (EP-2844-01); and the Research Centre of the Athens University of Economics and Business, under the Original Scientific Publications framework 2019.
[1] V. Antinyan, J. Derehag, A. Sandberg, and M. Staron. Mythical unit test coverage. IEEE Software, 35(3):73–79, May 2018.
[2] Thomas Ball, Peter Mataga, and Mooly Sagiv. Edge profiling versus path profiling: The showdown. In Proceedings of the 25th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’98, pages 134–148, New York, NY, USA, 1998. ACM.
[3] Kent Beck. Test-Driven Development: By Example. Addison-Wesley, Boston, 2003.
[4] Kent Beck and Erich Gamma. Test infected: Programmers love writing tests. Java Report, 3(7):37–50, July 1998.
[5] A. Causevic, D. Sundmark, and S. Punnekkat. Factors limiting industrial adoption of test driven development: A systematic review. In Fourth IEEE International Conference on Software Testing, Verification and Validation, pages 337–346, March 2011.
[6] Rahul Gopinath, Carlos Jensen, and Alex Groce. Code coverage for suite evaluation by developers. In Proceedings of the 36th International Conference on Software Engineering, ICSE 2014, pages 72–82, New York, NY, USA, 2014. ACM.
[7] Marc R. Hoffmann, B. Janiczak, and E. Mandrikov. JaCoCo Java code coverage library, 2018. Available online https://www.eclemma.org/jacoco/. Current 2019-01-20.
[8] Laura Inozemtseva and Reid Holmes. Coverage is not strongly correlated with test suite effectiveness. In Proceedings of the 36th International Conference on Software Engineering, ICSE 2014, pages 435–445, New York, NY, USA, 2014. ACM.
[9] Y. Jia and M. Harman. An analysis and survey of the development of mutation testing. IEEE Transactions on Software Engineering, 37(5):649–678, September 2011.
[10] Paul C. Jorgensen. Software Testing: A Craftsman’s Approach. CRC Press, Boca Raton, FL, 2002.
[11] Pavneet Singh Kochhar, David Lo, Julia Lawall, and Nachiappan Nagappan. Code coverage and postrelease defects: A large-scale study on open source projects. IEEE Transactions on Reliability, 66(4):1213–1228, 2017.
[12] Pavneet Singh Kochhar, Ferdian Thung, and David Lo. Code coverage and test suite effectiveness: Empirical study with real bugs in large systems. In Software Analysis, Evolution and Reengineering (SANER), 2015 IEEE 22nd International Conference on, pages 560–564. IEEE, 2015.
[13] Audris Mockus, Nachiappan Nagappan, and Trung T Dinh-Trong. Test coverage and post-verification defects: A multiple case study. In Empirical Software Engineering and Measurement, 2009. ESEM 2009. 3rd International Symposium on, pages 291–301. IEEE, 2009.
[14] Jean Petrić, Tracy Hall, and David Bowes. How effectively is defective code actually tested?: An analysis of JUnit tests in seven open source systems. In Proceedings of the 14th International Conference on Predictive Models and Data Analytics in Software Engineering, pages 42–51. ACM, 2018.
[15] Adrian Schröter, Nicolas Bettenburg, and Rahul Premraj. Do stack traces help developers fix bugs? In 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010), pages 118–121. IEEE, 2010.
[16] Mustafa M. Tikir and Jeffrey K. Hollingsworth. Efficient instrumentation for code coverage testing. In Proceedings of the 2002 ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA ’02, pages 86–96, New York, NY, USA, 2002. ACM.
[17] T. W. Williams, M. R. Mercer, J. P. Mucha, and R. Kapur. Code coverage, what does it mean in terms of quality? In Annual Reliability and Maintainability Symposium. 2001 Proceedings. International Symposium on Product Quality and Integrity, pages 420–424, January 2001.