Software engineering research requires real software artifacts, either to study their properties or to evaluate new techniques. Software datasets have emerged in the community as an effort to standardize and increase the reproducibility of software studies and the comparison between contributions. Each dataset focuses on one specific goal and has specific properties. For example, some datasets focus on the source code of different projects [21], others focus on software that compiles correctly [17], and others focus on specific characteristics such as having known and reproducible bugs [11, 16].
This paper presents a new dataset of software projects: Duets. Its name reflects the spirit of the library-client relationship. It consists of a collection of Java libraries, whose build can be successfully reproduced, and Java clients that use those libraries. Duets aims to simplify research that focuses on behavioral analysis of Java software. In particular, we want to encourage studies that analyze library behavior in context. The availability of a set of clients for each library supports studies about the actual usage of the library. The dataset can be used both for static analyses and for dynamic analyses by executing the tests of the clients. For that purpose, we take special care to build a dataset in which both the libraries and the clients have a passing test suite.
This new dataset supports a wide range of uses. In general, many program analyses require a list of compilable and testable software packages, which we provide with Duets. The dataset also supports more specific use cases, such as analyzing the API usage by the clients of a particular library [24], or debloating based on dynamic analysis [25, 2]. We provide a framework with the dataset. It can be used, for example, to detect projects that have flaky builds or to study the reasons for the flakiness of the build [22]. In addition, the reproducibility of the project builds provides a sound basis for empirical studies on how pom.xml files (Maven build configuration files) are engineered [31, 18, 26].
We design Duets to contain a large diversity of projects and focus on the reproducibility of the build. We select only single-module Maven projects to simplify reproduction. Multi-module projects tend to increase the complexity and the fragility of the build and make it harder to analyze, study, and instrument. We also build each project three times, as an effort to ensure the reproducibility of the build. The test suite of each project must contain only passing tests.
To summarize, the contributions of this paper are the following:
Duets, a dataset of libraries and clients. Both the libraries and the clients build successfully with Maven, i.e., all the tests pass and a compiled artifact is produced as a result of the build.
A framework to generate the dataset. It includes scripts for the collection of data and the execution of the tests in a sandboxed environment.
Raw data files from the dataset generation that can be reused for further research, e.g., pom.xml files, Travis CI configuration files, and Dockerfiles.
II Data Collection Methodology
In this section, we describe the methodology that we follow to construct this dataset of open-source Maven Java projects extracted from GitHub. Duets is composed of two parts: a set of libraries, i.e., Java projects that are declared as a dependency by other Java projects, and a set of clients, i.e., Java projects that use the libraries from the first set. The construction of this dataset is performed in five steps. The process is illustrated in Figure 1 and detailed in the following sections.
II-1 Collection of Java projects from GitHub
First, we use the list of Java projects extracted from GitHub by Loriot et al. [14]. The authors queried the GitHub API on June 9th, 2020, to find all the projects that use Java as the primary programming language. The projects found were subsequently filtered, discarding those that have fewer than a minimum number of stars. This initial dataset includes 147,991 Java projects.
II-2 Identification of single-module projects
Second, we select the subset of single-module projects among the Java projects. We choose single-module projects to have a clear mapping between client and library and to have more reliable build reproduction.
We download the complete list of files for each project using the GitHub API. We consider that a project is a single-module project when it has a single Maven build configuration file, i.e., pom.xml. We exclude the pom.xml files that are in resource folders and test folders. At the end of this step, we keep 34,560 of the 147,991 projects (23.4%) as single-module Maven projects. The list of all the files is also part of our dataset and is available in the repository of the dataset.
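The single-module check described above can be sketched as follows. This is a minimal illustration, not the scripts shipped with Duets: the function names and the set of directory names treated as resource or test folders (`EXCLUDED_DIRS`) are assumptions made for the example.

```python
from pathlib import PurePosixPath

# Assumption: directory names that mark a resource or test folder.
EXCLUDED_DIRS = {"resources", "test", "tests"}


def build_files(paths):
    """Return the pom.xml paths that are not inside resource or test folders."""
    kept = []
    for raw in paths:
        path = PurePosixPath(raw)
        if path.name != "pom.xml":
            continue
        # Skip build files located under an excluded directory.
        if EXCLUDED_DIRS.intersection(path.parts[:-1]):
            continue
        kept.append(raw)
    return kept


def is_single_module(paths):
    """A project is single-module when exactly one pom.xml remains."""
    return len(build_files(paths)) == 1
```

For instance, a repository listing `pom.xml` and `src/test/resources/pom.xml` would count as single-module under these assumptions, while one listing `pom.xml` and `core/pom.xml` would not.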
II-3 Identification of libraries and clients
In the third step, we analyze each pom.xml from the projects. During the analysis, we first extract the groupId and artifactId qualifiers of each project. This pair of ids is used by Maven to identify a project. In the case where two projects declare the same pair of groupId and artifactId, we select the project that has the largest number of stars on GitHub. Second, we map the groupId and artifactId to the dependencies declared in the pom.xml. At the end of this step, we obtain a list of projects that are used as dependencies, i.e., libraries, and a list of clients that use the libraries. During this step, we ignore the projects that do not declare JUnit as a testing framework, and we exclude the projects that do not declare a fixed release, e.g., LAST-RELEASE, SNAPSHOT. After this third step, we identify 155 (0.4%) libraries and 25,557 (73.9%) clients that use 2,103 versions of the libraries.
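A rough sketch of this mapping step is given below, using only the Python standard library. It is an illustration under stated assumptions rather than the Duets implementation: the input keys `repo`, `stars`, and `pom` are hypothetical, the fixed-release check is one possible reading of the rule, and the JUnit filter and coordinates inherited from a parent POM are deliberately left out.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://maven.apache.org/POM/4.0.0"}  # Maven POM XML namespace


def coordinates(pom_xml):
    """Return the (groupId, artifactId) pair declared at the top of a pom.xml."""
    root = ET.fromstring(pom_xml)
    return (root.findtext("m:groupId", namespaces=NS),
            root.findtext("m:artifactId", namespaces=NS))


def dependencies(pom_xml):
    """Yield (groupId, artifactId, version) for each declared dependency."""
    root = ET.fromstring(pom_xml)
    for dep in root.findall("m:dependencies/m:dependency", NS):
        yield (dep.findtext("m:groupId", namespaces=NS),
               dep.findtext("m:artifactId", namespaces=NS),
               dep.findtext("m:version", namespaces=NS))


def is_fixed_release(version):
    # Assumption: missing, SNAPSHOT, LATEST, or RELEASE versions are not fixed.
    return (bool(version) and "SNAPSHOT" not in version
            and version not in {"LATEST", "RELEASE"})


def link_clients(projects):
    """Map each library repository to {version: [client repositories]}."""
    owner = {}
    # Ascending star order: the most-starred project overwrites duplicates.
    for project in sorted(projects, key=lambda p: p["stars"]):
        owner[coordinates(project["pom"])] = project["repo"]
    usage = {}
    for project in projects:
        for group, artifact, version in dependencies(project["pom"]):
            library = owner.get((group, artifact))
            if library is None or library == project["repo"]:
                continue  # dependency is not in the project set, or is itself
            if not is_fixed_release(version):
                continue
            usage.setdefault(library, {}).setdefault(version, []).append(project["repo"])
    return usage
```

Keying the result by library and then by version mirrors the way the dataset later treats each library version as its own unit.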
II-4 Identification of commits for each library release
The purpose of the fourth step is to identify the commit SHA that determines each version of the library, i.e., the commit that changes the pom.xml to assign a new version. For example, one version of the library commons-net is defined in the commit SHA 74a2282b7e4c6905581f4f1b5a2ec412310cd5e7. To perform this task, we download all revisions of the Maven build configuration files since their creation. Then, we analyze the Maven build configuration files and identify which commit declares a specific release of the library. We successfully identify the commit for 1,026/2,103 (48.8%) versions of 143/155 (92.3%) libraries. 16,964/25,557 (66.4%) clients have been mapped to a specific commit of one of their dependencies.
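One way to picture this step: walk the revisions of a pom.xml in chronological order and record the first commit in which each release version appears. The sketch below assumes the revisions are already available as `(sha, pom_text)` pairs, and it skips SNAPSHOT versions; both choices are assumptions made for the example, not a description of the Duets scripts.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://maven.apache.org/POM/4.0.0"}  # Maven POM XML namespace


def release_commits(revisions):
    """Map each release version to the first commit whose pom.xml declares it.

    `revisions` is a chronologically ordered list of (sha, pom_text) pairs.
    """
    commits = {}
    for sha, pom_xml in revisions:
        version = ET.fromstring(pom_xml).findtext("m:version", namespaces=NS)
        # Assumption: development (SNAPSHOT) versions are not releases.
        if not version or "SNAPSHOT" in version:
            continue
        # Keep only the earliest commit for a given version.
        commits.setdefault(version, sha)
    return commits
```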
II-5 Execution of the tests
As the fifth and last step, we execute the test suite of all library versions and all clients three times, as a sanity check to filter out libraries with flaky tests or projects that cannot be built. We keep the libraries and clients that have at least one test and have all tests passing: 94/143 (65.7%) libraries, 395/1,026 (38.5%) library versions, and 2,874/16,964 (16.9%) clients passed this verification. From this point on, we consider each library version as a unique library for the sake of clarity.
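The filtering rule can be illustrated with a small classifier over the three runs. The representation of a run as a `(tests_run, tests_failed)` pair, with `None` standing for a build that did not complete, and the category names are assumptions made for this sketch.

```python
def outcome(run):
    """Classify a single run; `run` is (tests_run, tests_failed) or None."""
    if run is None:
        return "unbuildable"
    tests_run, tests_failed = run
    if tests_failed > 0:
        return "failing"
    if tests_run == 0:
        return "no-tests"
    return "passing"


def classify(runs):
    """Combine repeated runs: disagreement between runs is reported as flaky."""
    outcomes = {outcome(run) for run in runs}
    return outcomes.pop() if len(outcomes) == 1 else "flaky"


def keep(runs):
    """A project is kept only if every run has at least one test and none fail."""
    return classify(runs) == "passing"
```

Under this reading, a project with runs `[(10, 0), (10, 1), (10, 0)]` is reported as flaky and dropped, while `[(10, 0)] * 3` is kept.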
III Description of the dataset
Table I: Descriptive statistics of the libraries and clients in Duets. Columns: Min, 1st Qu., Mean, 3rd Qu., Max, Avg., Total. Rows (for libraries and for clients): # Lines of code, # Years of activity (Total: N.A.).
Table I summarizes the descriptive statistics of the dataset. The number of lines of code (#LOC) and the coverage are computed with JaCoCo. Duets includes 94 different libraries, with a total of 395 versions, as well as 2,874 clients. Those libraries and clients are maintained by different GitHub organizations. The libraries include test cases that cover of the LOC. The libraries have a median maintenance time of years from commits created by contributors. The clients have test cases that cover of the LOC. The clients have a median maintenance time of years from commits created by contributors. The dataset and the scripts to generate the dataset are publicly available in our experiment repository: https://github.com/castor-software/Duets.
III-A Dataset format
Duets is available on GitHub and is composed of a JSON file (https://github.com/castor-software/Duets/tree/master/dataset/dataset-info.json). An excerpt of this JSON file is presented in the README of our repository. It contains the repository name, the SHA of the commit, the groupId, the artifactId, the list of clients for each version of the library, and a list of commits that define the different releases of the library. Duets also contains the logs corresponding to the test execution for each version of the libraries and for each client. As previously mentioned, we executed the tests three times to increase the likelihood that the dataset is reproducible.
In addition to the dataset itself, we include all scripts that generate the dataset. Those scripts can be used to reproduce the same dataset, to create a similar dataset for a different language, or reused to mine GitHub.
III-B Execution framework
In addition to the JSON file, we provide a Docker image that contains our execution framework. This framework adds an abstraction on top of Git repositories. It automates the cloning, checkout, execution of the tests, and parsing of the test results, without requiring any information other than the URL of the repository. All those tasks are reduced to the following command line: docker run --rm -v `pwd`:/results castorsoftware/duets:latest compile --repository https://github.com/radsz/jacop --commit 8f09fd977a. This framework can be extended to perform additional tasks: we provide an API that makes it possible to execute the main Maven tasks and to manipulate the pom.xml files easily (e.g., to add or remove plugins). We extended the framework to perform static analysis and to collect the test suite coverage. Those examples are provided in the Duets repository as guidelines.
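When the framework is driven over many repositories, the command line above can be assembled programmatically. The helper below is a hypothetical convenience wrapper: it only rebuilds the argument list quoted in the text (image name, `compile` subcommand, `--repository`, `--commit`, and the `/results` volume) and does not assume any other option of the image.

```python
import os
import shlex


def duets_compile_command(repository, commit, results_dir=None):
    """Build the argument list for the `docker run` invocation quoted above."""
    # Results are written to the mounted host directory (defaults to the cwd).
    host_dir = os.path.abspath(results_dir or os.getcwd())
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:/results",
        "castorsoftware/duets:latest",
        "compile",
        "--repository", repository,
        "--commit", commit,
    ]


if __name__ == "__main__":
    command = duets_compile_command("https://github.com/radsz/jacop", "8f09fd977a")
    # Print the command instead of running it; pass `command` to
    # subprocess.run(command, check=True) to execute it with Docker installed.
    print(shlex.join(command))
```

Passing the arguments as a list avoids shell quoting issues when repository URLs or result paths contain unusual characters.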
IV Dataset Usage
IV-A Pairs of libraries and clients
The clients in Duets can be used to identify API usage patterns between different clients [24], or to explore how the API evolution of the libraries affects their clients [9]. For example, Figure 2 shows a weighted bipartite graph of the relationship between clients (on the left) and the packages belonging to one library, Apache Commons Codec (on the right). The width of the edges between a client and a package represents the number of classes in the package that are executed (either directly called or invoked internally in the library) when running the clients' tests. This type of figure makes it possible to visualize the parts of the library that are most used by its clients, which is valuable information for both the library users and the library maintainers.
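The edge weights of such a bipartite graph can be derived from per-client execution data. The sketch below assumes that, for each client, the set of fully qualified class names executed during its tests is already available (for example from a coverage tool); the input format and function names are hypothetical.

```python
from collections import Counter


def package_of(class_name):
    """Return the package part of a fully qualified class name."""
    return class_name.rpartition(".")[0]


def edge_weights(executed, library_prefix):
    """Count, per (client, package) pair, the distinct library classes executed.

    `executed` maps a client name to the class names run by its tests;
    only classes under `library_prefix` are attributed to the library.
    """
    weights = Counter()
    for client, classes in executed.items():
        for class_name in set(classes):
            if class_name.startswith(library_prefix + "."):
                weights[(client, package_of(class_name))] += 1
    return dict(weights)
```

Each resulting entry corresponds to one edge of the graph, and its value to the edge width.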
The test suites of the clients can also be used as further validation of modifications performed on the library. This is the type of usage that we leverage in our recent work [25], where we debloat libraries and verify that the compilation and the execution of their clients' tests are not affected. Hence, having information regarding library usage by clients is useful for validating program transformations, since it provides dynamic data that helps to overcome the limitations of static analysis in Java.
Another potential use of the clients of a library is to use the client tests to generate tests for the library or to verify that changes in the library do not break the clients. This idea is developed in a recent study performed by Chen and colleagues [3].
Duets makes it possible to compare the characteristics of libraries with respect to other software artifacts. For example, during our data collection, we observed that libraries have many more tests and a higher test coverage than the clients (see Table I). A detailed analysis of the test part of the dataset could highlight differences between libraries and other types of applications.
Finally, Duets can be used to compare the coverage of the library by its own test suite and by the test suites of the clients, to identify the intersection and difference between the two, similar to the work of Wang et al. [30] that checks the similarity between test and production behavior.
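As an illustration of this last comparison, the covered elements of a library (for example class or method identifiers reported by a coverage tool) can be treated as sets. The function below is a sketch under that assumption; the input format is hypothetical.

```python
def compare_coverage(library_covered, client_covered):
    """Split covered library elements by which test suites reach them."""
    library_covered = set(library_covered)
    client_covered = set(client_covered)
    return {
        "both": library_covered & client_covered,
        "library_only": library_covered - client_covered,
        "client_only": client_covered - library_covered,
    }
```

Elements in `client_only` are parts of the library exercised by clients but not by the library's own tests, which is where client tests add the most validation value.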
IV-B Buildable and testable Java projects
If the relations between the clients and the libraries are not required for a specific evaluation, Duets can be used as a list of projects that successfully build and have a passing test suite. This can be used as a dataset for dynamic analysis, such as identifying API usage based on the client or library test suites. This dataset contains a large diversity of projects, large and small, from different fields.
IV-C Build results of the projects
During the creation of Duets, we verified that the project builds are reproducible and have only passing tests (see Section II-5). We saved the test results of those executions. This data can be used to identify projects with failing tests, flaky tests, or flaky builds. Identifying flaky builds is a hard task. This data could have simplified the work of studies like [15, 8, 5]. Table II presents some metrics of our reproduction attempts. We identify 221 (3.0%) projects that have flaky tests and 1,009 (13.8%) projects with at least one failing test case.
|# Reproduction attempts||7,293|
|# Buildable projects||3,642 (49.9%)|
|# Unbuildable projects||3,418 (46.9%)|
|# Failing-test builds||1,009 (13.8%)|
|# Flaky builds||221 (3.0%)|
|# Timeout||12 (0.2%)|
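The percentages in Table II are each count divided by the 7,293 reproduction attempts, rounded to one decimal. A short check of that arithmetic, using the counts given in the table (the dictionary keys are just labels chosen for the example):

```python
ATTEMPTS = 7293  # denominator shared by all rows of Table II

counts = {
    "buildable": 3642,
    "unbuildable": 3418,
    "failing-test builds": 1009,
    "flaky builds": 221,
    "timeout": 12,
}


def share(count, total=ATTEMPTS):
    """Percentage of `total`, rounded to one decimal place."""
    return round(count / total * 100, 1)


if __name__ == "__main__":
    for name, count in counts.items():
        print(f"{name}: {count} ({share(count)}%)")
```

Note that the buildable, unbuildable, flaky, and timeout counts add up to 7,293, whereas the failing-test builds do not form a separate category in that sum.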
IV-D List of files of Java projects
We downloaded the complete list of files for 147,991 Java projects from GitHub. This list of files can be used to identify the usage of specific technologies such as Docker, continuous integration, or build management systems, or to investigate the adoption of development practices such as including binaries in the repositories. Table III shows the rate of occurrence of these particular files in Duets. This data could be used for studies like the one by Cito et al. [4].
|File type||Count (share of the 71,768,708 listed files)|
|# Java files||21,519,119 (30.0%)|
|# pom.xml files||363,220 (0.5%)|
|# Gradle files||229,690 (0.3%)|
|# Travis files||33,513 (<0.1%)|
|# GitHub Workflow files||33,513 (<0.1%)|
|# Dockerfiles||17,403 (<0.1%)|
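Counts of this kind can be recomputed from the raw file lists with a simple classifier over paths. The file-name patterns below (for example `.travis.yml`, `.github/workflows/*.yml`, `*.gradle`) are assumptions about how each category might be recognized; the source does not state the exact rules used for Table III.

```python
from collections import Counter
from pathlib import PurePosixPath


def file_kind(raw_path):
    """Return the category of a repository file, or None if it is not tracked."""
    path = PurePosixPath(raw_path)
    name = path.name
    if name.endswith(".java"):
        return "java"
    if name == "pom.xml":
        return "pom.xml"
    if name.endswith(".gradle") or name.endswith(".gradle.kts"):
        return "gradle"
    if name == ".travis.yml":
        return "travis"
    if path.parts[:2] == (".github", "workflows") and path.suffix in {".yml", ".yaml"}:
        return "github-workflow"
    if name == "Dockerfile":
        return "dockerfile"
    return None


def count_kinds(paths):
    """Count the files of each tracked category in a list of paths."""
    return Counter(kind for kind in map(file_kind, paths) if kind is not None)
```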
IV-E Analysis of pom.xml files
Duets contains pom.xml files. Those files can be used to analyze the common usage of pom.xml in open-source repositories. This dataset of pom.xml files has several advantages compared to a dataset of pom.xml files created directly from Maven Central. The pom.xml files in Duets are directly associated with a Git repository, and therefore additional information is available, such as the source code, the history of the project, or issues. This provides a solid starting point to analyze the co-evolution of build files and other artifacts [19].
V Related Work
Building software is a complex task. Indeed, Kerzazi et al. [12] observe that of CI builds in an industrial web application are failing during a period of months. Durieux et al. [5] show that only of the builds are passing on Travis CI. Reproducing builds is an even harder task. Sulir et al. [28] show that of the builds in their dataset are not reproducible. Almost of the build problems are related to missing dependencies, followed by compilation errors in of the cases. Gkortzis et al. [10] observe a very similar build failure rate (), with the same causes. Neitsch et al. [20] analyze the build systems of open-source multi-language Ubuntu packages. They observe that of the packages cannot be built or rebuilt. They find that many build problems can be addressed, and note that build quality is rarely the subject of research, even though it is an important part of maintaining and reusing software.
The software engineering research community has produced several benchmarks of Java projects that focus on the reproducibility of the build. Sulir et al. [28] attempted to build Java projects from GitHub, from which around of the builds succeeded. Martins et al. [17] presented in 2018 a dataset that follows the same idea but with compilable and compiled Java projects. DaCapo by Blackburn et al. [1] consists of a set of open-source, real-world applications with non-trivial memory loads. The difference between those datasets and Duets is that we constructed an up-to-date benchmark with recent and diverse projects; we also focus on Maven projects that have a test suite in which all tests pass.
There are several datasets of buggy Java programs that generally also come with the non-buggy version of the program, such as [11, 16, 29, 23, 6]. Datasets that only focus on source code also exist, such as Boa, a dataset of queryable Java ASTs presented by Dyer et al. [7]. Spinellis et al. [27] focus on identifying duplicated repositories on GitHub.
The closest work that focuses on studying libraries and their clients is the work of Leuenberger et al. [13]. They analyze the binaries of artifacts in Maven Central to identify API clients. In contrast, we focus on the projects' source code so that the software can be built and tested, whereas Leuenberger et al. are interested in mining the API usage of compiled projects.
The major difference between all those datasets and Duets is that we focus on pairs of libraries and clients. To our knowledge, this has never been done.
In this paper, we presented Duets, a dataset of libraries and clients extracted from open-source projects on GitHub. Duets aims to simplify studies that rely on dynamic and static analysis of libraries' usage in the Java ecosystem. In our previous work, we have used this dataset to study the impact of debloating libraries on their clients. However, Duets also provides fertile ground for other types of empirical studies, such as those that analyze the impact of API changes on library clients. Alongside the dataset itself, we provide a framework that aims to facilitate data mining. Both the dataset and the necessary tools to reproduce it are open-source and publicly available online. We also provide the raw data that we use to generate Duets, including pom.xml files and the complete file list of Java projects.
This work is partially supported by the Wallenberg AI, Autonomous Systems, and Software Program (WASP) funded by Knut and Alice Wallenberg Foundation and by the TrustFull project funded by the Swedish Foundation for Strategic Research.
[1] (2006-10) The DaCapo benchmarks: Java benchmarking development and analysis. SIGPLAN Not. 41 (10), pp. 169–190.
[2] (2020) JShrink: in-depth investigation into debloating modern Java applications. New York, NY, USA, pp. 135–146.
[3] (2020) Taming behavioral backward incompatibilities via cross-project testing and analysis. In IEEE/ACM International Conference on Software Engineering.
[4] (2017) An empirical analysis of the Docker container ecosystem on GitHub. pp. 323–333.
[5] (2020) Empirical study of restarted and flaky builds on Travis CI. In Proceedings of the 17th International Conference on Mining Software Repositories, MSR '20, New York, NY, USA, pp. 254–264.
[6] (2016) IntroClassJava: a benchmark of 297 small and buggy Java programs.
[7] (2013) Boa: a language and infrastructure for analyzing ultra-large-scale software repositories. pp. 422–431.
[8] (2019) Understanding flaky tests: the developer's perspective. New York, NY, USA, pp. 830–840.
[9] (2018) Exploring API: client co-evolution. In Proceedings of the 2nd International Workshop on API Usage and Evolution, pp. 10–13.
[10] (2020) Software reuse cuts both ways: an empirical analysis of its relationship with security vulnerabilities. Journal of Systems and Software 172, pp. 110653.
[11] (2014-07) Defects4J: a database of existing faults to enable controlled testing studies for Java programs. In ISSTA 2014, Proceedings of the 2014 International Symposium on Software Testing and Analysis, San Jose, CA, USA, pp. 437–440. Note: Tool demo.
[12] (2014) Why do automated builds break? An empirical study. pp. 41–50.
[13] (2017) KOWALSKI: collecting API clients in easy mode. In 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 653–657.
[14] (2019) Styler: learning formatting conventions to repair Checkstyle errors. arXiv preprint arXiv:1904.01754.
[15] (2014) An empirical analysis of flaky tests. New York, NY, USA, pp. 643–653.
[16] (2019) BEARS: an extensible Java bug benchmark for automatic program repair studies. In 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 468–478.
[17] (2018) 50K-C: a dataset of compilable, and compiled, Java projects. In 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR), pp. 1–5.
[18] (2012) The evolution of Java build systems. Empir. Softw. Eng. 17 (4-5), pp. 578–608.
[19] (2014) Mining co-change information to understand when build changes are necessary. In 2014 IEEE International Conference on Software Maintenance and Evolution, pp. 241–250.
[20] (2012) Build system issues in multilanguage software. pp. 140–149.
[21] (2019) The Software Heritage graph dataset: public software development under one roof. In Proceedings of the 16th International Conference on Mining Software Repositories, MSR '19, pp. 138–142.
[22] (2020) What is the vocabulary of flaky tests? In Proceedings of the 17th International Conference on Mining Software Repositories, MSR '20, New York, NY, USA, pp. 492–502.
[23] (2018) Bugs.jar: a large-scale, diverse dataset of real-world Java bugs. In Proceedings of the 15th International Conference on Mining Software Repositories, pp. 10–13.
[24] (2015) Mining multi-level API usage patterns. In 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER), pp. 23–32.
[25] (2020-08) Trace-based debloat for Java bytecode. arXiv e-prints, arXiv:2008.08401.
[26] (2020-01) A comprehensive study of bloated dependencies in the Maven ecosystem. arXiv e-prints, arXiv:2001.07808.
[27] (2020) A dataset for GitHub repository deduplication. pp. 523–527.
[28] (2016) A quantitative study of Java software buildability. pp. 17–25.
[29] (2019) BugSwarm: mining and continuously growing a dataset of reproducible failures and fixes. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pp. 339–349.
[30] (2017) Behavioral execution comparison: are tests representative of field behavior? In 2017 IEEE International Conference on Software Testing, Verification and Validation (ICST), pp. 321–332.
[31] (2018) Do the dependency conflicts in my project matter? pp. 319–330.