Maven Central is one of the most popular and widely used repositories of JVM-based artifacts. It stores a large collection of software binaries together with their corresponding metadata in a well-defined structure that records the exact version, the upload date, and the list of dependencies on other artifacts. Promoting reusability and ease of dependency management since its launch in 2004, Maven Central keeps attracting open-source projects and software vendors, and hosted millions of unique artifacts as of September 6, 2018.
Maven Central holds a trove of data that, thanks to recent advances in big data analysis techniques, can reveal valuable insights about software engineering processes, evolution, and trends. However, it is currently extremely challenging to perform analyses at the scale of the whole of Maven Central. First, dependency relationships among artifacts are not modeled explicitly and cannot be queried. To support Empirical Software Engineering studies, this data should be made available in a format that is conveniently consumed by big data processing and analysis frameworks. Second, exporting all data from Maven Central is highly time- and resource-consuming because of the huge number of artifacts.
In this work, we showcase the Maven Dependency Graph, a novel dataset that aims at letting the Software Engineering community run empirical studies on the whole of Maven Central. This open-source graph (https://zenodo.org/record/1489120) includes metadata about Maven Central artifacts, indexed by deployment date in the Gregorian calendar. The graph includes the explicit dependencies between artifacts as well as other relationships that represent the artifacts’ version precedence. Artifacts are described by the 3-tuple ‘GroupId:ArtifactId:Version’, distinguishing different versions of a given library (‘GroupId:ArtifactId’; throughout the rest of the paper, we use library to refer to the couple GroupId:ArtifactId). The dataset represents 85% of all Maven artifacts and their dependencies, as of September 6, 2018.
Our second contribution comes in the form of Maven-graph procedures. These procedures aim at facilitating queries over the large Maven Dependency Graph. This collection of procedures implements common queries, such as artifact retrieval by date or by version range, among many other features. We provide a custom Neo4j [neo4j] Docker image shipping the entire dataset, together with the procedures plugin. These procedures, as well as our Maven Miner tool that can collect a snapshot of Maven Central and store it into a graph database, are open-source and available online [maven-miner].
The Maven Dependency Graph is intended to answer high-level research questions about artifact releases, evolution, and usage trends over time. It also provides a solid basis to select relevant subsets of artifacts for assessing specific software engineering challenges. The queries over the Maven Dependency Graph can range from pattern matching techniques, e.g., ‘How often do libraries release new versions?’, to advanced big data analysis, such as ‘What are the most influential artifacts in Maven Central?’, or even predictive models using machine learning, e.g., ‘What artifacts are more likely to be adopted or overlooked by the community?’.
The Maven Dependency Graph is related to the Maven Dependency Dataset (MDD) [mdd]. This previous dataset captured a snapshot of Maven Central on July 30, 2011 and aimed at supporting large-scale research on libraries’ releases and dependencies. Since then, Maven Central has grown to host more artifacts and more dependencies. Hence, we believe that an updated dataset is valuable for the software engineering community. Yet, because of this huge growth, our dataset resolves dependencies only at the artifact level, in contrast to MDD, which also abstracts dependencies at the source-code level.
II Description of the dataset
In this section, we provide a general overview of the dataset. First, we describe the data schema. Then, we present the data retrieval methodology and tooling.
II-A Overview & Schema
We rely on a temporal graph-based representation to capture the artifact dependency graph of Maven Central. Figure 1 shows a simplified schema of the Maven Dependency Graph. Formally, the Maven Dependency Graph G is defined as a tuple G = (A, C, D, N). A is a set of nodes that model the Maven artifacts. Every artifact node has a timestamp referring to its deployment date. Each node holds a set of properties: its groupId, artifactId, version, and packaging. The property coordinates is used to identify artifact nodes uniquely. Its value comes in the form ‘group:artifact:version’. C is the set of calendar nodes, represented by dashed boxes in Figure 1. They act as a proxy for the artifacts’ release timestamp, and their main intent is to temporally index the artifacts by their release date. D is a set of dependency relationships. Every d ∈ D can be regarded as a couple (u, p), where u ∈ A and p ∈ A are respectively the user and the provider of a library. A dependency has a scope, which limits the transitivity of the dependency. D is depicted by the label DEPENDS_ON in Figure 1. Finally, N is the set of version precedence relationships, represented by the label NEXT. Every n ∈ N is described as a couple (a, a′), where a and a′ are respectively a given artifact and its next release.
∀ i, j ∈ A, coord(i) = coord(j) ⟹ i = j   (1)
∀ i ∈ A, coord(i) ≠ ∅   (2)
∀ d ∈ D, scope(d) ≠ ∅   (3)
Our schema adheres to a set of constraints, namely uniqueness and existence. Constraints (1), (2), and (3) depict a few of them. The first constraint ensures that nodes are uniquely identified by their coordinates, while the remaining constraints ensure that every resolved artifact and dependency carries some mandatory properties. Other constraints, such as the uniqueness of edges and calendar nodes, are not covered in this paper.
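The schema and its uniqueness and existence constraints can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not the Neo4j schema shipped with the dataset: the class names `Artifact` and `DependencyGraph`, the field names, and the rule that both endpoints of a dependency must already be known artifacts are hypothetical choices made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class Artifact:
    """One artifact node with the properties listed in the schema."""
    group_id: str
    artifact_id: str
    version: str
    packaging: str
    release_date: datetime

    @property
    def coordinates(self) -> str:
        # Unique identifier in the form group:artifact:version.
        return f"{self.group_id}:{self.artifact_id}:{self.version}"


@dataclass
class DependencyGraph:
    """Hypothetical in-memory stand-in for the graph (calendar nodes omitted)."""
    artifacts: dict = field(default_factory=dict)     # artifact nodes, keyed by coordinates
    depends_on: list = field(default_factory=list)    # (user, provider, scope) triples
    next_version: list = field(default_factory=list)  # (artifact, next release) pairs

    def add_artifact(self, artifact: Artifact) -> None:
        # Existence constraint: every coordinate component must be non-empty.
        if not all((artifact.group_id, artifact.artifact_id, artifact.version)):
            raise ValueError("coordinates must not be empty")
        # Uniqueness constraint: two nodes may not share the same coordinates.
        if artifact.coordinates in self.artifacts:
            raise ValueError(f"duplicate coordinates: {artifact.coordinates}")
        self.artifacts[artifact.coordinates] = artifact

    def add_dependency(self, user: str, provider: str, scope: str) -> None:
        # Existence constraint: every dependency must carry a scope.
        if not scope:
            raise ValueError("a dependency must have a scope")
        # Assumption of this sketch: both endpoints are already known artifacts.
        if user not in self.artifacts or provider not in self.artifacts:
            raise KeyError("both endpoints must be known artifacts")
        self.depends_on.append((user, provider, scope))
```

Rejecting invalid nodes and edges at insertion time mirrors what database-level constraints do: a violation surfaces when the data is written rather than when it is queried.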
II-B Descriptive statistics
The Maven Dependency Graph represents a snapshot of the Maven Central Repository from September 6, 2018. Descriptive statistics can be found in Tables I and II.
The Maven Central index contains duplicated entries; once these are discarded, 2.8M artifacts remain. We retrieved metadata and dependency information for the artifacts that could be resolved, each identified by its unique coordinates in the form GroupId:ArtifactId:Version. The missing artifacts are either deployed in another artifact repository or have a corrupted pom.xml. As shown in Table I, the resolved artifacts belong to a number of unique groups and form libraries (i.e., collections of artifacts with different versions but the same groupId and artifactId). Every library has at least 1 version; the average and maximum number of versions per library, the total number of upgrade operations, and other percentile values of the version count are provided in Tables I and II.
The edges of the Maven Dependency Graph are directed dependency relationships, counted regardless of their dependency scope, and the graph density is derived from this edge count. We call outgoing edges dependencies, while incoming edges are usages. Table II provides the minimum and maximum values of usages and dependencies, as well as the percentile values.
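The notions of dependencies, usages, and density can be made concrete on a toy edge list. The sketch below assumes the standard density of a simple directed graph, |D| / (|A| (|A| - 1)); the paper does not state which formula it uses, and the function name `degree_statistics` is hypothetical.

```python
from collections import Counter


def degree_statistics(nodes, edges):
    """Count dependencies (out-degree) and usages (in-degree) per node.

    nodes: list of artifact coordinates.
    edges: iterable of (user, provider) pairs, one per dependency.
    """
    dependencies = Counter({node: 0 for node in nodes})  # outgoing edges
    usages = Counter({node: 0 for node in nodes})        # incoming edges
    edge_count = 0
    for user, provider in edges:
        dependencies[user] += 1
        usages[provider] += 1
        edge_count += 1
    n = len(nodes)
    # Assumed definition: density of a simple directed graph.
    density = edge_count / (n * (n - 1)) if n > 1 else 0.0
    return dependencies, usages, density
```

With millions of nodes the denominator grows quadratically, so a dependency graph of this kind is expected to have a very small density even when individual artifacts have many usages.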
II-C Data Collection Methodology & Tooling
Data collection involved retrieving pom.xml files (at least one per artifact) from Maven Central, parsing them to retrieve metadata, and finally storing this information into a graph database. This is a time-consuming process that we distributed on top of a Docker Swarm cluster. Figure 2 shows the overall architecture and methodology. The process ran on a cluster of 4 identical machines running Ubuntu 18.04 LTS. Each machine has 16 GB of RAM and 4 identical CPUs (AMD A10-7700K APU with Radeon(TM) R7 Graphics, 2.105 GHz). One machine played the role of Swarm Master while the others were Swarm Slaves.
We rely on a producer-consumer pattern to distribute the computation (upper part of Figure 2). The producer is responsible for reading the Maven Central Index, wrapping artifact coordinates into messages, and publishing them in a shared messaging queue. On the other side, each consumer retrieves one artifact coordinate at a time. For each artifact, the consumer resolves the artifact’s metadata as well as its direct dependencies and stores them in a graph database. Finally, the consumer sends an acknowledgment to the broker once it has finished processing the message. In case of a consumer failure, the message broker puts the message back in the queue. Note that messages are removed from the queue only once the corresponding consumer acknowledges them. Moreover, a message is processed by only one consumer at a time. When all artifacts are resolved, a post-processing phase is responsible for creating the version chains of artifacts.
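The acknowledgment and requeue behavior described above can be sketched with a toy in-memory broker. This is an illustration only and does not use RabbitMQ or reproduce the Maven Miner code: the `Broker` class, its methods, and the `consume` function are hypothetical, and the sketch runs a single consumer in one thread.

```python
from collections import deque


class Broker:
    """Toy in-memory stand-in for a message broker with acknowledgments."""

    def __init__(self):
        self._ready = deque()   # messages waiting for a consumer
        self._unacked = {}      # delivered but not yet acknowledged
        self._next_tag = 0

    def publish(self, body):
        self._ready.append(body)

    def get(self):
        """Deliver one message to one consumer, or None if the queue is empty."""
        if not self._ready:
            return None
        self._next_tag += 1
        body = self._ready.popleft()
        self._unacked[self._next_tag] = body
        return self._next_tag, body

    def ack(self, tag):
        # Only an acknowledgment removes the message for good.
        del self._unacked[tag]

    def requeue(self, tag):
        # On consumer failure the message goes back to the queue.
        self._ready.append(self._unacked.pop(tag))

    def pending(self):
        return len(self._ready) + len(self._unacked)


def consume(broker, resolve, store):
    """Process messages one at a time; loops forever if resolve always fails."""
    while (delivery := broker.get()) is not None:
        tag, coordinates = delivery
        try:
            store[coordinates] = resolve(coordinates)
        except Exception:
            broker.requeue(tag)
        else:
            broker.ack(tag)
```

A producer would call `publish` once per coordinate read from the index. Because a message leaves the broker only on `ack`, a coordinate whose resolution fails transiently is retried instead of being lost.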
For message queuing, we use RabbitMQ [rabbitmq], a scalable and widely used message broker. As for the graph database backend, we rely on Neo4j [neo4j], one of the most popular NoSQL databases. It comes along with a powerful SQL-like graph query language, Cypher [cypher], which makes the dataset straightforward to exploit. Finally, to fetch artifacts from Maven Central and resolve their direct dependencies, we use Eclipse Aether [aether], a Java library to manage artifact repositories.
III The Maven Dependency Graph in Action
We have implemented a graph-based persistence backend for the Maven Dependency Graph. This allows interested users to exploit the dataset through the Neo4j web interface, leveraging the Cypher graph querying language [cypher]. Cypher is an open-source declarative language to specify graph queries with patterns. Multiple drivers have been implemented around Cypher, allowing its integration in other graph databases, such as SAP HANA, or distributed processing frameworks like Spark and Hadoop [cypher-usage].
To further simplify queries on the Maven Dependency Graph, we leverage Cypher procedures and functions. This mechanism supports the extension of Cypher by writing custom code, deploying it into the database, and calling it from Cypher. We have implemented a set of functions and procedures to simplify the description of queries involving version comparison and artifact selection by version range or by date. Listing 3 shows a usage example of such functions. A complete list of Maven-graph procedures can be found online [miner-proc].
In the following, we illustrate some usage examples.
The artifacts deployed in 2018
Listing 1 lists all the artifacts that have been deployed during 2018 and use ‘JUnit’, regardless of the scope.
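The listing itself is a Cypher query. The same selection logic can be illustrated in Python over a toy in-memory graph; the sketch below is not the authors' query, filters on a plain release date instead of traversing calendar nodes, and uses a hypothetical function name and data layout.

```python
from datetime import date


def deployed_in_year_using(artifacts, depends_on, year, group_id, artifact_id):
    """Return the artifacts released in `year` that depend on a given library.

    artifacts: dict mapping coordinates to a release date.
    depends_on: iterable of (user, provider, scope) triples; scope is ignored.
    """
    prefix = f"{group_id}:{artifact_id}:"
    # Every artifact that depends on any version of the library.
    users = {user for user, provider, _scope in depends_on
             if provider.startswith(prefix)}
    return sorted(c for c in users if artifacts[c].year == year)
```

A call such as `deployed_in_year_using(artifacts, depends_on, 2018, "junit", "junit")` would then return the matching coordinates in sorted order.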
Number of versions per library
This example shows how to make use of the precedence relationship (NEXT) to compute the number of versions per library. Listing 2 depicts the corresponding Cypher query. The query runs in two steps. Given a node with no incoming NEXT edges, the first step selects the longest path of the NEXT relationship and returns its length, together with the node’s groupId and artifactId. The second step simply selects nodes with neither incoming nor outgoing NEXT relationships and returns 1 (i.e., one version) together with the groupId and artifactId. The results of the two steps are aggregated using the UNION operation.
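The chain-walking idea behind this query can be sketched in Python. This is an illustration on a toy graph rather than a translation of the listing: it assumes each artifact has at most one NEXT successor and that chains are acyclic, it counts nodes (path length in edges plus one), and the function name is hypothetical.

```python
def versions_per_library(artifacts, next_edges):
    """Count versions per (groupId, artifactId) by walking NEXT chains.

    artifacts: iterable of 'group:artifact:version' coordinates.
    next_edges: iterable of (artifact, next release) pairs.
    """
    successor = dict(next_edges)
    has_predecessor = set(successor.values())
    counts = {}
    for coordinates in artifacts:
        if coordinates in has_predecessor:
            continue  # not the first release of its chain
        group_id, artifact_id = coordinates.split(":")[:2]
        edges_walked = 0
        current = coordinates
        while current in successor:
            current = successor[current]
            edges_walked += 1
        # A chain with k NEXT edges contains k + 1 versions; an isolated
        # node (k = 0) therefore counts as a single version.
        counts[(group_id, artifact_id)] = edges_walked + 1
    return counts
```

Starting only from nodes without a predecessor covers both steps of the query in a single pass, since an isolated node is simply a chain of length zero.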
Artifacts using older ‘JUnit’ versions compared to libraries they are using
The query in Listing 3 selects all pairs of nodes a and b where a depends on b, only in the ‘Test’ scope, but uses an older version of JUnit than b. We use our custom procedure to check version precedence. It takes as parameters an artifact node and a version as a String and returns true if the node’s version is strictly older than the given version. We use the label ‘junit’ instead of ‘Artifact’ to avoid checks on the groupId value and to speed up query execution, relying on label indexes.
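A strict "older than" check on versions can be sketched as follows. This is a deliberate simplification and not the procedure shipped with the dataset: it handles only dot-separated numeric versions, raises `ValueError` on qualifiers such as `-SNAPSHOT` or `-beta-1`, and the names `version_key` and `is_strictly_older` are hypothetical.

```python
from itertools import zip_longest


def version_key(version):
    """Turn a purely numeric dotted version such as '4.12' into (4, 12)."""
    return tuple(int(part) for part in version.split("."))


def is_strictly_older(version, reference):
    """Return True if `version` precedes `reference`, component by component."""
    pairs = zip_longest(version_key(version), version_key(reference), fillvalue=0)
    for left, right in pairs:
        if left != right:
            return left < right
    return False  # equal versions are not strictly older
```

Comparing integers component by component matters because a plain string comparison would rank `4.8.2` after `4.12`. Padding with zeros makes `4.12` and `4.12.0` compare as equal.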
IV Research opportunities
In this section, we discuss different types of empirical analyses that can leverage our dataset, as well as some research opportunities it can open up.
Libraries maintenance: Java libraries continuously evolve by releasing new versions with new functionality or enhanced performance. However, in some cases, clients decide not to upgrade their dependencies to newer versions. As a result, library maintainers may decide to continue maintaining parallel versions. When does this phenomenon happen? When do project maintainers decide to maintain two parallel versions? Why? Who are the clients that stick with an older version? To answer these questions, we should first identify libraries that keep on maintaining multiple versions. The Maven Dependency Graph can help to identify these projects by comparing the version precedence of artifacts and crossing it with their release dates. Subsequently, we can identify the clients that keep using older versions. Another side effect of library evolution is the growing complexity of the latest releases. When facing such issues, library maintainers may decide to decompose the library into several ones, ending the original library’s lifetime. Another interesting point could be detecting two or more artifacts merging into a single one. The Maven Dependency Graph supports the empirical inquiry of this kind of behavior.
Libraries adoption trends: wisdom of the crowd vs. hype-driven development vs. legacy. This question focuses on end-users instead of library maintainers. What are the motivations that steer their decision to use a specific library? Do they behave according to Rogers’ theory of diffusion of innovation [rogers2003diffusion]? Are there any organizational or social factors influencing these decisions? The wisdom of the crowd principle favors the collective opinion of a group of individuals over that of a single expert. It has been used as a form of crowd-sourcing in software engineering for numerous tasks [mao2017survey]. In particular, Mileva et al. [Mileva09] encourage the wisdom of the crowds as a principle to assist developers in deciding which library versions to use, thus avoiding pitfalls experienced by other developers. However, many lead developers have been warning about the doom this might bring to their products; this development anti-pattern is called Hype-Driven Development. A more recent work [gai17modeling] leverages the same principle to recommend consented library updates. Its recommendation system relies on a graph that is very similar to our dataset. To evaluate the approach, the authors constructed a graph of unique Maven artifacts. We believe that a replication with the Maven Dependency Graph, which is 3 orders of magnitude larger, would improve the quality of such recommenders.
We used Maven-miner, a set of tools and facility scripts, to collect the Maven Dependency Graph. The source code of Maven-miner is publicly available online [maven-miner]. Maven-miner runs in three setups: standalone, docker-compose mode, or docker-swarm mode. Ready-to-use Docker images and scripts are also available online. Instructions on how to use the Maven-miner scripts can be found in the wiki section of the tool’s repository. Note that using the standalone mode to resolve all the dependencies in Maven Central is discouraged, as it may take months to finish. The standalone mode is only intended to resolve a small set of artifacts.
The Maven Dependency Graph is publicly available and accessible from the tool’s repository. For ease of use, it comes in the form of Docker images shipped with all the facility procedures that simplify data exploitation. For use outside of the Neo4j ecosystem, we have also released CSV files. The entirety of the data can be found online at https://zenodo.org/record/1489120.
VI Limitations & threats to validity
Due to some technical limitations, we were not able to resolve all the information about existing artifacts in Maven Central. In particular, we do not consider artifact dependencies that are hosted outside of the Maven Central repository. For this reason, some metrics, such as library usages and dependencies, may not reflect reality. Moreover, our dataset lacks some low-level information such as excluded dependencies. Consequently, querying the dependency tree of a given artifact may result in a superset that includes conflicting dependencies.
Although the proposed schema was designed to improve query execution, very complex queries involving computationally expensive operations, such as transitive usage traversals, require significant computing power. Finally, since our collection is limited to the Maven Central repository, any findings reflect only the state of practice in that repository and should not be generalized.
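To see why transitive usage traversals are costly, consider a breadth-first sketch over a toy edge list. It is an illustration under assumptions, not a procedure from the dataset: the function name and data layout are hypothetical, and a real query would run inside the graph database.

```python
from collections import defaultdict, deque


def transitive_usages(depends_on, target):
    """Return every artifact that directly or indirectly depends on `target`.

    depends_on: iterable of (user, provider) pairs.
    """
    users_of = defaultdict(set)
    for user, provider in depends_on:
        users_of[provider].add(user)
    seen = set()
    frontier = deque([target])
    while frontier:
        current = frontier.popleft()
        for user in users_of[current]:
            # Skip already visited nodes so that cycles terminate.
            if user not in seen and user != target:
                seen.add(user)
                frontier.append(user)
    return seen
```

In the worst case the traversal visits every node and edge reachable against the direction of DEPENDS_ON, which for a widely used library can be a large share of the whole graph.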
We presented the Maven Dependency Graph, an open-source dataset that aims at enabling the Software Engineering community to conduct large-scale empirical studies on Maven Central. To ease the exploitation of this dataset, we provide a custom Neo4j Docker image shipping the entire dataset. It comes along with a very large collection of procedures implementing common graph queries and utility functions. We also introduced Maven Miner, a set of tools that enable the collection of the Maven Dependency Graph. Both the Maven Dependency Graph and Maven Miner are open-source and publicly available online.