Automated builds are an integral part of software development. Developers spend considerable amounts of time writing and maintaining scripts [McIntosh et al., 2011, 2015] that implement the build logic of their project. Such scripts may involve the compilation of source files, application testing, and the construction of software artifacts such as libraries and executables. The advent of Continuous Integration (CI), together with the complexity of modern software systems, has made two properties of automated builds prominent: efficiency and reliability [Vakilian et al., 2015, Gligoric et al., 2014, Visser et al., 2016, Hilton et al., 2016]. To save computing resources and development time [Hilton et al., 2016, Licker and Rice, 2019], build tools must be capable of coping with complex systems quickly, but without sacrificing the reliability of the final deliverables. Following this direction, new build systems have emerged providing features such as parallelism [Gligoric et al., 2014, Bazel, 2020, Coetzee et al., 2011], caching [Gradle Inc., 2020a], incrementality [Erdweg et al., 2015, Konat et al., 2018], and the lazy retrieval of project dependencies [Celik et al., 2016].
Among these features, parallelism and incrementality are at the heart of almost every modern build system. Parallel builds reduce build times by processing independent build operations on multiple CPU cores. Incrementality saves time and resources by executing only those build operations affected by a specific change in the codebase. Both features are vital for a smooth development process, as they significantly shorten feedback loops [Konat et al., 2018, Visser et al., 2016]. For example, thanks to parallelism, building huge systems, such as the Linux kernel or LLVM, which consist of millions of lines of code and thousands of source files, can complete in a few minutes.
Parallel and incremental builds though, pose threats to the reliability of the build process when they are not used with caution. Conceptually, a build is a sequence of tasks that work on some input files, and produce results (output files), potentially used by other tasks. To avoid failures and race conditions, developers must specify all dependencies in their build scripts, so that the underlying build system does not process dependent tasks in the wrong sequence or in parallel (e.g., linking before compilation is erroneous). Similarly, for correct incremental builds, developers need to enumerate all source files that a build task relies on. This ensures that after an update to a source file, all the necessary tasks are re-executed to generate the new build artifacts reflecting this change. Build scripts are susceptible to faults because declaring all task dependencies is a challenging and error-prone task [Licker and Rice, 2019, Vakilian et al., 2015, Morgenthaler et al., 2012]. Even best practices [GNU Make, 2020a], and tools [Martin and Hoffman, 2010] for managing dependencies automatically are often insufficient for preserving correctness [Licker and Rice, 2019]. Build failures, non-deterministic and inconsistent build outputs, or time-consuming builds, are inevitably the result of such faults [Licker and Rice, 2019, McIntosh et al., 2011].
There is little prior work focusing on detecting incorrect build definitions, and existing approaches [Licker and Rice, 2019, Bezemer et al., 2017] suffer from two major shortcomings (as we discuss in Section 3.1) that prevent them from being useful in practice. First, previous approaches are tailored to analyze Make-based builds [Feldman, 1979] only. Therefore, applying the technique behind existing tools to other build systems, such as Java-based build tools, is not possible. One of the main reasons is that prior work makes strong assumptions about the internal behavior of build systems that are only relevant to Make builds. Unfortunately, there are no techniques available to examine the reliability of builds originating from systems other than Make (e.g., Gradle), even though such build systems are extensively used [McIntosh et al., 2011, Hassan et al., 2017] and suffer from similar issues [Dashenkov, 2020, Greene, 2015]. Second, fault localization using prior methods requires a large amount of time, which hinders their adoption. For example, employing mkcheck [Licker and Rice, 2019] to detect build-related issues in a Make-based project consisting of a hundred files can take hours (or even days).
We propose an effective and efficient dynamic method for detecting faults in parallel and incremental builds. Our method is based on a model (buildfs) that treats a build execution stemming from an arbitrary build system as a sequence of tasks, where each task receives a set of input files, performs a number of file system operations, and finally produces a number of output files. buildfs takes into account (1) the specification (as declared in build scripts) and (2) the definition (as observed during a build through file accesses) of each build task. By combining the two elements, we formally define three different types of faults related to incremental and parallel builds that arise when a file access violates the specification of the build. Our testing approach operates as follows. First, it monitors the execution of a build script, and models this execution in buildfs. Our method then verifies the correctness of the build execution by ensuring that there is no file access that leads to any fault concerning incrementality or parallelism. Note that to uncover faults, our method only requires a single full build.
We demonstrate the applicability of our approach on build scripts written in two popular build automation systems, namely, Make and Gradle. Make is one of the most well-established build tools [McIntosh et al., 2015], while Gradle is a modern Java-based system that has become the de-facto build tool for Android and Kotlin programs [Karanpuria and Roy, 2018, Pelgrims, 2015, Derr et al., 2017]. Our approach is also applicable to other build systems such as Ninja, Bazel or Scala’s sbt. To the best of our knowledge, our approach is the first treatment of Java-oriented build executions.
Contributions. Our work makes the following contributions.
We propose buildfs, a model for specifying and verifying arbitrary build executions, which is the key for applying our fault detection approach in both traditional (e.g., Make), and modern (e.g., Gradle) build systems (Section 3).
We introduce a dynamic method that relies on buildfs, and is able to uncover issues in parallel and incremental builds by analyzing the execution of a single clean build (Section 4).
We evaluate the effectiveness and the applicability of our approach by detecting issues in 324 out of 612 Make and Gradle projects. Notably, 235 issues found in 45 open-source projects were confirmed and fixed by upstream developers. Furthermore, our approach is more effective, and orders of magnitude faster than the state-of-the-art when analyzing Make projects (Section 5).
Availability. We are planning to archive and make the source code and data used in our experiments publicly available.
We provide the basic elements of Make and Gradle. Then, we discuss the types of fault that may occur in corresponding scripts and are related to incremental and parallel builds.
2.1. Build Systems
Make. Make is the oldest build system used today [Feldman, 1979, Licker and Rice, 2019]. It provides a domain-specific language (DSL) that allows developers to write rules that instruct the system how to build certain targets. For example, the following rule states that building the target source.o, which depends on the file source.c, requires invoking the gcc command, as shown at line 2.
By default, Make builds every target incrementally, meaning that it generates targets only when they are missing or when their dependent files are more recent than the target. Make uses file timestamps to determine whether a file has changed or not. Also, it provides some built-in variables starting with the symbol “$”. The most common ones are $@ and $^, which refer to the name of the target (e.g., source.o), and the dependencies (e.g., source.c) of the current rule respectively. Developers write their Make rules in files called Makefiles. In particular, developers can either write their own Makefiles or, to increase their productivity, use higher-level tools, such as CMake [Martin and Hoffman, 2010] or GNU Autotools [Calcote, 2020], that automatically generate Makefile definitions. CMake offers its own DSL, and enables programmers to write rules which in turn are translated into Makefiles. CMake is useful for managing systems with complex structure. Autotools is a collection of tools that configure and generate Makefiles from templates.
Gradle. Although newer than other Java-based build tools, such as Ant and Maven, Gradle has gained much popularity recently. Currently, around 55% of the most popular Java GitHub projects use Gradle [Hassan et al., 2017], and it has become the preferred build tool for Kotlin and Android programs [Karanpuria and Roy, 2018, Pelgrims, 2015, Derr et al., 2017]. Gradle is at least two times faster than Maven [Gradle Inc., 2020c], as it offers features such as parallelism and a build cache.
Gradle provides a Groovy- and a Kotlin-based DSL which adopts a task-based programming model. In this sense, Gradle programmers assemble build logic in a set of tasks. A task is a fundamental component in Gradle that describes a piece of work needed to be done as part of a build. Developers can impose constraints on the execution order of tasks. Then, Gradle represents the build workflow as a directed acyclic graph and processes every task in topological order. To enable incremental builds, developers need to enumerate the files consumed and produced by each task. In this context, a task is executed only when there is a change to any of its input or output files. Gradle adopts a content-based approach to identify updates: it compares the checksum of the input / output files with that coming from the last build. Consider the following snippet:
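The timestamp policy described above can be illustrated with a minimal Python sketch. It is not Make's implementation: the function name `is_stale` and the demo file names are hypothetical, and the sketch only mirrors the rule that a target is regenerated when it is missing or when one of its declared dependencies is more recent.

```python
import os
import tempfile


def is_stale(target, dependencies):
    """Return True when `target` must be rebuilt under a timestamp policy."""
    if not os.path.exists(target):
        return True  # a missing target is always rebuilt
    target_mtime = os.path.getmtime(target)
    # Rebuild when any declared dependency is more recent than the target.
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)


def demo():
    """Exercise the three cases: missing target, fresh target, edited source."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "source.c")
        obj = os.path.join(tmp, "source.o")
        with open(src, "w") as f:
            f.write("int main(void) { return 0; }\n")
        missing = is_stale(obj, [src])

        with open(obj, "w") as f:
            f.write("placeholder for object code\n")
        os.utime(src, (1000, 1000))  # source older than target
        os.utime(obj, (2000, 2000))
        fresh = is_stale(obj, [src])

        os.utime(src, (3000, 3000))  # simulate an edit to the source
        edited = is_stale(obj, [src])
        return missing, fresh, edited
```

Only the files passed in `dependencies` are ever inspected, so a file that the target really depends on but that is not declared can change without the target being considered stale.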
The listing above demonstrates a task named extractZip written in Gradle. This task extracts the contents of an archive, namely /file.zip, into the directory /extractedZip. The input and the output files of this task are declared at lines 2 and 3 respectively. Declaring the input / output makes the task extractZip incremental. In this context, Gradle re-executes this task only when any of those files are modified. Notice that an input or an output file can be a directory (See line 3). In this case, Gradle recursively examines the contents of the directory for updates.
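As a rough illustration of a content-based check, the following Python sketch records a checksum per declared input / output and compares it with the checksums of the previous run. It is an assumption-laden sketch and not Gradle's algorithm: the helper names (`fingerprint`, `snapshot`, `is_up_to_date`), the choice of SHA-256, and the way directories are hashed are all hypothetical.

```python
import hashlib
import os
import tempfile


def fingerprint(path):
    """Hash a file's bytes, or every file under a directory (recursively)."""
    digest = hashlib.sha256()
    if os.path.isdir(path):
        for root, dirs, files in os.walk(path):
            dirs.sort()  # fix the traversal order so the hash is stable
            for name in sorted(files):
                full = os.path.join(root, name)
                digest.update(os.path.relpath(full, path).encode())
                with open(full, "rb") as f:
                    digest.update(f.read())
    elif os.path.exists(path):
        with open(path, "rb") as f:
            digest.update(f.read())
    return digest.hexdigest()


def snapshot(paths):
    """Record one checksum per declared input / output path."""
    return {p: fingerprint(p) for p in paths}


def is_up_to_date(previous, inputs, outputs):
    """The task can be skipped only if no declared file changed content."""
    return previous is not None and previous == snapshot(inputs + outputs)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        archive = os.path.join(tmp, "file.zip")  # stand-in, not a real archive
        out_dir = os.path.join(tmp, "extractedZip")
        os.mkdir(out_dir)
        with open(archive, "wb") as f:
            f.write(b"v1")
        recorded = snapshot([archive, out_dir])

        os.utime(archive, (5000, 5000))  # timestamp changes, content does not
        after_touch = is_up_to_date(recorded, [archive], [out_dir])

        with open(os.path.join(out_dir, "a.txt"), "w") as f:
            f.write("modified output")
        after_output_change = is_up_to_date(recorded, [archive], [out_dir])
        return after_touch, after_output_change
```

Unlike a timestamp policy, merely touching the archive leaves the task up to date, whereas a change inside the declared output directory invalidates it.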
Gradle provides a rich API that developers can rely on to customize their builds, or create plugins. A plugin consists of a set of common tasks that can be reused across multiple projects, e.g., consider a plugin that applies a linter to the source files of a project. Up to now, there are more than 3,600 Gradle plugins available for use [Gradle Inc., 2020d].
2.2. Faults in Incremental & Parallel Builds
Three types of faults can occur due to incorrect build definitions: missing inputs, missing outputs, and ordering violations. The first two are associated with incremental builds, while the last one concerns parallelism.
Missing Inputs. A build definition manifests a missing input issue, when a developer fails to define all input files of a particular build task. This leads to faulty incremental builds, because whenever there is an update to any of the missing input files, the dependent build task is not executed by the build system. Consequently, the build system produces stale targets and outputs.
As an example, Figure 1 shows the fragment of a Make definition taken from the cqmetrics project. This build creates the executable qmcalc by linking the object files CMetricsCalculator.o, QualityMetrics.o, and qmcalc.o (lines 3–4). Every object file is created by a built-in Make rule that compiles each implementation file .c with a command of the form $(CC) $(CXXFLAGS) -c. By default, the input file of these built-in rules is only the underlying implementation file, e.g., the input file of the rule qmcalc.o is qmcalc.c. However, an object file might also depend on a set of header files. Thus, changing a dependent header file requires the re-generation of the object file. The developers tackle this issue by compiling every object file with the -MD flag (line 1). This flag stores all header files that a target relies on into a dedicated dependency file whose suffix is .d. The developers include these dependency files in their Makefile on line 5. Although compiling source files with -MD follows the best practices for managing Make dependencies automatically [GNU Make, 2020a], the above script is faulty, because only the dependency files of the object files included in the variable $OBJS (line 2) are considered. The issue here is that when there is an update to a header file that the rule qmcalc.o depends on, the object file is not re-created. Thus, the final executable qmcalc can be linked with stale object files. We reported this issue, and the developers confirmed and fixed it.
Missing Outputs. A fault related to missing outputs is similar to that related to missing inputs. However, this time the cause of the problem is that a developer does not properly enumerate the output files of a task. As with missing inputs, this issue makes incremental builds skip the execution of some build tasks even if their outputs have changed. Note also that Gradle caches the output files of a task from previous builds, and reuses them in subsequent ones when input files remain the same. This is an important feature that makes Gradle much more efficient than other build systems [Gradle Inc., 2020c]. Hence, missing outputs also affect the performance of a build, making it run slower. Note that missing outputs do not appear in Make builds, because Make considers only the timestamp of input files to decide if a target rule must be re-executed.
Ordering Violations. Every build tool supporting parallelism runs independent build tasks non-deterministically. This means that the build system is free to process unrelated operations in any order for achieving high performance. Non-determinism does not cause any problems to the build process when two tasks are indeed independent. However, race conditions emerge when two build tasks are conflicting (e.g., one produces something consumed by the other), but are executed concurrently. Developers can introduce ordering constraints in their build definitions as a side effect of explicitly defining dependencies among conflicting build tasks. An ordering violation occurs when a developer does not specify ordering constraints between two dependent tasks. Note that an ordering violation does not relate to the incremental issues discussed above. That is, there can be a task declared with the correct input / output relations that nevertheless races with another conflicting task.
Figure 2 shows an example of an ordering violation. Here we have an excerpt of a real-world Gradle script (from the nf-tower project) whose goal is to create the fat jar of a Java application. A fat jar packages all .class files of the current project along with the .class files of project dependencies, forming the executable distribution of the project. The code first applies the built-in Gradle plugin "java" (line 1). Among other things, this plugin runs two tasks: (1) the task classes that compiles all Java files into their corresponding .class files, and (2) the task jar that generates a jar file containing only the classes of the current project. In turn, the code employs an external plugin (line 2) containing the task shadowJar (line 3) that eventually generates the fat jar of the project. The problem here is that the name of the fat jar generated by the task shadowJar conflicts with the name of the naive jar produced by the task jar. The tasks jar and shadowJar do not depend on each other, so Gradle is free to schedule jar after shadowJar. This erroneous ordering results in incorrect output, i.e., the task jar overrides the contents of the jar file produced by the task shadowJar. A fix to this problem is to create a fat jar with a different name (e.g., changing its classifier at line 5 to "-all"). The developers of nf-tower confirmed and fixed this problem.
As we will see in Section 5, such issues are widespread and affect the reliability of many software deliverables. This motivates the design of a generic approach for easing the adoption of incremental and parallel builds in practice.
3. A Model for Build Executions
Designing a technique that is able to locate faults in incremental and parallel builds regardless of the underlying build system requires a generally-applicable and precise model for reasoning about build executions. Existing models make assumptions about builds that make the testing approaches relying on them ineffective when applied to certain build tools (Section 3.1). To address this, we propose our model for understanding build executions (Section 3.2). Then, we introduce the notion of task graph (Section 3.3), a component that serves as a basis for ensuring the correctness of a build execution (Section 3.4).
Prior work on detecting faults in Make incremental builds, namely mkcheck [Licker and Rice, 2019], models build execution as a set of system processes created by the build system during execution, e.g., Make creates a new gcc process for compiling every source file. This model treats every process as a function that takes an input, and produces an output. The input stands for the set of files that are read by the process, while the output is the set of files written by it. The inputs and outputs of every process are computed by analyzing the system call trace of a build. Through this model, they infer the inter-dependencies among files by considering each output to be dependent on every input file. All dependencies are then transitively propagated using the process hierarchy. However, modeling build execution as a set of system processes is problematic for the following reasons.
Low Precision. The main assumption made by Make-based tools is that the build system always spawns a separate process when proceeding to a new build task. However, this assumption is no longer valid in modern build systems such as Gradle, Maven, or Scala’s sbt, where the same system process (e.g., JVM process) involves multiple build tasks. Tools that model builds as a sequence of processes become ineffective when applied to such build systems, as their analysis precision significantly drops.
To highlight how this feature of modern build systems affects the precision of existing work, we provide a representative example. Consider a Gradle task A that reads a file A.in and creates a file A.out, and a Gradle task B with file B.in as its input, and B.out as its output. An approach that works at the granularity of processes produces the dependency graph of Figure 3(a). The main Gradle process runs both tasks A and B; therefore, the analysis considers files A.in and B.in as the inputs of that process, and files A.out and B.out as its outputs. Conceptually, this merges the two tasks into a single task. The resulting graph is overconstrained [Licker and Rice, 2019], because the analysis over-approximates the set of dependencies. For example, when there is a change in the input file A.in, the analysis incorrectly considers that both output files (i.e., A.out and B.out) must be updated, even if A.in only affects A.out. Overconstrained graphs lead to dozens of false positives and negatives [Licker and Rice, 2019].
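The loss of precision in this example can be reproduced with a small Python sketch. The trace format (dictionaries with `pid`, `task`, `op`, and `path` keys) and the function `infer_dependencies` are hypothetical; the sketch only contrasts grouping file accesses by process with grouping them by task.

```python
from collections import defaultdict


def infer_dependencies(accesses, key):
    """Map each written file to the files read by the same unit.

    `key` selects the unit of grouping: "pid" (process) or "task".
    """
    reads, writes = defaultdict(set), defaultdict(set)
    for access in accesses:
        unit = access[key]
        bucket = reads if access["op"] == "read" else writes
        bucket[unit].add(access["path"])
    deps = {}
    for unit, outputs in writes.items():
        for out in outputs:
            # Every output of a unit is assumed to depend on all its reads.
            deps.setdefault(out, set()).update(reads[unit])
    return deps


# Tasks A and B both run inside the same (hypothetical) process 1.
TRACE = [
    {"pid": 1, "task": "A", "op": "read", "path": "A.in"},
    {"pid": 1, "task": "A", "op": "write", "path": "A.out"},
    {"pid": 1, "task": "B", "op": "read", "path": "B.in"},
    {"pid": 1, "task": "B", "op": "write", "path": "B.out"},
]

by_process = infer_dependencies(TRACE, "pid")
by_task = infer_dependencies(TRACE, "task")
```

Grouping by process makes B.out depend on A.in, which is the over-approximation discussed above; grouping by task keeps the two dependency chains apart.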
Efficiency and Applicability. Another core limitation of the existing model is that it only captures OS-level facts (e.g., file accesses, and file dependencies) which are computed while analyzing the build trace. To verify those inferred file dependencies against the specification of build scripts, prior work (i.e., mkcheck) triggers incremental builds by touching each source file, and checking whether the expected output files are re-generated in response to the updated input files. This makes the verification task extremely slow as it requires substantial resources when applying multiple incremental builds in large-scale projects [Licker and Rice, 2019] (see also Section 5.6).
A critical reader may think that combining static analysis with dynamic analysis is a workaround for this efficiency issue [Bezemer et al., 2017]. Specifically, another approach could perform static analysis on build scripts to extract task specifications, and then compare these specifications against the actual behavior of tasks observed during build execution. Nevertheless, reliably extracting task specifications from build scripts through static analysis is particularly challenging (and in many cases not possible) for multiple reasons [Licker and Rice, 2019]. First, static analysis cannot reason about tasks whose inputs / outputs are dynamically computed and are not known in build scripts. The same applies to tasks not explicitly mentioned in build scripts, e.g., tasks defined in external Gradle plugins as illustrated in Section 2.2. Second, static analysis needs to reason about the complex semantics of the build system’s DSL. This limits generalizability, as applying the approach to a new build system requires implementing a new static analyzer, which involves a lot of engineering effort. Third, even when a static analyzer is available, OS-level facts (inferred dynamically by the existing model) are not comparable with task specifications (computed statically) when the build system abstracts tasks as arbitrary functions, as in the case of Gradle or Scala’s sbt. To further clarify this, consider the following example.
In the Gradle script on the left, we have two incremental tasks (task A and B) performing some arbitrary operations. The specification of the task A says that this task is expected to consume the file /file/A, while the task B reads the files /file/A and /file/B. Note that the specification only indicates the intent of the developer, and not the actual interactions of task with the system. The latter is shown in the execution trace on the right. In this scenario, it is not possible to compare the actual behavior of tasks (inferred by analyzing the execution trace on the right) against build specification (extracted statically from the build script on the left). This is because existing dynamic analysis techniques are unable to map the file accesses shown on the right to the task they belong to. This is necessary for deciding correctness. For example, if the first access comes from task A while the remaining ones stem from task B, the build script is not faulty. On the other hand, if it is the other way around (i.e., the last two accesses belong to the first task), the task A manifests a missing input on file /file/B, as it consumes a file not mentioned in the build script.
3.2. Modeling Builds
All the points discussed in the previous section are fundamental issues associated with the method’s design and underlying model, and not with its implementation. We introduce buildfs, a model for thinking about build executions that addresses the main limitations of existing work.
The proposed model treats every build as a sequence of tasks rather than system processes. Every task corresponds to the execution of a build operation. For example, a task in buildfs stands for the execution of a target rule in Make and Ninja, a goal in Maven, or a task in Gradle. This tackles the low precision of prior work, because it enables us to relate every build task to its correct input and output files regardless of the internal behavior of the build tool (e.g., whether it spawns a separate process or not). For example, unlike the overconstrained dependency graph of Figure 3(a), buildfs allows us to infer the precise graph shown in Figure 3(b). buildfs separates file accesses based on which task they belong to. Therefore, it does not perform unnecessary merges when encountering tasks governed by the same process, which is the main source of imprecision in previous work [Licker and Rice, 2019].
For dealing with efficiency and applicability, buildfs provides each task with a specification that consists of (1) a set of files that the task is expected to consume, (2) a set of files that the task is expected to produce, and (3) a set of task dependencies. A task dependency indicates that a task depends on another, i.e., it is executed only after the task it depends on. Beyond specification, every task has a definition containing all the (low-level) file system operations performed while executing the task, e.g., reading and writing files, or changing the OS transient structures, such as the file descriptor table. Combining the (high-level) specification and the actual behavior (low-level file system operations) of each task makes our approach efficient and applicable: we can verify correctness, that is, whether the actual behavior conforms to the specification, by analyzing a single clean build, i.e., there is no need to run incremental builds or static analysis on build scripts.
Figure 4 shows the complete model for build executions. A build execution consists of a sequence of tasks. Every task is described by a unique name ($t \in \mathit{TaskName}$), and contains a specification and a definition. The specification declares the input / output files and the dependencies of each task, while its definition consists of statements. For example, task A (/file/in): /file/out after $\emptyset$ means that the task named A consumes the file /file/in, produces the file /file/out, and has no dependencies (after $\emptyset$), while its definition is given by the sequence of statements $\bar{s}$ attached to it.
A definition is one or more statements. There are two types of statements. First, the sysOp in $z$ = $o$ statement executes a system operation $o$ in a process given by $z$. Every process defines a scope for file descriptor variables (fd). File descriptor variables point to paths and are used to model the file descriptor table of Unix-like processes. An operation ($o$) executed inside a process may introduce new file descriptor variables in the current process (scope) (let fd), or delete existing ones (del). Moreover, an operation may perform various file system updates, including file creation (produce) and file consumption (consume). An expression ($e$) can be a constant path $p$, a file descriptor variable fd, or $p$ at $e$. The latter allows us to interpret the path $p$ relative to the path given by the expression $e$ (note that the result of an expression is a path). Finally, the newproc $z$ statement creates a fresh process (scope) $z$, and it optionally copies all file descriptor variables of an existing process $z'$ to $z$ (newproc $z$ from $z'$). This models process forking.
As an example of modeling, consider a simple build scenario where we want to copy the contents of the file /source into the file /target. Figure 5 shows how we can express this using Make and Gradle. When we execute these build scripts, the build system first opens the file /source, reads its contents, then opens the file /target, and finally writes the contents of /source to the file descriptor corresponding to the second file. Figure 6 illustrates how we model the execution stemming from these scripts. Every build consists of a single task named target. This task consumes /source to create /target. The definition of the task target creates a new process (line 2) and uses it to execute all the file-related operations performed when running Make and Gradle (lines 3–9). For instance, the operation let fd /source creates a new file descriptor (in the current process) pointing to file /source, while the operation at line 5 consumes this file descriptor. These operations model file opening. On the other hand, the operation del(fd) deletes the given file descriptor once the task closes the corresponding file (line 9).
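To make the copy scenario concrete, the following Python sketch evaluates a buildfs-like definition and collects the files it consumes and produces. It is a loose approximation under stated assumptions, not the semantics of Figure 4: the tuple encoding of statements, the process name `z`, the descriptor names `fd_src` and `fd_dst`, and the use of `posixpath.join` for relative paths are all hypothetical.

```python
import posixpath


def eval_expr(expr, scope):
    """Evaluate an expression to a path inside one process scope."""
    kind = expr[0]
    if kind == "path":   # constant path
        return expr[1]
    if kind == "fd":     # file descriptor variable
        return scope[expr[1]]
    if kind == "at":     # path interpreted relative to another expression
        return posixpath.join(eval_expr(expr[2], scope), expr[1])
    raise ValueError(f"unknown expression: {expr!r}")


def run_task(statements, state=None):
    """Return (new state, consumed files, produced files) for a definition."""
    state = {} if state is None else {z: dict(fds) for z, fds in state.items()}
    consumed, produced = set(), set()
    for stmt in statements:
        if stmt[0] == "newproc":
            parent = stmt[2] if len(stmt) > 2 else None
            # A forked process starts with a copy of its parent's descriptors.
            state[stmt[1]] = dict(state[parent]) if parent is not None else {}
        elif stmt[0] == "sysop":
            _, proc, op = stmt
            scope = state[proc]
            if op[0] == "let":
                scope[op[1]] = eval_expr(op[2], scope)
            elif op[0] == "del":
                del scope[op[1]]
            elif op[0] == "consume":
                consumed.add(eval_expr(op[1], scope))
            elif op[0] == "produce":
                produced.add(eval_expr(op[1], scope))
            else:
                raise ValueError(f"unknown operation: {op!r}")
        else:
            raise ValueError(f"unknown statement: {stmt!r}")
    return state, consumed, produced


# Copying /source into /target, in the spirit of the scenario above.
COPY_TASK = [
    ("newproc", "z"),
    ("sysop", "z", ("let", "fd_src", ("path", "/source"))),
    ("sysop", "z", ("consume", ("fd", "fd_src"))),
    ("sysop", "z", ("let", "fd_dst", ("path", "/target"))),
    ("sysop", "z", ("produce", ("fd", "fd_dst"))),
    ("sysop", "z", ("del", "fd_src")),
    ("sysop", "z", ("del", "fd_dst")),
]

final_state, consumed, produced = run_task(COPY_TASK)
```

The resulting consumed and produced sets are the low-level facts that can later be compared against the task's declared inputs and outputs.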
The semantics of buildfs are also shown in Figure 4. Every task is evaluated on a state . The state gives all file descriptor variables defined in every process. The result of a task evaluation is a new state, and the set of files consumed and produced by this task, i.e., . Note that the projection gives the set of files consumed by the task, while is the set of produced files. Statements, operations, and expressions are evaluated accordingly. Notably, operations and expressions are evaluated inside the scope (process) where they take place, i.e., . As an example, after evaluating the following buildfs task on the state , we get a new state , while the set of consumed files is , and the set of produced files is .
3.3. Task Graph
Now, we introduce the notion of task graph. The task graph is a component that stores the input files, output files, and the dependencies of every task as declared by the developers in build scripts. The task graph is computed by traversing the specification of every task found in a buildfs program (Section 3.2), and collecting all input/output files and task dependencies. We later use the task graph for ensuring the correctness of a build execution (Section 3.4).
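We define the task graph as $G = (V, E)$. A node $n \in V$ in the task graph is either a task $t$ or a file $p$. The set of edges $E \subseteq V \times V \times L$, where $L = \{\mathit{in}, \mathit{out}, \mathit{before}\}$ is a set of edge labels, determines the following relationships. Given a task graph $G$, the edge $p \xrightarrow{\mathit{in}} t$ indicates that the file $p$ has been declared as an input of the task $t$. The edge $t \xrightarrow{\mathit{out}} p$ states that the task $t$ produces the file $p$. Finally, the edge $t_1 \xrightarrow{\mathit{before}} t_2$ shows a task dependency, i.e., the execution of $t_1$ precedes that of $t_2$.
Given a build execution modeled as a buildfs program (Figure 4), we gradually compute the task graph by inspecting the specification of every task entry. The edges $n_1 \xrightarrow{l} n_2$, where the label $l$ is one of $\mathit{in}$, $\mathit{out}$, or $\mathit{before}$, are constructed by examining the header part of a task construct. For instance, whenever we encounter a task entry of the form task $t$ ($p_1$): $p_2$ after $t'$, we add the following edges to the task graph $G$: (1) an edge $p_1 \xrightarrow{\mathit{in}} t$, (2) an edge $t \xrightarrow{\mathit{out}} p_2$, and (3) an edge $t' \xrightarrow{\mathit{before}} t$.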
A complete example is shown in Figure 7, where we have a build execution in buildfs on the left, and its resulting task graph on the right. Red nodes denote tasks while blue nodes indicate files.
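The construction of the task graph from task headers can be sketched in a few lines of Python. The representation is an assumption made for illustration: each specification is a dictionary with hypothetical keys `name`, `inputs`, `outputs`, and `after`, and each edge is a `(source, label, destination)` triple whose labels `in`, `out`, and `before` are chosen here for readability.

```python
def build_task_graph(specs):
    """Collect labeled edges from the header (specification) of every task."""
    edges = set()
    for spec in specs:
        name = spec["name"]
        for path in spec["inputs"]:
            edges.add((path, "in", name))      # file declared as task input
        for path in spec["outputs"]:
            edges.add((name, "out", path))     # file declared as task output
        for dependency in spec["after"]:
            edges.add((dependency, "before", name))  # dependency runs first
    return edges


# Hypothetical two-task build: `link` is declared to run after `compile`.
SPECS = [
    {"name": "compile", "inputs": ["/src/main.c"],
     "outputs": ["/build/main.o"], "after": []},
    {"name": "link", "inputs": ["/build/main.o"],
     "outputs": ["/build/app"], "after": ["compile"]},
]

GRAPH = build_task_graph(SPECS)
```

Only the declared specification is consulted here; the file accesses observed during the build play no role in the task graph.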
3.4. Correctness of Build Executions
Having proposed our model for build executions and the concept of task graph, we now formalize the property of correctness for build executions. To do so, we exploit the task graph and define the subsumption and happens-before relations that we use as a base for verifying correctness.
Definition 3.1 (Subsumption). Given a task graph $G$, we define the reflexive, binary relation $\sqsubseteq$ and its transitive closure $\sqsubseteq^{+}$ on two paths $p_1, p_2$. The definition is shown in Figure 8.
The subsumption relation $p_1 \sqsubseteq p_2$ says that the path $p_1$ is subsumed within the path $p_2$. This relation is reflexive ([self]), i.e., for every path $p$, we have $p \sqsubseteq p$. The relation $p_1 \sqsubseteq p_2$ also holds when $p_2$ is the parent directory of $p_1$ ([par-dir]), or when $p_2$ relies on $p_1$, i.e., there is at least one task in the task graph that produces $p_2$ using $p_1$ ([indirect]). As we will see later, the subsumption relation is important for ensuring that a file access made while executing a build task matches the task’s specification.
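One way to read the three rules is as a reachability question over paths, which the following Python sketch implements. It is an interpretation under assumptions rather than the definition of Figure 8: the task graph is encoded as a set of `(source, label, destination)` triples with hypothetical labels `in` and `out`, and the transitive closure is computed by a simple graph search.

```python
import posixpath


def subsumed(edges, p1, p2):
    """True when p1 is subsumed within p2 through a chain of rule steps."""
    seen, frontier = {p1}, [p1]
    while frontier:
        path = frontier.pop()
        if path == p2:                # [self], or the end of a longer chain
            return True
        successors = set()
        parent = posixpath.dirname(path)
        if parent and parent != path:
            successors.add(parent)    # [par-dir]
        for src, label, task in edges:
            if label == "in" and src == path:
                # [indirect]: outputs of a task that declares `path` as input
                successors.update(
                    dst for t, l, dst in edges if l == "out" and t == task
                )
        for nxt in successors - seen:
            seen.add(nxt)
            frontier.append(nxt)
    return False


# Hypothetical graph: task t1 turns /f1 into /f2, and t2 consumes /f2.
EDGES = {
    ("/f1", "in", "t1"),
    ("t1", "out", "/f2"),
    ("/f2", "in", "t2"),
}
```

With this graph, a file inside the directory /f1 is subsumed within /f2, because it first climbs to /f1 and then follows the task that produces /f2 from /f1.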
Definition 3.2 (Happens-Before). Given a task graph $G$, we define the binary relation $\prec$ and its transitive closure $\prec^{+}$ on two tasks $t_1, t_2$. The definition is shown in Figure 9.
The happens-before relation $t_1 \prec t_2$ states that the task $t_1$ is executed before $t_2$. The definition of this relation consults the task graph to identify tasks that are connected with each other through a task-dependency edge from $t_1$ to $t_2$, which indicates a dependency between two tasks. Finally, the transitive closure of $\prec$ gives indirect task dependencies. The happens-before relation enables us to verify that two dependent tasks are always executed in the correct order.
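The transitive closure of the happens-before relation amounts to reachability over task-dependency edges, as the following Python sketch illustrates. The edge encoding (triples with a hypothetical `before` label), the task names, and the helper `unordered` are assumptions introduced for the example.

```python
def happens_before(edges, t1, t2):
    """True when t1 reaches t2 through one or more 'before' edges."""
    seen, frontier = set(), [t1]
    while frontier:
        task = frontier.pop()
        for src, label, dst in edges:
            if label == "before" and src == task and dst not in seen:
                if dst == t2:
                    return True
                seen.add(dst)
                frontier.append(dst)
    return False


def unordered(edges, t1, t2):
    """True when the specification fixes no order between the two tasks."""
    return not happens_before(edges, t1, t2) and not happens_before(edges, t2, t1)


# Hypothetical dependencies: compile -> link -> package, and compile -> docs.
EDGES = {
    ("compile", "before", "link"),
    ("link", "before", "package"),
    ("compile", "before", "docs"),
}
```

Two tasks for which `unordered` holds may be scheduled in either order or concurrently, so they are the pairs worth inspecting when both touch the same file.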
3.4.1. Verifying Correctness of Tasks
Using the subsumption relation, we now formalize what the property of correctness means for a buildfs task.
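Definition 3.3 (Missing Input). Given a task graph $G$ and a state $\sigma$, a task task $t$ ($\bar{p}_{\mathit{in}}$): $\bar{p}_{\mathit{out}}$ after $\bar{t}$ manifests a missing input on state $\sigma$, when
$$\exists\, p \in F_c.\ \nexists\, p' \in \bar{p}_{\mathit{in}}.\ p \sqsubseteq^{+} p',$$
where $F_c$ is the set of files consumed by the task when it is evaluated on $\sigma$, and $\sqsubseteq^{+}$ is the transitive closure of the subsumption relation.
In other words, to verify that a task does not contain a missing input issue, we first compute all file accesses made by the task on the given state $\sigma$. Then, we check that every file consumed by this task (i.e., every $p \in F_c$) matches the input files declared in the specification. To do so, we exploit the subsumption relation. In particular, when there exists a path $p$ consumed by this task for which no declared input $p'$ satisfies $p \sqsubseteq^{+} p'$, we say that the task has a missing input on the state $\sigma$. In practice, this means that although the build task relies on $p$ (as the definition of the task consumes $p$), the build system does not trigger the execution of the task whenever $p$ is modified.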
Example. Consider the following build execution and its task graph.
When examining the task , we presume that it does not contain any missing input issues, as it only consumes the file "/f1/f3" (line 4), which is subsumed within the input file /f1 declared at line 1 (recall the [par-dir] rule from Figure 8). On the other hand, the task consumes three files (lines 8–10). For the first access (line 8), the relation holds, as the path consumed by is the same as that declared in the specification. The subsumption relation also holds for the second access, as the file /f1/f3 is an input of the first task, whose output is used as an input for the target task . Therefore, changing this file will first trigger the execution of task . This will eventually cause the invocation of task , because the first task updates the inputs of . This behavior is captured by the [indirect] rule (Figure 8). Finally, the task manifests a missing input for the third access (line 10), because we have .
Definition 3.4 (Missing Output). Given a task graph and a state , a task task : after manifests a missing output on state , when
The definition for missing outputs is conceptually similar to that for missing inputs. This time however, we check that for every file produced by the examined task, the relation holds, where stands for the declared output files found in the specification of the task.
Given the above definitions, we now introduce the notion of correctness for a certain buildfs task.
Definition 3.5 (Correctness of Task). Given a task graph and a state , a task is correct on state , when it does not manifest a missing input or missing output on state .
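The check behind Definitions 3.3 to 3.5 can be sketched in a few lines of Python. This is a simplified illustration, not the formal definitions: file accesses are assumed to be already computed as (kind, path) pairs, the specification is a dictionary with hypothetical `inputs` and `outputs` keys, and the default predicate `within` only covers the [self] and [par-dir] rules. A full subsumption check can be passed in through the `subsumed` parameter.

```python
from pathlib import PurePosixPath


def within(path, declared):
    """Simplified stand-in for subsumption: equal to, or beneath, `declared`."""
    p, d = PurePosixPath(path), PurePosixPath(declared)
    return p == d or d in p.parents


def check_task(spec, accesses, subsumed=within):
    """Return (missing_inputs, missing_outputs) for one task."""
    missing_inputs = [
        path for kind, path in accesses
        if kind == "consumed"
        and not any(subsumed(path, d) for d in spec["inputs"])
    ]
    missing_outputs = [
        path for kind, path in accesses
        if kind == "produced"
        and not any(subsumed(path, d) for d in spec["outputs"])
    ]
    return missing_inputs, missing_outputs


# Hypothetical task that declares /src as input and /build as output.
spec = {"inputs": ["/src"], "outputs": ["/build"]}
accesses = [
    ("consumed", "/src/main.c"),
    ("consumed", "/etc/config.h"),   # not covered by any declared input
    ("produced", "/build/main.o"),
    ("produced", "/tmp/cache"),      # not covered by any declared output
]
print(check_task(spec, accesses))
```

A task is correct on the given state exactly when both returned lists are empty.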
3.4.2. Verifying Correctness of Build Executions
Recall that a build execution in buildfs is a sequence of tasks . A build execution may manifest an ordering violation, when there are pairs of tasks that access a file , at least one of them produces , and there is no ordering constraint between these tasks, i.e., they can be executed in any order.
Definition 3.6 (Ordering Violation). Given a task graph and an initial state , a build execution manifests an ordering violation on state , when such that
Contrary to missing inputs and outputs, the definition for ordering violations checks whether two tasks with a conflicting file access are executed in the right order. To achieve this, we use the happens-before relation . For example, consider a task that creates a file . When the same file is consumed by a task , the must hold. Otherwise, the build system is free to execute before . Therefore, may access a file that does not exist, resulting in a build failure.
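As an illustration, the Python sketch below reports pairs of tasks that touch the same file, where at least one of them produces it and neither task is ordered before the other. It rests on our own simplifying assumptions: conflicts are detected on exact path equality, accesses are given as (task, kind, path) triples, and the happens-before relation is supplied as a predicate (here a hypothetical set of ordered pairs).

```python
from itertools import combinations


def ordering_violations(accesses, ordered):
    """Return sorted (path, task_a, task_b) triples that lack an ordering.

    `ordered(a, b)` must return True when task a happens before task b.
    """
    by_path = {}
    for task, kind, path in accesses:
        by_path.setdefault(path, []).append((task, kind))

    violations = set()
    for path, items in by_path.items():
        for (t1, k1), (t2, k2) in combinations(items, 2):
            if t1 == t2 or "produced" not in (k1, k2):
                continue                      # same task, or two readers
            if not ordered(t1, t2) and not ordered(t2, t1):
                violations.add((path, *sorted((t1, t2))))
    return sorted(violations)


# Hypothetical build: only "gen before compile" is declared.
declared = {("gen", "compile")}
accesses = [
    ("gen", "produced", "/gen/a.c"),
    ("compile", "consumed", "/gen/a.c"),
    ("jar", "consumed", "/gen/a.c"),
]
print(ordering_violations(accesses, lambda a, b: (a, b) in declared))
```

In this example only the pair (gen, jar) is reported: compile is ordered after gen, and compile and jar both merely consume the file.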
Definition 3.7 (Correctness of Build Execution). Given a task graph and an initial state , a build execution is correct, when
the task is correct on state , and when the task is also correct on state for .
the build execution does not manifest an ordering violation on state for .
Definition 3.7 summarizes our approach for verifying a build execution. We begin by examining and evaluating tasks in the order they appear in a buildfs program according to the semantics of Figure 4. The initial state is . Evaluating a build task gives us a new state, and the set of files consumed and produced by the task. We then verify that the task is correct, that is, it does not contain any missing inputs or outputs, while we also check that it does not conflict with any previous task based on Definition 3.6 for ordering violations. Finally, we use the fresh state to evaluate the next task and perform the same verification.
4. Testing Approach
We now present the practical realization of our model, which works in the three phases shown in Figure 10. During the first phase (Generation), we monitor the execution of an instrumented build script, and generate a buildfs representation that models this execution. As we will explain shortly, the instrumentation performed on build scripts provides the generation step with all the necessary high-level information to produce a valid buildfs program, such as the execution boundaries and specification of every build task.
In the second phase (Analysis), we analyze the generated buildfs program, and produce two outcomes. First, we construct the task graph capturing the input / output files, and the dependencies of every buildfs task. Second, we track all file accesses of every task, by evaluating each task in build execution as for , and with .
The final step (Fault Detection) verifies the correctness of the given build execution (modeled in buildfs) using the file accesses and task graph computed by the previous step. Specifically, this phase reports those file accesses that violate the correctness of build execution, i.e., they lead to missing inputs, missing outputs, or ordering violations according to the definitions of Section 3.4.
4.1. Generating BuildFS programs
To model a build execution in buildfs, our dynamic approach takes an instrumented build script as input and monitors its execution. The goal of the instrumentation applied to build scripts is to provide the execution boundaries of every task, and other information coming from build definitions (i.e., declared input / output files and dependencies). To do so, we place instrumentation points before and after the execution of each build task, and augment their execution by calling special native functions. These functions take a string argument that either contains the information originating from build definitions, or indicates when the execution of a task begins or ends. Then, our dynamic analysis identifies these calls, and extracts their arguments to construct the specification of each task, and to map the intermediate file operations to the corresponding task. In this manner, by monitoring these special native function calls and all the other file system operations (e.g., a call to open) that take place while building, we are able to construct buildfs programs.
Examples of native function calls inserted by our instrumentation are writes to standard output. These calls are triggered by inserting simple print statements as part of the instrumentation. Consider again the build scripts of Figure 5. When monitoring their native execution (no instrumentation is added), we observe the file system operations at lines 4–9 that reflect file copying. When instrumenting these scripts, we augment their execution by adding the native function calls at lines 1–3, before the execution of the task target, along with the function call at line 10 after file copying. These calls enable us to identify when the task target begins and ends (lines 1, 10), along with its input / output files (lines 2, 3). Our dynamic analysis detects and examines these calls, and finally produces the buildfs representation shown in Figure 6. Note that without this instrumentation, we are unable to map the file system operations (lines 4–9) to the task they come from, and to build the specification of the currently executed buildfs task.
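The grouping step described above can be illustrated with a short Python sketch. Everything about the event format is an assumption made for this example: we pretend the trace has already been split into ("marker", text) and ("op", text) pairs, and that marker texts start with the hypothetical verbs `begin`, `input`, `output`, and `end`. The actual strings printed by the instrumentation may differ.

```python
def group_by_task(events):
    """Attach file operations to the task whose markers enclose them."""
    tasks, current = [], None
    for kind, payload in events:
        if kind == "marker":
            verb, _, arg = payload.partition(" ")
            if verb == "begin":
                current = {"name": arg, "inputs": [], "outputs": [], "ops": []}
            elif current is None:
                continue                     # stray marker outside a task
            elif verb == "input":
                current["inputs"].append(arg)
            elif verb == "output":
                current["outputs"].append(arg)
            elif verb == "end":
                tasks.append(current)
                current = None
        elif current is not None:
            current["ops"].append(payload)   # operation inside a task
        # operations seen outside any task are ignored in this sketch
    return tasks


# Hypothetical event stream for a single copy task.
events = [
    ("marker", "begin target"),
    ("marker", "input /src"),
    ("marker", "output /dst"),
    ("op", 'consume("/src/a.txt")'),
    ("op", 'produce("/dst/a.txt")'),
    ("marker", "end target"),
    ("op", 'consume("/etc/hosts")'),
]
print(group_by_task(events))
```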
The instrumentation extracts task specifications from the execution engine of the build system, and not from build scripts. To perform a sound execution, every build system is aware of all the dependencies and input / output files of each task at runtime. For example, the build system can recognize all task dependencies to schedule the execution of a task in the correct order. Similarly, in an incremental build, the build system is aware of all declared file inputs to determine which tasks must be executed in response to some identified file updates. Our approach benefits from extracting this information from the execution engine of build systems at runtime for two reasons. First, we do not have to perform static analysis (a challenging task, as we discussed in Section 3.1) to extract this information from build scripts. Second, we can recognize all dependencies and input / output files, including the ones computed dynamically and not explicitly mentioned in build scripts. Details regarding the instrumentation of build scripts are implementation-specific and depend on the underlying build system, as we show in Section 4.3.
4.1.1. From File System Operations to BuildFS Operations
Here we describe how we map a file system operation to a buildfs statement, operation, or expression.
Operations on paths. A system operation that works on paths is translated to either consume or produce operations, depending on its effect on the file system. For example, when a build task creates a new directory through the mkdir("/dir") system call, we emit a produce("/dir") operation. Similarly, when the build system creates a hard link to an existing file by invoking the link("/source", "/target") system call, we yield two buildfs operations: consume("/source"), produce("/target").
Operations on file descriptors. When the build process creates a new file descriptor, we use the let fd operation. For example, when a new file descriptor is created by opening a file, open("/file") = 3, we emit let fd3 = "/file". We do the same when copying an existing file descriptor to a new one, e.g., dup2(3, 4) turns into let fd4 = fd3. Finally, closing a file descriptor leads to del operations.
Working Directory. Each system process operates on a specific directory. In buildfs, we use a special file descriptor variable, namely fd, that points to the working directory of the current process. Whenever the working directory of a process changes (through the chdir system call), we emit a let fd operation to model this effect.
Relative Paths. Some file system operations operate on relative paths. For example, the call to mkdir("dir") creates the directory dir inside the current working directory. We handle relative paths through the at expression. Specifically, we model the above example as produce("dir" at fd).
Forking Processes. When the build system creates a new process from an existing one (e.g., by calling the clone system call), we generate a newproc from statement, where refers to the id of the new process, while is the parent process. Finally, we model the main build process using a newproc statement.
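The translation rules of this section can be summarized in a small Python sketch. It is only an illustration: system calls are assumed to be already parsed into dictionaries with hypothetical `pid`, `name`, `args`, and `ret` fields, the emitted strings merely approximate the buildfs syntax, and the variable name `fd_cwd` for the working directory is our own placeholder.

```python
CWD = "fd_cwd"  # placeholder name for the working-directory variable


def path_expr(path):
    """Render a path, resolving relative ones against the working directory."""
    return f'"{path}"' if path.startswith("/") else f'"{path}" at {CWD}'


def translate(call):
    """Map one parsed system call to a list of buildfs-like operations."""
    name, args = call["name"], call["args"]
    if name == "mkdir":
        return [f"produce({path_expr(args[0])})"]
    if name == "link":
        return [f"consume({path_expr(args[0])})",
                f"produce({path_expr(args[1])})"]
    if name == "open":
        return [f"let fd_{call['ret']} = {path_expr(args[0])}"]
    if name == "dup2":
        return [f"let fd_{args[1]} = fd_{args[0]}"]
    if name == "close":
        return [f"del fd_{args[0]}"]
    if name == "chdir":
        return [f"let {CWD} = {path_expr(args[0])}"]
    if name == "clone":
        return [f"newproc {call['ret']} from {call['pid']}"]
    return []  # system calls that do not affect the model are skipped


# Hypothetical trace fragment.
trace = [
    {"pid": 100, "name": "clone", "args": [], "ret": 101},
    {"pid": 101, "name": "chdir", "args": ["/proj"], "ret": 0},
    {"pid": 101, "name": "mkdir", "args": ["dir"], "ret": 0},
    {"pid": 101, "name": "link", "args": ["/source", "/target"], "ret": 0},
    {"pid": 101, "name": "open", "args": ["/file"], "ret": 3},
    {"pid": 101, "name": "dup2", "args": [3, 4], "ret": 4},
    {"pid": 101, "name": "close", "args": [3], "ret": 0},
]
for call in trace:
    for op in translate(call):
        print(op)
```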
4.2. Analyzing BuildFS Programs & Detecting Faults
After modeling a build execution in a buildfs representation, our method performs a linear pass over the representation and produces two types of output. First, it generates the corresponding task graph, and second, it computes all file accesses that take place in every buildfs task based on the semantics presented in Figure 4.
In the final step of our method (fault detection), we verify the correctness of a build execution based on the task graph and the file accesses computed in the analysis step. In particular, we examine the file accesses of every task , and we proceed as follows. If a file access is of type “consumed” (“produced”), and the subsumption relation (Section 3.4.1) between and the file inputs (outputs) of does not hold, we report a missing input (output) on . For ordering violations, we check whether was accessed elsewhere (say, by another task ) in the given buildfs program. If this is the case, we verify whether the execution order between and is deterministic using the happens-before relation. If the happens-before relation between these tasks is undefined, we report an ordering violation. Our fault detection approach eventually reports all file accesses that violate the correctness of the given build execution according to the definitions of Section 3.4.
The faults related to parallelism manifest themselves non-deterministically (depending on the execution schedule of the build system), while the ones associated with incrementality do not appear in full builds. However, our technique is capable of detecting subtle and future latent faults, because it does not require the build to crash and then reason about the root cause of the failure.
We have implemented our method as a command-line OCaml program, which we plan to make publicly available as open-source software. To trace system operations, we employ strace [McDougall et al., 2006]. Note that this can also be accomplished by using either other system call tracing utilities (e.g., DTrace [Rodriguez, 1986]) or dynamic binary instrumentation [Bruening et al., 2012, Nethercote and Seward, 2007].
Our tool parses the strace output and translates it into a buildfs representation. The implementation supports two modes: offline and online. In the offline mode, our tool does not monitor builds. Instead, it expects a file containing the strace output obtained from previous runs. When in online mode, the tool generates and analyzes a buildfs program while monitoring a build command through strace. To do so, it creates two processes. The first process runs strace on the build command, while the second reads the strace output produced by the first process and runs the buildfs generation and analysis steps in a streaming fashion. Communication is done through pipes, which allow the two processes to run concurrently. Notably, this eliminates the observable time spent on the analysis phase, because running the build is much slower than the analysis of the corresponding buildfs programs. Therefore, on a multicore architecture, our tool exploits a spare core to perform the analysis as the build runs.
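The online mode can be illustrated with a minimal Python sketch of the producer/consumer arrangement. It is not our OCaml implementation: a small child Python process stands in for strace and prints a few trace-like lines, and the function names `stream_lines` and `analyze` are hypothetical. The point is only that the consumer processes each line as soon as the pipe delivers it, while the producer keeps running.

```python
import subprocess
import sys


def stream_lines(command):
    """Run `command` and yield its standard output line by line via a pipe."""
    with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            yield line.rstrip("\n")


def analyze(lines):
    """Toy streaming analysis: count how often each system call appears."""
    counts = {}
    for line in lines:
        name = line.split("(", 1)[0]
        counts[name] = counts.get(name, 0) + 1
    return counts


# Stand-in producer that emits trace-like lines (assumption: in practice the
# first process would run the tracer on the build command instead).
producer = [
    sys.executable, "-c",
    "print('open(\"/a\") = 3'); print('close(3) = 0'); print('open(\"/b\") = 3')",
]
print(analyze(stream_lines(producer)))
```

Because `stream_lines` is a generator, the analysis never needs the whole trace in memory, which mirrors the streaming behavior described above.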
To instrument Gradle scripts, we have implemented a Gradle plugin written in Kotlin that hooks before and after the execution of every task, as shown in Figure 11(a) (lines 1, 14; irrelevant code is omitted). The plugin utilizes the Gradle API [Gradle Inc., 2020b] to print the following elements: (1) declared inputs / outputs of every task (lines 3–5, 6–8), (2) declared dependencies of every task (lines 9–11), and (3) execution boundaries of every task (lines 12, 16). This output is identified by our dynamic analysis and converted to buildfs tasks as explained in Section 4.1. To apply our plugin to a Gradle project, we modify Gradle scripts by inserting only four lines of code.
For instrumenting Make scripts, we created a shell script (fsmake-shell) that wraps the execution of every Make rule (Figure 11(b)). As with Gradle, this script prints the execution boundaries (lines 4, 8) and prerequisites of each task (line 5). To achieve this, we override Make’s built-in variable $SHELL to point to our script. After printing the necessary information, our script invokes the underlying shell to eventually execute the requested Make command (line 7). To handle Make dependencies generated at build time (e.g., through gcc -MD), we refine the task graph computed during the analysis phase by adding missing edges. To do so, we exploit information stemming from the Make database by running make -pn after each build. Note that we do not need to make any changes in the source code of build scripts to enable tracing; we simply build projects by running
Applying our method to a new build tool requires little development effort. Our Gradle plugin contains 90 lines of Kotlin code, while fsmake-shell consists of only 8 lines of shell code.
Limitations. Currently, our tool can trace builds only in a Linux environment. However, extending our implementation to support monitoring on other platforms is straightforward. Also, strace introduces a 2x slowdown on builds, on average (see Section 5.5). Employing a tracing utility that runs in kernel space to track system operations may reduce the overhead on build execution [Celik et al., 2017]. Furthermore, non-deterministic builds (e.g., builds touching different files on different days) may lead to false negatives when a faulty file access does not happen while running the build. However, unlike other approaches (e.g., mkcheck), our tool can cope with non-determinism occurring in subsequent builds (e.g., temporary files generated with random names), because it requires a single build for performing the verification.
We evaluate our approach by answering the following research questions:
RQ1 (Effectiveness) What is the effectiveness of our approach in locating faults in build scripts? (Section 5.2)
RQ2 (Fault Importance) What is the perception of developers regarding the detected faults? (Section 5.3)
RQ3 (Fault Patterns) What are the main fault patterns? (Section 5.4)
RQ4 (Performance) What is the performance of our approach? (Section 5.5)
RQ5 (Comparison with state-of-the-art) How does the proposed approach perform with regard to other tools (i.e., mkcheck)? (Section 5.6)
5.1. Experimental Setup
We applied our approach to a large set of Gradle and Make projects. To identify interesting Gradle projects, we employed the GitHub API to search for popular Java, Kotlin, and Groovy repositories that use Gradle. We selected 200 projects for each language (i.e., 600 projects in total) ordered by the number of stars. For every project, we performed the following steps. First, we instrumented the Gradle scripts as described in Section 4.3. Then, we ran the instrumented Gradle scripts through the gradle build command, which is the de facto command for building Gradle projects. Note that this command executes the compilation, assembling, and testing tasks as well as other user-defined tasks. For efficiency, we ran our tool in online mode (Section 4.3). In the end, we successfully analyzed and generated reports for 312 projects. The builds of the remaining projects failed because they required human intervention, e.g., to set up a specific environment for the build. This was also observed in prior work [Hassan et al., 2017].
For diversity, we discovered Make projects from two sources. First, we used the GitHub API to collect popular C/C++ projects. Second, to ensure the buildability of the examined projects, we also employed the Ultimate Debian Database (UDD) [Debian, 2020b] to identify widely-used Debian packages based on the “vote” metric, which indicates the number of people who regularly use a specific package [Avery Pennarun and Reinholdtsen, 2020]. The build workflow of Debian packages uses the sbuild utility [Debian, 2020a], which automates the build process of Debian binary packages by creating the necessary build environment (e.g., it installs all build dependencies in an isolated environment) for a particular architecture (e.g., x86-64). sbuild allows us to hook into the build phase of its process. In this manner, we can monitor each build and perform our own analysis. We built every Debian package using our Make wrapper (Section 4.3) instead of the default Make command. In total, we examined 300 Make projects coming from the GitHub and Debian ecosystems. Overall, the list of the selected Gradle and Make projects contains popular ones (e.g., the SQLite database, the Spring framework, and more) which involve complex build scripts. The characteristics of the projects are summarized in Table 1.
We ran every Gradle and Make build in sequential mode, as in the work of Licker and Rice [2019], to make fair comparisons against mkcheck. However, our approach is able to support parallel builds by tracking the thread (and its descendants) where every build task is running. Finally, we ran the builds in Docker containers on a host machine with an 8-core Intel i7 3.6 GHz processor and 16 GB of RAM.
5.2. RQ1: Fault Detection Results
[Table 1. Project characteristics (Build System, Projects, Avg. LoC, Avg. BLoC) and fault types (MIN: missing inputs, MOUT: missing outputs, OV: ordering violations).]
Table 1 summarizes our fault detection results. Our method identified problematic builds in 73 out of 312 Gradle projects. There are 157 issues related to incremental builds, of which 122 faults are missing inputs appearing in 58 projects, while the remaining faults (35) are associated with missing outputs found in 20 projects. Faulty parallel builds are also common in Gradle projects, as we uncovered 80 ordering violations in 25 Gradle repositories. Furthermore, our tool detected issues in 251 Make projects; it discovered 15,740 Make target rules with missing inputs. Most of them involved missing header dependencies concerning object files. It also reported 14 ordering violations that may lead to race conditions in 5 projects. Note that missing outputs are only relevant to Gradle (Section 2.1). For this reason, we modeled every Make rule as a buildfs task that was producing (i.e., any file). Therefore, based on the definition of the subsumption relation (Figure 8), a missing output issue (Definition 3.4) is not possible for Make.
To verify that the issues detected by our tool are indeed faults, we worked as follows. For Gradle projects, we examined each fault report, and tried to reproduce it. Specifically, we automatically verified each issue related to incremental builds by checking that re-running Gradle does not trigger the execution of tasks marked with missing inputs/outputs by our tool, even after updating the contents of their dependent files. We followed the same automated approach for the verification of the reported Make faults associated with incremental builds. For ordering violations, we manually verified that executing conflicting build tasks in the erroneous order can affect the outcome of a build, e.g., causing build failures, or producing build targets with incorrect contents.
5.3. RQ2: Fault Importance
We provided fixes for 71 Make and Gradle projects that we chose while examining their fault detection results, and in turn, we submitted patches to the upstream developers. Patch generation was done manually, mostly because of the complex structure of the faulty projects, and the peculiar semantics of build systems’ DSLs (especially that of Gradle). We leave repairing build scripts through automated means as future work.
Table 2 enumerates the faults that are confirmed and fixed. Notably, 235 issues found in 45 out of 71 projects were fixed, while most of the remaining patches are in a pending state. The list of projects where our patches were accepted contains popular projects, such as tinyrenderer (~8k stars), caffeine (> 7k stars), aeron (~5k stars), Cello (~5k stars), and more. The list also includes projects that are maintained and developed by well-established organisations, such as conductor (developed by Netflix) and tsar (developed by the Alibaba Group). This indicates that the faults we identified do matter to the community.
5.4. RQ3: Fault Patterns
When we manually examined the issues generated by our tool, we recognized the following five fault patterns, which result in build failures, time-consuming builds, or erroneous build outcomes. We identified three kinds of faults related to incremental builds caused by missing inputs or outputs.
Test resources. To ensure the correctness of their programs, developers typically specify dedicated build rules for performing different forms of testing (e.g., unit and functional testing) during build. Running tests is a time-consuming task [Gligoric et al., 2015], so the build rules associated with tests are triggered only when there are updates to any of the source files that tests rely on. As with source files, changing any of the resources used by tests (e.g., test data or additional helper scripts) must re-run tests to make sure that the change does not break anything. Not running tests is a missed opportunity to identify potential issues and may lead to late identification of bugs.
For example, the Gradle project kscript contains a test suite of Kotlin files included in the test/resources directory. The tests of kscript contain assertions that rely on the state and contents of the files included in this test suite. However, the developers failed to declare the test suite directory as an input of the Gradle task test. Our tool detected this fault, and we reported it to the developers, who fixed it.
Stale artifacts. As already discussed, the main goal of a build is to construct artifacts, such as executables, libraries, documentation accompanying software, and more. The build process must re-generate these artifacts, when any of the files used for their construction is updated since the last build. Failing to do so can lead to stale artifacts, which in turn, can either harm the reliability of applications, (e.g., cause runtime errors), or generate wrong build outputs.
This pattern is particularly common in Make builds where developers do not enumerate the dependencies of object files correctly. As an example of stale artifacts, recall the build script of Figure 1. This example demonstrates that even best practices for tracking dependencies automatically (e.g., through gcc -MD) are not sufficient for ensuring the correctness of builds.
Time-consuming tasks. The purpose of incremental builds is to reduce build time by running only the build tasks needed to achieve a specific goal. This boosts productivity, as it enables developers to get feedback and respond to changes in their codebase much earlier. To avoid unnecessary computation, it is important that time-consuming build tasks are incremental.
Below we discuss two categories of faults related to parallel builds.
Conflicting Producers. We have identified issues associated with tasks that produce the same file or write to the same output directory. Parallel execution of such build tasks is harmful, because race conditions may emerge, when two tasks affecting the same state (i.e., files) run concurrently.
The Gradle script of Figure 2 is an example of conflicting producers. Figure 13 shows another example coming from the libcs50 project. This Make script defines a rule (line 2) that creates two libraries inside the build/lib directory (see variable $(LIBS)). The code first compiles the source file into the corresponding object file (line 3) from which a shared library, namely $(LIB_BASE), is constructed (line 4). Then, it creates a symbolic link ($(LIB_VERSION)) pointing to the newly-created library (line 6), and finally moves these files to the /build/lib directory (line 7). The official documentation of GNU Make states that such a rule definition is incorrect [GNU Make, 2020b]. In our example, the rule at line 2 is executed twice (once for every target defined in the $(LIBS) variable). Consequently, the parallel build might crash with the error “mv: cannot stat 'libcs50.so.10.1.0': No such file or directory”, as the two rule executions race against each other. Specifically, when the second rule invocation attempts to move the libraries, they may have already been moved to /build/lib by the first one. The developers of libcs50 immediately fixed this problem.
Generated Source Files and Resources. Many projects generate part of their source code or resources at build time. These automatically generated source files and resources are then compiled or used later by other build tasks to form the final artifacts of the build process, e.g., binaries. Developers must be careful to preserve the correct execution order between the build tasks that are responsible for generating and using these source files and resources. Ordering violations (e.g., compiling code when source files are missing) are the root cause of build failures, or of subtle errors detected at a later stage of the software lifecycle.
Figure 14 presents a code fragment taken from the popular caffeine project. The code specifies that the source files of the project are stored in the build/generated-sources directory (line 1). These source files are generated automatically by the Gradle task generateNodes. To do so, this task runs the class NodeFactoryGenerator with "build/generated-sources" as an argument (lines 2–5). Then, this code applies the plugin "com.bmuschko.nexus" used for uploading the sources jar file to a remote repository. To assemble a jar file containing the source files of the application, this plugin adds the sourcesJar task to the project. The problem with this code is that no dependency is declared between the tasks generateNodes and sourcesJar. Thus, the build process uploads empty artifacts to the remote repository, when Gradle executes sourcesJar before generateNodes. The developers of caffeine confirmed and fixed this ordering issue.
5.5. RQ4: Performance
To measure the performance of our approach, we recorded the time spent at each step (recall Figure 10). The generation step, which is responsible for executing and monitoring our instrumented builds, dominates the execution time of our method. In particular, this step slows down both Gradle and Make builds by a factor of around two for the 90th percentile of the examined projects. This is consistent with the recent literature [Licker and Rice, 2019], as the main overhead of this phase stems from the system call tracing utility (i.e., strace).
The analysis of buildfs programs takes around 2.47 and 5.1 seconds on average for Make and Gradle projects respectively, and is linear in the size of programs. This phase is efficient enough to analyze gigabytes of programs in a reasonable time (e.g., 6.9 GB in less than 3 minutes). In online mode, though, the observable time spent on the analysis step is eliminated, as the overall time is bounded by the time needed for a build. As explained in Section 4.3, this is because the processing of buildfs programs is faster than the build itself, and thus we take advantage of multicore architectures. Finally, the fault detection step is fast; it takes only 0.11 and 0.45 seconds on average for Gradle and Make projects respectively.
5.6. RQ5: Comparison with state-of-the-art
As a first step, we built and analyzed a number of Gradle projects with mkcheck. After finding that mkcheck produces meaningless reports and an overwhelming number of false positives (this is not surprising because mkcheck is unable to deal with Java-based builds, as explained in Section 3.1), we focused on performing comparisons only for Make projects. We applied mkcheck to the 300 Make projects, and we recorded the fault reports and the time spent at each phase (i.e., build time and fault detection time). Note that build time includes the time needed for building the project as well as the time taken for generating and analyzing the build trace.
In terms of fault detection, mkcheck produced false positives in three cases due to the granularity of processes. That is, two build tasks were merged into a single task because they were governed by the same system process, leading to imprecision. False positives are also observed in the initial work of Licker and Rice [2019]. On the contrary, our approach did not generate false positives, as it can reliably determine all file accesses of each task, as explained in Section 4.1.
Figure 15 demonstrates the relative times between building and detecting faults with our approach and mkcheck. Notice that our tool and mkcheck spend almost the same amount of time for building and monitoring a project. The maximum speedup is 9x, the minimum is -1.25x, while the average is 1.19x. Although mkcheck uses ptrace for tracking system operations, which is 20% faster than strace [Licker and Rice, 2019], our approach benefits from running the generation and analysis steps concurrently. Moving to fault detection times, we observe that our approach is much faster. Specifically, we can detect faulty builds up to six orders of magnitude faster than mkcheck, and even the minimum speedup is 83x. This huge speedup is explained by the fact that our approach needs only one build to uncover faults. Notably, in projects consisting of a large number of source files, mkcheck required days to detect faults (e.g., it spent 3.3 days analyzing ghostscript). This is because mkcheck performs an incremental build per source file to verify correctness, something that hinders its scalability. Finally, when considering the overall time (i.e., build + fault detection time), our approach is 74x faster than mkcheck, on average, while the maximum and minimum speedups are 1,837x and 2.6x respectively.
We also provide some absolute performance times on Table 5.6. Overall, our method required 21 seconds (on average) for analyzing builds and detecting faults (see phase “Overall” on Table 5.6), while mkcheck spent .,3390 seconds on average for performing the same tasks. The median time is 4.1 and 186.75 seconds for our tool and mkcheck respectively. Notably, mkcheck spent more than ten minutes for reporting faults in the 29% of the inspected Make projects.
These findings indicate that our method is superior to the state-of-the-art in terms of both fault detection and performance. Moreover, due to its effectiveness and efficiency, we argue that our method can be used in practice as part of the software testing pipeline.
6. Related Work
Our work is related to four research areas: testing and debugging builds, understanding and refactoring builds, trace analysis, and regression test selection.
Testing and Debugging Builds. Testing build scripts is an emerging research area. mkcheck [Licker and Rice, 2019] and bee [Bezemer et al., 2017] are two tools that also detect missing inputs, but they are tailored for Make-based builds. As we pointed out in Section 3.1, these tools have two important limitations that concern: (1) low precision when applied to Java build tools, and (2) efficiency and applicability. As we explained earlier, our approach tackles both limitations.
Beyond testing, a number of studies have aimed to identify the root causes of problematic builds and suggest fixes for them. Al-Kofahi et al. [2014] have designed a tool that, given a failed build, identifies the faulty Make rules that caused the build crash. Their approach instruments a Make build to track the execution trace of each Make rule and to record the crash point of the build. Based on a probabilistic model, they assign a score to every rule, indicating the probability that the rule caused the crash. Subsequent work [Ren et al., 2018, 2019] has focused on locating faults in unreproducible builds. Reproducibility is a property that ensures that a build is deterministic and always results in bitwise-identical targets given the same sources and build environment. The initial work of Ren et al. [2018] analyzes the logs from an unreproducible build, and proposes a ranked list of problematic source files that might contain the fault. Recently, the authors extended their approach [Ren et al., 2019] to locate the specific command that is responsible for the unreproducible build. To do so, they employed a backtracking analysis on the system call trace stemming from build execution. Contrary to these approaches, our method does not require a build failure; it is capable of detecting latent faults that may manifest in future builds.
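To make the reproducibility property concrete, the following sketch checks whether two build output directories contain bitwise-identical targets by comparing SHA-256 digests. It only illustrates the property itself and is not the localization technique of Ren et al.; the function names and the directory layout are assumptions.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def digest_tree(root: Path) -> Dict[str, str]:
    """Map every file under root (by relative path) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def differing_targets(build_a: Path, build_b: Path) -> List[str]:
    """Return targets that are missing from one build or differ in content."""
    a, b = digest_tree(build_a), digest_tree(build_b)
    return sorted(name for name in a.keys() | b.keys()
                  if a.get(name) != b.get(name))
```

An empty result for two builds of the same sources in the same environment is consistent with a reproducible build; any listed target is a starting point for localization.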
Understanding and Refactoring Builds. Many tools have been developed over the past decade to assist developers in understanding and refactoring builds. Makao [Adams et al., 2007] is a Make-related framework used for visualizing build dependencies. By extracting knowledge from such dependencies through filtering and querying, Makao provides support for refactoring build scripts via an aspect-oriented approach. SYMake [Tamrawi et al., 2012] evaluates Makefiles and produces (1) a symbolic dependency graph, and (2) a symbolic execution trace. Then, it applies different algorithms to the results to detect a number of code smells (e.g., cyclic dependencies), and to perform refactoring on Make scripts (e.g., target renaming). Metamorphosis [Gligoric et al., 2014] is a tool used to migrate existing build scripts to CloudMake [Christakis et al., 2014], which is a modern build system developed by Microsoft. As a starting point, Metamorphosis analyzes the execution trace of a given build and then automatically synthesizes an initial CloudMake script that reflects the behavior of the original script. Then, it optimizes the synthesized build script by applying a sequence of transformations and choosing the best possible ones based on a fitness function. Vakilian et al. [2015] propose a new refactoring method, target decomposition, for dealing with underutilized targets, a build-related code smell that causes slower builds, larger binaries, and less modular code.
Trace Analysis. Most of the existing work [Gligoric et al., 2014, van der Burg et al., 2014, Licker and Rice, 2019, Ammons, 2006, Ren et al., 2019, Sotiropoulos et al., 2019] that is relevant to the domain of builds employs techniques for analyzing traces, especially system call traces. Our work differs from the previous approaches in that the proposed model (buildfs) and, in turn, its practical realization capture both the dynamic behavior and the static specification of high-level programming constructs (i.e., build tasks). Contrary to existing approaches, this enables us to verify the execution of a build phase with regard to its specification while monitoring the build, making our method more precise, efficient, and generally applicable.
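The idea of checking a task's observed execution against its specification can be sketched as follows. This is a deliberately simplified illustration and not the buildfs model itself: `TaskSpec`, `undeclared_accesses`, and the flat sets of file paths are hypothetical stand-ins for the static specification and the dynamic file accesses of a build task.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class TaskSpec:
    """Static specification of a build task: what it declares to read and write."""
    name: str
    declared_inputs: Set[str] = field(default_factory=set)
    declared_outputs: Set[str] = field(default_factory=set)


def undeclared_accesses(spec: TaskSpec,
                        reads: Set[str],
                        writes: Set[str]) -> Dict[str, Set[str]]:
    """Compare observed file accesses of a task against its specification."""
    return {
        # Files the task itself wrote are not treated as missing inputs
        # when it reads them back.
        "undeclared_reads": reads - spec.declared_inputs - writes,
        "undeclared_writes": writes - spec.declared_outputs,
    }


spec = TaskSpec("compile", {"main.c"}, {"main.o"})
report = undeclared_accesses(spec,
                             reads={"main.c", "util.h"},
                             writes={"main.o"})
```

In this example the task reads `util.h` without declaring it, so the file appears under `undeclared_reads`; an update to that file would not trigger the task in an incremental build.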
Regression Test Selection. Recently, there have been advances in dynamic regression test selection (rts) techniques [Gligoric et al., 2015, Wang et al., 2018, Celik et al., 2017, Zhang, 2018]. Dynamic rts methods improve the performance of regression testing by running only those tests affected by a specific code change. To do so, they compute test dependencies from previous test runs. Gligoric et al. [2015] and Celik et al. [2017] extract test dependencies by determining the execution boundaries of each test, and tracking all intermediate file accesses. rts methods and our technique are complementary; they can both be used as part of a build to improve efficiency and reliability, respectively.
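The selection step of file-based dynamic rts can be sketched in a few lines. This is a minimal illustration under the assumption that the dependencies of each test were already recorded as a set of file paths in a previous run; the function name and data layout are hypothetical and do not reproduce any of the cited tools.

```python
from typing import Dict, Iterable, List, Set


def select_tests(test_dependencies: Dict[str, Set[str]],
                 changed_files: Iterable[str]) -> List[str]:
    """Return the tests whose recorded file dependencies overlap the change."""
    changed = set(changed_files)
    return sorted(test for test, deps in test_dependencies.items()
                  if deps & changed)


# Dependencies recorded from a previous (hypothetical) test run.
recorded = {
    "test_parser": {"src/parser.py", "src/lexer.py"},
    "test_lexer": {"src/lexer.py"},
    "test_cli": {"src/cli.py"},
}
selected = select_tests(recorded, ["src/lexer.py"])
```

A practical tool would additionally run every test that has no recorded dependencies yet (e.g., newly added tests), since nothing is known about what such tests access.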
We developed a generic and practical approach for discovering faults that can cause incremental and parallel build failures. To do so, we proposed a model (buildfs) for arbitrary build executions that captures the static specification and the dynamic behavior of each build task. We then formally defined three types of faults concerning incrementality and parallelism, and presented an approach for exploiting buildfs and detecting such faults in practice. Combining static and dynamic information in a single representation made our method efficient and applicable to any build system.
Our method was able to uncover issues in hundreds of Make and Gradle builds. Notably, our approach tackled the limitations of existing work, and it is the first to deal with Java-based build tools. We demonstrated the importance of the discovered faults by providing patches to numerous projects. Thanks to our tool, the developers of 45 open-source projects confirmed and fixed 235 issues, in total. Moreover, a comparison between our tool and a state-of-the-art Make-based tool showed that our approach is more effective and orders of magnitude faster. We argue that our tool could be part of the software testing pipeline, helping developers to discover defects and inconsistencies in their software artifacts that arise from faulty build definitions.
- Adams et al. [2007] B. Adams, H. Tromp, K. de Schutter, and W. de Meuter. Design recovery and maintenance of build systems. In 2007 IEEE International Conference on Software Maintenance, pages 114–123, Oct 2007. doi: 10.1109/ICSM.2007.4362624.
- Al-Kofahi et al. [2014] J. Al-Kofahi, H. V. Nguyen, and T. N. Nguyen. Fault localization for build code errors in Makefiles. In Companion Proceedings of the 36th International Conference on Software Engineering, ICSE Companion 2014, pages 600–601, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2768-8. doi: 10.1145/2591062.2591135. URL http://doi.acm.org/10.1145/2591062.2591135.
- Ammons [2006] G. Ammons. Grexmk: Speeding up scripted builds. In Proceedings of the 2006 International Workshop on Dynamic Systems Analysis, WODA ’06, pages 81–87, New York, NY, USA, 2006. ACM. ISBN 1-59593-400-6.
- Avery Pennarun and Reinholdtsen [2020] B. A. Avery Pennarun and P. Reinholdtsen. Debian popularity contest. https://popcon.debian.org/, 2020.
- Bazel [2020] Bazel. Build and test software of any size, quickly and reliably. https://bazel.build, 2020.
- Bezemer et al. [2017] C.-P. Bezemer, S. McIntosh, B. Adams, D. M. German, and A. E. Hassan. An Empirical Study of Unspecified Dependencies in Make-Based Build Systems. Empirical Software Engineering, 22(6):3117–3148, 2017.
- Bruening et al. [2012] D. Bruening, Q. Zhao, and S. Amarasinghe. Transparent dynamic instrumentation. In Proceedings of the 8th ACM SIGPLAN/SIGOPS Conference on Virtual Execution Environments, VEE ’12, pages 133–144, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1176-2. doi: 10.1145/2151024.2151043. URL http://doi.acm.org/10.1145/2151024.2151043.
- Calcote [2020] J. Calcote. Autotools: A Practitioner’s Guide to GNU Autoconf, Automake, and Libtool. No Starch Press, 2020.
- Celik et al. [2016] A. Celik, A. Knaust, A. Milicevic, and M. Gligoric. Build system with lazy retrieval for Java projects. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, pages 643–654, New York, NY, USA, 2016. Association for Computing Machinery. ISBN 9781450342186. doi: 10.1145/2950290.2950358. URL https://doi.org/10.1145/2950290.2950358.
- Celik et al. [2017] A. Celik, M. Vasic, A. Milicevic, and M. Gligoric. Regression test selection across JVM boundaries. In Symposium on the Foundations of Software Engineering, pages 809–820, 2017.
- Christakis et al. [2014] M. Christakis, R. Leino, and W. Schulte. Formalizing and verifying a modern build language. In FM 2014: Formal Methods - 19th International Symposium, Singapore, May 12-16, 2014. Proceedings, volume 8442 of Lecture Notes in Computer Science, pages 643–657. Springer, May 2014. ISBN 978-3-319-06409-3. URL https://www.microsoft.com/en-us/research/publication/formalizing-and-verifying-a-modern-build-language/.
- Coetzee et al. [2011] D. Coetzee, A. Bhaskar, and G. Necula. apmake: A reliable parallel build manager. In 2011 USENIX Annual Technical Conference (USENIX), 2011.
- Dashenkov [2020] D. Dashenkov. Gradle task ordering constraints. https://github.com/SpineEventEngine/base/issues/516, 2020.
- Debian [2020a] Debian. sbuild. https://wiki.debian.org/sbuild, 2020a.
- Debian [2020b] Debian. Ultimatedebiandatabase. https://wiki.debian.org/UltimateDebianDatabase/, 2020b.
- Derr et al. [2017] E. Derr, S. Bugiel, S. Fahl, Y. Acar, and M. Backes. Keep me updated: An empirical study of third-party library updatability on Android. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS ’17, pages 2187–2200, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-4946-8. doi: 10.1145/3133956.3134059. URL http://doi.acm.org/10.1145/3133956.3134059.
- Erdweg et al. [2015] S. Erdweg, M. Lichter, and M. Weiel. A sound and optimal incremental build system with dynamic dependencies. SIGPLAN Not., 50(10):89–106, Oct. 2015. ISSN 0362-1340. doi: 10.1145/2858965.2814316. URL https://doi.org/10.1145/2858965.2814316.
- Feldman [1979] S. I. Feldman. Make—a program for maintaining computer programs. Software: Practice & Experience, 9(4):255–265, 1979.
- Gligoric et al. [2014] M. Gligoric, W. Schulte, C. Prasad, D. van Velzen, I. Narasamdya, and B. Livshits. Automated migration of build scripts using dynamic analysis and search-based refactoring. In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA ’14, pages 599–616, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2585-1. doi: 10.1145/2660193.2660239. URL http://doi.acm.org/10.1145/2660193.2660239.
- Gligoric et al. [2015] M. Gligoric, L. Eloussi, and D. Marinov. Practical regression test selection with dynamic file dependencies. In International Symposium on Software Testing and Analysis, pages 211–222, 2015.
- GNU Make [2020a] GNU Make. Generating prerequisites automatically. https://www.gnu.org/software/make/manual/html_node/Automatic-Prerequisites.html, 2020a.
- GNU Make [2020b] GNU Make. Handling tools that produce many outputs. https://www.gnu.org/software/automake/manual/html_node/Multiple-Outputs.html, 2020b.
- Gradle Inc. [2020a] Gradle Inc. Build cache. https://docs.gradle.org/current/userguide/build_cache.html, 2020a.
- Gradle Inc. [2020b] Gradle Inc. Developing custom Gradle plugins. https://docs.gradle.org/current/userguide/custom_plugins.html, 2020b.
- Gradle Inc. [2020c] Gradle Inc. Gradle vs Maven: Performance comparison. https://gradle.org/gradle-vs-maven-performance/, 2020c.
- Gradle Inc. [2020d] Gradle Inc. Gradle - plugins. https://plugins.gradle.org/, 2020d.
- Greene [2015] S. Greene. Introducing incremental build support. https://blog.gradle.org/introducing-incremental-build-support, 2015.
- Hassan et al. [2017] F. Hassan, S. Mostafa, E. S. L. Lam, and X. Wang. Automatic building of Java projects in software repositories: A study on feasibility and challenges. In Proceedings of the 11th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM ’17, pages 38–47, Piscataway, NJ, USA, 2017. IEEE Press. ISBN 978-1-5090-4039-1. doi: 10.1109/ESEM.2017.11. URL https://doi.org/10.1109/ESEM.2017.11.
- Hilton et al. [2016] M. Hilton, T. Tunnell, K. Huang, D. Marinov, and D. Dig. Usage, costs, and benefits of Continuous Integration in open-source projects. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, ASE 2016, pages 426–437, New York, NY, USA, 2016. Association for Computing Machinery. ISBN 9781450338455. doi: 10.1145/2970276.2970358. URL https://doi.org/10.1145/2970276.2970358.
- Karanpuria and Roy [2018] R. Karanpuria and A. S. Roy. Kotlin Programming Cookbook: Explore more than 100 recipes that show how to build robust mobile and web applications with Kotlin, Spring Boot, and Android. Packt Publishing Ltd, 2018.
- Konat et al. [2018] G. Konat, S. Erdweg, and E. Visser. Scalable incremental building with dynamic task dependencies. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, pages 76–86, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450359375. doi: 10.1145/3238147.3238196. URL https://doi.org/10.1145/3238147.3238196.
- Licker and Rice [2019] N. Licker and A. Rice. Detecting incorrect build rules. In Proceedings of the 41st International Conference on Software Engineering, ICSE ’19, pages 1234–1244, Piscataway, NJ, USA, 2019. IEEE Press.
- Martin and Hoffman [2010] K. Martin and B. Hoffman. Mastering CMake: a cross-platform build system. Kitware, 2010.
- McDougall et al. [2006] R. McDougall, J. Mauro, and B. Gregg. Solaris Performance and Tools: DTrace and MDB Techniques for Solaris 10 and OpenSolaris. Prentice Hall PTR, Upper Saddle River, 2006. ISBN 0131568191.
- McIntosh et al. [2011] S. McIntosh, B. Adams, T. H. Nguyen, Y. Kamei, and A. E. Hassan. An empirical study of build maintenance effort. In Proceedings of the 33rd International Conference on Software Engineering, ICSE ’11, pages 141–150, New York, NY, USA, 2011. Association for Computing Machinery. ISBN 9781450304450. doi: 10.1145/1985793.1985813. URL https://doi.org/10.1145/1985793.1985813.
- McIntosh et al. [2015] S. McIntosh, M. Nagappan, B. Adams, A. Mockus, and A. E. Hassan. A large-scale empirical study of the relationship between build technology and build maintenance. Empirical Software Engineering, 20(6):1587–1633, Dec 2015. ISSN 1573-7616. doi: 10.1007/s10664-014-9324-x. URL https://doi.org/10.1007/s10664-014-9324-x.
- Morgenthaler et al. [2012] J. D. Morgenthaler, M. Gridnev, R. Sauciuc, and S. Bhansali. Searching for build debt: Experiences managing technical debt at Google. In 2012 Third International Workshop on Managing Technical Debt (MTD), pages 1–6, June 2012. doi: 10.1109/MTD.2012.6225994.
- Nethercote and Seward [2007] N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. In Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’07, pages 89–100, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-633-2. doi: 10.1145/1250734.1250746. URL http://doi.acm.org/10.1145/1250734.1250746.
- Pelgrims [2015] K. Pelgrims. Gradle for Android. Packt Publishing Ltd, 2015.
- Ren et al. [2018] Z. Ren, H. Jiang, J. Xuan, and Z. Yang. Automated localization for unreproducible builds. In Proceedings of the 40th International Conference on Software Engineering, ICSE ’18, pages 71–81, New York, NY, USA, 2018. ACM. ISBN 978-1-4503-5638-1. doi: 10.1145/3180155.3180224. URL http://doi.acm.org/10.1145/3180155.3180224.
- Ren et al. [2019] Z. Ren, C. Liu, X. Xiao, H. Jiang, and T. Xie. Root cause localization for unreproducible builds via causality analysis over system call tracing. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 527–538, Nov 2019. doi: 10.1109/ASE.2019.00056.
- Rodriguez [1986] R. Rodriguez. A system call tracer for UNIX. In USENIX Conference Proceedings, pages 72–80, Berkeley, CA, Summer 1986. USENIX Association.
- Sotiropoulos et al. [2019] T. Sotiropoulos, D. Mitropoulos, and D. Spinellis. Detecting missing dependencies and notifiers in Puppet programs. arXiv preprint arXiv:1905.11070, 2019.
- Tamrawi et al. [2012] A. Tamrawi, H. A. Nguyen, H. V. Nguyen, and T. N. Nguyen. SYMake: a build code analysis and refactoring tool for makefiles. In 2012 Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, pages 366–369, Sep. 2012. doi: 10.1145/2351676.2351749.
- Vakilian et al. [2015] M. Vakilian, R. Sauciuc, J. D. Morgenthaler, and V. Mirrokni. Automated decomposition of build targets. In Proceedings of the 37th International Conference on Software Engineering - Volume 1, ICSE ’15, pages 123–133, Piscataway, NJ, USA, 2015. IEEE Press. ISBN 978-1-4799-1934-5. URL http://dl.acm.org/citation.cfm?id=2818754.2818772.
- van der Burg et al. [2014] S. van der Burg, E. Dolstra, S. McIntosh, J. Davies, D. M. German, and A. Hemel. Tracing software build processes to uncover license compliance inconsistencies. In Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering, ASE ’14, pages 731–742, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-3013-8.
- Visser et al. [2016] J. Visser, S. Rigal, G. Wijnholds, and Z. Lubsen. Building Software Teams: Ten Best Practices for Effective Software Development. O’Reilly Media, Inc., 2016.
- Wang et al. [2018] K. Wang, C. Zhu, A. Celik, J. Kim, D. Batory, and M. Gligoric. Towards refactoring-aware regression test selection. In International Conference on Software Engineering, pages 233–244, 2018.
- Zhang [2018] L. Zhang. Hybrid regression test selection. In Proceedings of the 40th International Conference on Software Engineering, ICSE ’18, pages 199–209, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450356381. doi: 10.1145/3180155.3180198. URL https://doi.org/10.1145/3180155.3180198.