Categorizing Defects in Infrastructure as Code

09/21/2018 · Akond Rahman, et al. · NC State University

Infrastructure as code (IaC) scripts are used to automate the maintenance and configuration of software development and deployment infrastructure. IaC scripts can be complex in nature, containing hundreds of lines of code, leading to defects that can be difficult to debug and that can cause wide-scale system discrepancies, such as service outages. Use of IaC scripts is getting increasingly popular, yet the nature of defects that occur in these scripts has not been systematically categorized. A systematic categorization of defects can inform practitioners about process improvement opportunities to mitigate defects in IaC scripts. The goal of this paper is to help software practitioners improve their development process of infrastructure as code (IaC) scripts by categorizing the defects in IaC scripts based upon a qualitative analysis of commit messages and issue report descriptions. We mine open source version control repositories collected from four organizations, namely Mirantis, Mozilla, Openstack, and Wikimedia Commons, to conduct our research study. We use 1021, 3074, 7808, and 972 commits that map to 165, 580, 1383, and 296 IaC scripts, respectively, collected from Mirantis, Mozilla, Openstack, and Wikimedia Commons. With 89 raters we apply the defect type attribute of the orthogonal defect classification (ODC) methodology to categorize the defects. We also review prior literature that has used ODC to categorize defects, and compare the defect category distribution of IaC scripts with that of 26 non-IaC software systems. Respectively, for Mirantis, Mozilla, Openstack, and Wikimedia Commons, we observe (i) 49.3%, 36.5%, 57.6%, and 62.7% of the defects to be syntax and configuration-related; and (ii) syntax and configuration-related defects to be more prevalent amongst IaC scripts compared to that of previously-studied non-IaC software.


1 Introduction

Continuous deployment (CD) is the process of rapidly deploying software or services automatically to end-users (Rahman et al, 2015). The practice of infrastructure as code (IaC) scripts is essential to implement an automated deployment pipeline, which facilitates CD (Humble and Farley, 2010). Information technology (IT) organizations, such as Netflix (https://www.netflix.com/), Ambit Energy (https://www.ambitenergy.com/), and Wikimedia Commons (https://commons.wikimedia.org/wiki/Main_Page), use IaC scripts to automatically manage their software dependencies and construct automated deployment pipelines (Parnin et al, 2017) (Puppet, 2018) (Rahman and Williams, 2018). Commercial IaC tools, such as Ansible (https://www.ansible.com/) and Puppet (https://puppet.com/), provide multiple utilities to construct automated deployment pipelines. For example, Puppet provides the ‘sshkey resource’ to install and manage secure shell (SSH) host keys and the ‘service resource’ to manage software services automatically (Labs, 2017). Use of IaC scripts has helped IT organizations to increase their deployment frequency. For example, Ambit Energy used IaC scripts to increase its deployment frequency by a factor of 1,200 (Puppet, 2018).

Similar to software source code, IaC scripts can be complex and contain hundreds of lines of code (LOC). For example, Jiang and Adams (Jiang and Adams, 2015) have reported the median size of IaC scripts to vary between 1,398 and 2,486 LOC. Not applying traditional software engineering practices to IaC scripts can be detrimental, often leading IT organizations ‘to shoot themselves in the foot’ (Parnin et al, 2017). IaC scripts are susceptible to human errors (Parnin et al, 2017) and bad coding practices (Cito et al, 2015), and can experience frequent churn, making scripts susceptible to defects (Jiang and Adams, 2015) (Parnin et al, 2017). Defects in IaC scripts can have serious consequences, as IT organizations such as Wikimedia use these scripts to provision their development and deployment servers and ensure availability of services (https://blog.wikimedia.org/2011/09/19/ever-wondered-how-the-wikimedia-servers-are-configured/). Any defect in a script can propagate at scale, leading to wide-scale service outages. For example, in January 2017, execution of a defective IaC script erased home directories of around 270 users in cloud instances maintained by Wikimedia (https://wikitech.wikimedia.org/wiki/Incident_documentation/20170118-Labs). Prior research studies (Jiang and Adams, 2015) (Parnin et al, 2017) and the above-mentioned real-world evidence motivate us to systematically study the defects that occur in IaC scripts.

Categorization of defects can guide IT organizations on how to improve their development process. Chillarege et al. (Chillarege et al, 1992) proposed the orthogonal defect classification (ODC) technique, which included a set of defect categories. According to Chillarege et al. (Chillarege et al, 1992), each of these defect categories maps to a certain activity of the development process that needs attention. For example, the prevalence of functional defects can highlight process improvement opportunities in the design phase of the development process. Since the introduction of ODC in 1992, researchers and industry practitioners have widely used ODC to categorize defects for non-IaC software systems written in general purpose programming languages (GPLs), such as database systems (Pecchia and Russo, 2012), operating systems (Cotroneo et al, 2013), and safety-critical systems for spacecraft (Lutz and Mikulski, 2004). Unlike non-IaC software, IaC scripts use domain specific languages (DSLs) (Shambaugh et al, 2016). DSLs are fundamentally different from GPLs with respect to comprehension, semantics, and syntax (Fowler, 2010) (Voelter, 2013). A systematic categorization of defects in IaC scripts can help in understanding the nature of IaC defects, and provide actionable recommendations for practitioners to mitigate defects in IaC scripts.

The goal of this paper is to help practitioners improve their development process of infrastructure as code (IaC) scripts by categorizing the defects in IaC scripts based upon a qualitative analysis of commit messages and issue report descriptions.

We investigate the following research questions:

RQ1: What process improvement recommendations can we make for infrastructure as code development, by categorizing defects using the defect type attribute of orthogonal defect classification?

RQ2: What are the differences between infrastructure as code (IaC) and non-IaC software process improvement activities, as determined by their defect category distribution reported in the literature?

RQ3: Can the size of infrastructure as code (IaC) scripts provide a signal for improvement of the IaC development process?

RQ4: How frequently do defects occur in infrastructure as code scripts?

We use open source datasets from four organizations, Mirantis, Mozilla, Openstack, and Wikimedia Commons, to answer the four research questions. We use 1021, 3074, 7808, and 972 commits that map to 165, 580, 1383, and 296 IaC scripts, respectively, collected from Mirantis, Mozilla, Openstack, and Wikimedia Commons. With the help of 89 raters, we apply qualitative analysis to commit messages from repositories to analyze the defects that occur in IaC scripts. We apply the defect type attribute of ODC (Chillarege et al, 1992) to categorize the defects that occur in the IaC scripts. We compare the distribution of defect categories found in the IaC scripts with the categories for 26 non-IaC software systems, as reported in prior research studies, which used the defect type attribute of ODC and were collected from IEEE Xplore (https://ieeexplore.ieee.org/Xplore/home.jsp), ACM Digital Library (https://dl.acm.org/), ScienceDirect (https://www.sciencedirect.com/), and SpringerLink (https://link.springer.com/).

We list our contributions as follows:


  • A categorization of defects that occur in IaC scripts;

  • A comparison of the distribution of IaC defect categories to that of non-IaC software found in prior academic literature; and

  • A set of curated datasets in which a mapping between defect categories and IaC scripts is provided.

We organize the rest of the paper as follows: Section 2 provides background information and prior research work relevant to our paper. Section 3 describes our methodology. We use Section 4 to describe our datasets. We present our findings in Section 5. We discuss our findings in Section 6. We list the limitations of our paper in Section 7. Finally, we conclude the paper in Section 8.

2 Background and Related Work

In this section, we provide background on IaC scripts and briefly describe related academic research.

2.1 Background

IaC is the practice of automatically defining and managing network and system configurations, and infrastructure, through source code (Humble and Farley, 2010). Companies widely use commercial tools such as Puppet to implement the practice of IaC (Humble and Farley, 2010) (Jiang and Adams, 2015) (Shambaugh et al, 2016). We use Puppet scripts to construct our dataset because Puppet is considered one of the most popular tools for configuration management (Jiang and Adams, 2015) (Shambaugh et al, 2016), and has been used by companies since 2005 (McCune and Jeffrey, 2011). Typical entities of Puppet include modules and manifests (Labs, 2017). A module is a collection of manifests. Manifests are written as scripts that use a .pp extension.

In a single manifest script, configuration values can be specified using variables and attributes. Puppet provides the utility ‘class’, which can be used as a placeholder for the specified variables and attributes. For better understanding, we provide a sample Puppet script with annotations in Figure 1. For attributes, configuration values are specified using the ‘=>’ sign, whereas for variables, configuration values are provided using the ‘=’ sign. A single manifest script can contain one or multiple attributes and/or variables. In Puppet, variables store values and have no relationship with resources. Attributes describe the desired state of a resource. Similar to general purpose programming languages, code constructs such as functions/methods, comments, and conditional statements are also available for Puppet scripts.

Figure 1: Annotation of an example Puppet script.

2.2 Related Work

Our paper is related to empirical studies that have focused on IaC technologies, such as Puppet. Sharma et al. (Sharma et al, 2016) investigated smells in IaC scripts and proposed 13 implementation and 11 design configuration smells. Hanappi et al. (Hanappi et al, 2016) investigated how convergence of Puppet scripts can be automatically tested, and proposed an automated model-based test framework. Jiang and Adams (Jiang and Adams, 2015) investigated the co-evolution of IaC scripts and other software artifacts, such as build files and source code. They reported IaC scripts to experience frequent churn. In a recent work, Rahman and Williams (Rahman and Williams, 2018) characterized defective IaC scripts using text mining, and created prediction models using text feature metrics. Van der Bent et al. (van der Bent et al, 2018) proposed and validated nine metrics to detect maintainability issues in IaC scripts. Rahman et al. (Rahman et al, 2017) investigated which factors influence usage of IaC tools. In another work, Rahman et al. (Rahman et al, 2018) investigated the questions that programmers ask on Stack Overflow to identify the potential challenges programmers face while working with Puppet.

The above-mentioned studies motivate us to explore the area of IaC by taking a different stance: we analyze the defects that occur in Puppet scripts and categorize them using the defect type attribute of ODC for the purpose of informing process improvement decisions.

Our paper is also related to prior research studies that have categorized defects of software systems using ODC and non-ODC techniques. We briefly describe the related studies as follows:

Chillarege et al. (Chillarege et al, 1992) introduced the ODC technique in 1992. They proposed eight categories for the ‘defect type attribute’ that are orthogonal to each other. Since then, researchers have used the defect type attribute of ODC to categorize defects that occur in software. Duraes and Madeira (Duraes and Madeira, 2006) studied 668 faults from 12 software systems and reported that 43.4% of the 668 defects were algorithm defects and 21.9% were assignment-related defects. Freimut et al. (Freimut et al, 2005) applied the defect type attribute of ODC in an industrial setting. Fonseca et al. (Fonseca and Vieira, 2008) used ODC to categorize security defects that appear in web applications. They collected 655 security defects from six PHP web applications and reported that 85.3% of the security defects belong to the algorithm category. Zheng et al. (Zheng et al, 2006) applied ODC to telecom-based software systems, and observed that, on average, 35.4% of defects belong to the algorithm category. Lutz and Mikulski (Lutz and Mikulski, 2004) studied defect reports from seven NASA missions, and observed functional defects to be the most frequent category among 199 reported defects. Christmansson and Chillarege (Christmansson and Chillarege, 1996) studied 408 faults extracted from the IBM OS, and reported 37.7% and 19.1% of these faults to belong to the algorithm and assignment categories, respectively. Basso et al. (Basso et al, 2009) studied defects from six Java-based software systems, namely Azureus, FreeMind, Jedit, Phex, Struts, and Tomcat, and observed the most frequent category to be algorithm defects. Overall, they reported the majority of the defects to belong to two categories: algorithm and assignment. Cinque et al. (Cinque et al, 2014) analyzed logs from an industrial air traffic control system. In their paper, 58.9% of the 3,159 defects were classified as algorithm defects.

The above-mentioned studies highlight the research community’s interest in systematically categorizing the defects in different software systems. These studies focus on non-IaC software, which highlights the lack of studies that investigate defect categories in IaC and further motivates us to investigate defect categories of IaC scripts. Categorization of IaC-related defects can provide practitioners actionable recommendations on how to mitigate IaC-related defects, and improve the quality of their developed IaC scripts.

3 Categorization Methodology

In this section, we first provide definitions related to our research study, and then provide the details of the methodology we used to categorize the IaC defects.

  • Defect: An imperfection in an IaC script that needs to be replaced or repaired (IEEE, 2010).

  • Defect-related commit: A commit whose message indicates that an action was taken related to a defect.

  • Defective script: An IaC script which is listed in a defect-related commit.

  • Neutral script: An IaC script for which no defect has been found yet.

3.1 Dataset Construction

Our methodology of dataset construction involves two steps: repository collection (Section 3.1.1) and commit message processing (Section 3.1.2).

3.1.1 Repository Collection

We use open source repositories to construct our datasets. An open source repository contains valuable information about the development process of an open source project, but the project might have a short development period (Munaiah et al, 2017). This observation motivates us to apply the following selection criteria to identify repositories for mining:

  • Criteria-1: The repository must be available for download.

  • Criteria-2: At least 11% of the files belonging to the repository must be IaC scripts. Jiang and Adams (Jiang and Adams, 2015) reported that in open source repositories IaC scripts co-exist with other types of files, such as source code and Makefiles. They observed a median of 11% of the files to be IaC scripts. By using a cutoff of 11%, we expect to collect a set of repositories that contain a sufficient amount of IaC scripts for analysis.

  • Criteria-3: The repository must have at least two commits per month. Munaiah et al. (Munaiah et al, 2017) used the threshold of at least two commits per month to determine which repositories have enough activity to develop software for IT organizations. We use this threshold to filter out repositories with short development activity (a sketch of applying these criteria appears after this list).
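
The criteria above can be checked mechanically for each candidate repository. The following is a minimal illustrative sketch, not the authors' actual tooling: it assumes a locally cloned Git repository (the Mozilla repositories are Mercurial-based, so the log command would differ), and the helper names and repository path are ours. The 11% and two-commits-per-month thresholds come from Criteria-2 and Criteria-3.

```python
import os
import subprocess
from collections import Counter

def fraction_of_puppet_files(repo_dir):
    """Fraction of files in the repository that are Puppet manifests (.pp)."""
    total = puppet = 0
    for root, _dirs, files in os.walk(repo_dir):
        if ".git" in root.split(os.sep):
            continue  # skip VCS metadata
        for name in files:
            total += 1
            if name.endswith(".pp"):
                puppet += 1
    return puppet / total if total else 0.0

def commits_per_month(repo_dir):
    """Count commits per YYYY-MM bucket using `git log`."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "log", "--pretty=format:%ad", "--date=format:%Y-%m"],
        capture_output=True, text=True, check=True).stdout
    return Counter(out.splitlines())

def satisfies_criteria(repo_dir):
    months = commits_per_month(repo_dir)
    # Note: months with zero commits do not appear in the counter, so a stricter
    # check would also enumerate every calendar month in the repository's lifetime.
    return (fraction_of_puppet_files(repo_dir) >= 0.11      # Criteria-2
            and bool(months) and min(months.values()) >= 2)  # Criteria-3

print(satisfies_criteria("some-cloned-repo"))  # hypothetical local path
```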

3.1.2 Commit Message Processing

Prior research (Ray et al, 2016)(Zhang et al, 2016)(Zhang et al, 2017) leveraged open source repositories that use VCS for software defect studies. We use two artifacts from VCS of the selected repositories from Section 3.1.1, to construct our datasets: (i) commits that indicate modification of IaC scripts; and (ii) issue reports that are linked with the commits. We use commits because commits contain information on how and why a file was changed. Commits can also include links to issue reports. We use issue report summaries because they can give us more insights on why IaC scripts were changed in addition to what is found in commit messages. We collect commits and other relevant information in the following manner:

  • First, we extract commits that were used to modify at least one IaC script. A commit lists the changes made on one or multiple files (Alali et al, 2008).

  • Second, we extract the message of the commit identified from the previous step. A commit includes a message, commonly referred to as a commit message. The commit messages indicate why the changes were made to the corresponding files (Alali et al, 2008).

  • Third, if the commit message included a unique identifier that maps the commit to an issue in the issue tracking system, we extract the identifier and use that identifier to extract the summary of the issue. We use regular expressions to extract the issue identifier. We use the corresponding issue tracking API to extract the summary of the issue; and

  • Fourth, we combine the commit message with any existing issue summary to construct the message for analysis. We refer to the combined message as ‘extended commit message (XCM)’ throughout the rest of the paper. We use the extracted XCMs to categorize defects, as described in Section 3.2 (a sketch of this step appears after the list).
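
To make the third and fourth steps concrete, the sketch below pairs a commit message with the summary of any issue it references. It is illustrative only: the issue-identifier pattern and the fetch_issue_summary helper are placeholders for the tracker-specific regular expressions and APIs (e.g., Bugzilla for Mozilla) that would be used in practice.

```python
import re

# Hypothetical pattern: many of the studied commit messages reference issues
# as 'bug 869897' or 'bug 1118354'; real trackers may need additional patterns.
ISSUE_ID_PATTERN = re.compile(r"\bbug\s*#?(\d+)", re.IGNORECASE)

def fetch_issue_summary(issue_id):
    """Placeholder for a call to the corresponding issue tracker's API."""
    return ""  # a real implementation would return the issue summary text

def build_xcm(commit_message):
    """Combine a commit message with any referenced issue summaries into an XCM."""
    parts = [commit_message.strip()]
    for issue_id in ISSUE_ID_PATTERN.findall(commit_message):
        summary = fetch_issue_summary(issue_id)
        if summary:
            parts.append(summary.strip())
    return " ".join(parts)

print(build_xcm("bug 867593 use correct regex syntax"))
```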

3.2 Categorization of Infrastructure as Code (IaC) Script Defects

We use the defect type attribute of ODC to categorize defects. We select the ODC defect type attribute because ODC uses semantic information collected from the software system to categorize defects (Chillarege et al, 1992). According to the ODC defect type attribute, a defect can belong to one of the eight categories: algorithm (AL), assignment (AS), build/package/merge (B), checking (C), documentation (D), function (F), interface (I), and timing/serialization (T).

The collected XCMs derived from commits and issue report descriptions might correspond to feature enhancements or maintenance tasks, which are not related to defects. As an XCM might not correspond to a defect, we added a ‘No defect (N)’ category. Furthermore, an XCM might not belong to any of the eight categories included in the ODC defect type attribute. Hence, we introduced the ‘Other (O)’ category. We categorize the XCMs into one of these 10 categories. In the case of the eight ODC categories, we follow the criteria provided by Chillarege et al. (Chillarege et al, 1992), and we use two of our own criteria for the categories ‘No defect’ and ‘Other’. The criteria for each of the 10 categories are described in Table 1.

In Table 1, the ‘Process Improvement Activity’ column corresponds to one or multiple activities of software development which can be improved based on the defect categories determined by ODC. For example, according to ODC, algorithm-related defects should peak during the activities of coding, code inspection, unit testing, and function testing. To reduce the propagation of algorithm-related defects, software teams can invest more effort in the previously-mentioned activities. The ‘Not Applicable’ cells correspond to no applicable process improvement recommendations.

Category Criterion Process Improvement Activity
Algorithm (AL) Indicates efficiency or correctness problems that affect task and can be fixed by re-implementing an algorithm or local data structure. Coding, Code Inspection, Unit Testing, Function Testing
Assignment (AS) Indicates changes in a few lines of code. Code Inspection, Unit Testing
Build/Package/Merge (B) Indicates defects due to mistakes in change management, library systems, or VCS. Coding, Low Level Design
Checking (C) Indicates defects related to data validation and value checking. Coding, Code Inspection, Unit Testing
Documentation (D) Indicates defects that affect publications and maintenance notes. Coding, Publication
Function (F) Indicates defects that affect significant capabilities. Design, High Level Design Inspection, Function Testing
Interface (I) Indicates defects in interacting with other components, modules, or control blocks. Coding, System Testing
No Defect (N) Indicates no defects. Not Applicable
Other (O) Indicates a defect that does not belong to the categories: AL, AS, B, C, D, F, I, N, and T. Not Applicable
Timing/Serialization (T) Indicates errors that involve real time resources and shared resources. Low Level Design, Low Level Design Inspection, System Testing
Table 1: Criteria for Determining Defect Categories Based on ODC, adapted from (Chillarege et al, 1992)

We perform qualitative analysis on the collected XCMs to determine the category to which a commit belongs. Qualitative analysis provides the opportunity to increase the quality of the constructed dataset (Herzig et al, 2013). For performing qualitative analysis, we had raters with software engineering experience apply the 10 categories stated in Table 1 on the collected XCMs. We record the amount of time they took to perform the categorization.

We conduct the qualitative analysis in the following manner:

  • Categorization Phase: We randomly distribute the XCMs so that each XCM is reviewed by at least two raters. We adopt this approach to mitigate the subjectivity introduced by a single rater. Each rater determines the category of an XCM using the 10 categories presented in Table 1. We provide raters with an electronic handbook on IaC (Labs, 2017) and the ODC publication (Chillarege et al, 1992). We do not provide any time constraint for the raters to categorize the defects. We record the agreement level amongst raters using two techniques: (a) by counting the XCMs for which the raters had the same rating; and (b) by computing the Cohen’s Kappa score (Cohen, 1960).

  • Resolution Phase: For any XCM, raters can disagree on the identified category. In these cases, we use an additional rater’s opinion to resolve such disagreements. We refer to the additional rater as the ‘resolver’.

  • Practitioner Agreement: To evaluate the ratings in the categorization and the resolution phase, we randomly select 50 XCMs for each dataset. We contact the practitioners who authored the commit messages via e-mail. We ask the practitioners if they agree with our categorization of the XCMs. High agreement between the raters’ categorization and the practitioners’ feedback is an indication of how well the raters performed. The percentage of XCMs with which practitioners agreed is recorded, and the Cohen’s Kappa score is computed (see the sketch after this list).
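
As an illustration of the agreement measures used in the categorization and practitioner-agreement steps, the sketch below computes raw agreement and Cohen's Kappa for two raters' labels over the same XCMs. The labels are invented, and scikit-learn's cohen_kappa_score is one possible way to compute the statistic, not necessarily the one the authors used.

```python
from sklearn.metrics import cohen_kappa_score

# Invented example labels over six XCMs, using the category codes from Table 1.
rater_1 = ["AS", "AS", "C", "N", "B", "AS"]
rater_2 = ["AS", "AL", "C", "N", "AS", "AS"]

# Technique (a): fraction of XCMs with identical ratings.
raw_agreement = sum(a == b for a, b in zip(rater_1, rater_2)) / len(rater_1)
# Technique (b): chance-corrected agreement.
kappa = cohen_kappa_score(rater_1, rater_2)

print(f"raw agreement: {raw_agreement:.2f}, Cohen's Kappa: {kappa:.2f}")
```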

After applying qualitative analysis based on the 10 defect categories, we find a mapping between each XCM and a category. We list an example XCM that belongs to each of the 10 defect categories in Table 2. We refer to an XCM which is not categorized as ‘N’ as a ‘defect message’ throughout the paper. For example, an XCM categorized as ‘AS’ indicates that the corresponding commit is related to the ‘Assignment’ category, and that the IaC scripts modified in this commit contain an assignment-related defect.

From the defect messages, we identify the defect-related commits. Each defect-related commit corresponds to a defect in a script. From the defect-related commits, we determine which IaC scripts are defective, similar to prior work (Zhang et al, 2016). We determine a script as defective if it contains at least one defect. Defect-related commits list which IaC scripts were changed, and from this list we determine which IaC scripts are defective.

Category Mirantis Mozilla Openstack Wikimedia
Algorithm fixing deeper hash merging for firewall bug 869897: make watch_devices.sh logs no longer infinitely growing; my thought is logrotate.d but open to other choices here fix middleware order of proxy pipeline and add missing modules this patch fixes the order of the middlewares defined in the swift proxy server pipeline nginx service should be stopped and disabled when nginx is absent
Assignment fix syntax errors this commit removes a couple of extraneous command introduced by copy/past errors bug 867593 use correct regex syntax resolved syntax error in collection fix missing slash in puppet file url
Build/Package/Merge params should have been imported; it was only working in tests by accident bug 774638-concat::setup should depend on diffutils fix db_sync dependencies: this patch adds dependencies between the cinder-api and cinder-backup services to ensure that db_sync is run before the services are up. change-id: i7005 fix varnish apt dependencies these are required for the build it does. also; remove unnecessary package requires that are covered by require_package
Checking ensure we have iso/ directory in jenkins workspace we share iso folder for storing fuel isos which are used for fuel-library tests bug 1118354: ensure deploystudio user uid is 500 fix check on threads fix hadoop-hdfs-zkfc-init exec unless condition $zookeeper_hosts_string was improperly set; since it was created using a non-existent local var ’@zoookeeper_hosts’
Documentation fixed comments on pull. bug 1253309 - followup fix to review comments fix up doc string for workers variable change-id:ie886 fix hadoop.pp documentation default
Function fix for jenkins swarm slave variables bug 1292523-puppet fails to set root password on buildduty-tools server make class rally work class rally is created initially by tool fix ve restbase reverse proxy config move the restbase domain and ‘v1’ url routing bits into the apache config rather than the ve config.
Interface fix for iso build broken dependencies bug 859799-puppetagain buildbot masters won’t reconfig because of missing sftp subsystem update all missing parameters in all manifests fix file location for interfaces
No Defect added readme merge bug 1178324 from default add example for cobbler new packages for the server
Other fuel-stats nginx fix summary: bug 797946: minor fixups; r=dividehex minor fixes fix nginx configpackageservice ordering
Timing/Serialization fix /etc/timezone file add newline at end of file /etc/timezone bug 838203-test_alerts.html times out on ubuntu 12.04 vm fix minimal available memory check change-id:iaad0 fix hhvm library usage race condition ensure that the hhvm lib ‘current’ symlink is created before setting usrbinphp to point to usrbinhhvm instead of usrbinphp5. previously there was a potential for race conditions due to resource ordering rules
Table 2: Example of Extended Commit Messages (XCMs) for Defect Categories

3.3 RQ1: What process improvement recommendations can we make for infrastructure as code development, using the defect type attribute of orthogonal defect classification?

By answering RQ1 we focus on identifying the defect categories that occur in IaC scripts. We use the 10 categories stated in Table 1 to identify the defect categories. We answer RQ1 using the categorization achieved through qualitative analysis and by reporting the count of defects that belong to each defect category. We use the metric Defect Count for Category (DCC), calculated using Equation 1.

\text{DCC}(c) = \frac{\text{count of defects that belong to category } c}{\text{total count of defects}} \times 100 \qquad (1)

Answers to RQ1 will give us an overview on the distribution of defect categories for IaC scripts.
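
A minimal sketch of the DCC computation follows, assuming each defect-related XCM has already been assigned one category code from Table 1 and that DCC is expressed as the percentage of defects in each category, which is how the values are reported in Section 5.1. The example labels are invented.

```python
from collections import Counter

def dcc(categories):
    """Percentage of defects that fall in each category (Equation 1)."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {cat: 100.0 * n / total for cat, n in counts.items()}

# Invented toy input: category labels of defect-related XCMs for one dataset.
print(dcc(["AS", "AS", "C", "AS", "AL", "B", "AS", "C"]))
```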

3.4 RQ2: What are the differences between infrastructure as code (IaC) and non-IaC software process improvement activities, as determined by their defect category distribution reported in the literature?

We answer RQ2 by identifying prior research that has used the defect type attribute of ODC to categorize defects in other systems, such as safety critical systems (Lutz and Mikulski, 2004) and operating systems (Christmansson and Chillarege, 1996). We collect the necessary publications based on the following selection criteria:

  • Step-1: The publication must cite Chillarege et al. (Chillarege et al, 1992)’s ODC publication, be indexed by ACM Digital Library, IEEE Xplore, SpringerLink, or ScienceDirect, and have been published on or after 2000. By selecting publications from 2000 onwards, we expect to obtain defect categories of software systems that are relevant and comparable against modern software systems such as IaC.

  • Step-2: The publication must use the defect type attribute of ODC in its original form to categorize defects of a software system. A publication might cite Chillarege et al. (Chillarege et al, 1992)’s paper as related work, and not use the ODC defect type attribute for categorization. A publication might also modify the ODC defect type attribute to form more defect categories, and use the modified version of ODC to categorize defects. As we use the original eight categories of ODC’s defect type attribute, we do not include publications that modified or extended the defect type attribute of ODC to categorize defects.

  • Step-3: The publication must explicitly report the software systems they studied with a distribution of defects across the ODC defect type categories, along with the total count of bugs/defects/faults for each software system. Along with defects, we consider bugs and faults, as in prior work researchers have used bugs (Thung et al, 2012) and faults interchangeably with defects (Pecchia and Russo, 2012).

Our answer to RQ2 provides a list of software systems with a distribution of defects categorized using the defect type attribute of ODC. For each software system, we report the main programming language in which it was built, and the reference publication.

3.5 RQ3: Can the size of infrastructure as code (IaC) scripts provide a signal for improvement of the IaC development process?

Researchers in prior work (Moller and Paulish, 1993) (Fenton and Ohlsson, 2000) (Hatton, 1997) have observed relationships between source code size and defects for software written in GPLs. A similar investigation for IaC scripts can provide valuable insights, and may help practitioners in identifying scripts that contain defects or specific categories of defects. We investigate if the size of IaC scripts is related to defective scripts by executing the following two steps:


  • First, we quantify if defective and neutral scripts in the constructed datasets significantly differ in size. We calculate the size of each defective and neutral script by measuring LOC. Next, we apply statistical measurements between the two groups: defective and neutral scripts. The statistical measurements are the Mann-Whitney U test (Mann and Whitney, 1947) and effect size calculation with Cliff’s Delta (Cliff, 1993). Both the Mann-Whitney U test and Cliff’s Delta are non-parametric. The Mann-Whitney U test states if one distribution is significantly larger or smaller than the other, whereas effect size using Cliff’s Delta measures how large the difference is (a sketch of these measurements appears after this list).

    Following convention, we report a distribution to be significantly larger than the other if p < 0.05. We use Romano et al.’s recommendations to interpret the observed Cliff’s Delta values. According to Romano et al. (Romano et al, 2006), the difference between two groups is ‘large’ if Cliff’s Delta is greater than 0.47. A Cliff’s Delta value between 0.33 and 0.47 indicates a ‘medium’ difference. A Cliff’s Delta value between 0.14 and 0.33 indicates a ‘small’ difference. Finally, a Cliff’s Delta value less than 0.14 indicates a ‘negligible’ difference.

    Upon completion of this step we will (i) identify if the size of IaC scripts is significantly different between defective and neutral scripts, (ii) identify how large the difference is using effect size, and (iii) report median values for defective and neutral scripts.

  • Second, we compare how the size of IaC scripts varies between IaC defect categories by calculating the LOC of scripts that belong to each of the nine defect categories reported in Table 1. We apply a variant of the Scott-Knott (SK) test to compare whether the defect categories significantly vary from each other with respect to size. This variant of SK does not assume the input to have a normal distribution and accounts for negligible effect size (Tantithamthavorn et al, 2017). SK uses hierarchical cluster analysis to partition the input data into significantly (p < 0.05) distinct ranks (Tantithamthavorn et al, 2017). Using the SK test, we can determine if the size of scripts in one defect category is significantly higher than in the other categories. As a hypothetical example, if defect category ‘AS’ ranks higher than ‘AL’, then we can state that IaC scripts that have ‘AS’-related defects are significantly larger than scripts with ‘AL’-related defects. Along with reporting the SK ranks, we also present the distribution of script sizes in the form of boxplots.
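
The sketch below shows one way to carry out the first step's measurements with SciPy. SciPy has no built-in Cliff's Delta, so a direct implementation is included; the LOC values are invented, and this is not necessarily the exact tooling the authors used.

```python
from itertools import product
from scipy.stats import mannwhitneyu

def cliffs_delta(xs, ys):
    """Cliff's Delta: P(x > y) - P(x < y) over all pairs (x, y)."""
    gt = sum(x > y for x, y in product(xs, ys))
    lt = sum(x < y for x, y in product(xs, ys))
    return (gt - lt) / (len(xs) * len(ys))

# Invented LOC values for defective and neutral scripts.
defective_loc = [120, 340, 95, 210, 510, 60]
neutral_loc = [40, 75, 130, 55, 90, 35]

stat, p_value = mannwhitneyu(defective_loc, neutral_loc, alternative="two-sided")
delta = cliffs_delta(defective_loc, neutral_loc)
print(f"U={stat}, p={p_value:.3f}, Cliff's Delta={delta:.2f}")
```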

3.6 RQ4: How frequently do defects occur in infrastructure as code scripts?

We answer RQ4 by quantifying the defect density and temporal frequency of IaC scripts. We calculate defect density by counting the defects that appear per 1000 lines of code (KLOC) of IaC script, similar to prior work (Battin et al, 2001) (Mohagheghi et al, 2004) (Hatton, 1997). We use Equation 2 to calculate defect density.

\text{Defect Density} = \frac{\text{total count of defects}}{\text{total LOC of IaC scripts} / 1000} \qquad (2)

Equation 2 gives us the defect density of the four IaC datasets and an assessment of how frequently defects occur in IaC scripts. We select this measure of defect density, as it has been used as an industry standard to (i) establish a baseline for defect frequency; and (ii) assess the quality of the software (Harlan, 1987) (McConnell, 2004).
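
Purely as an illustration of the arithmetic, plugging the Mirantis figures from Table 3 into Equation 2 (344 defect-related commits, each treated as one defect as described in Section 3.2, over 17,564 LOC of Puppet code) gives roughly 19.6 defects per KLOC; the short sketch below does the same computation.

```python
def defect_density(defect_count, total_loc):
    """Defects per 1,000 lines of IaC code (Equation 2)."""
    return defect_count / (total_loc / 1000.0)

# Mirantis figures from Table 3, used purely as a worked example.
print(round(defect_density(344, 17564), 1))  # ~19.6 defects/KLOC
```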

Temporal frequency provides an overview of how frequently defects appear over time. We compute the proportion of defects that occur every month to quantify the temporal frequency of IaC defects. We use the metric ‘Overall Temporal Frequency’ and calculate this metric using Equation 3:

\text{Overall Temporal Frequency}(m) = \frac{\text{count of defects identified in month } m}{\text{total count of defects}} \times 100 \qquad (3)

For Overall Temporal Frequency, we apply the Cox-Stuart test (Cox and Stuart, 1955) to determine if the exhibited trend is significantly increasing or decreasing. The Cox-Stuart test is a statistical technique that compares the earlier data points to the later data points in a time series to determine whether or not the trend observed in the time series data is increasing or decreasing with statistical significance. We use a 95% statistical confidence level to determine whether the temporal frequencies exhibit increasing or decreasing trends. To determine temporal frequencies for both overall defects and category-wise defects, we apply the following strategy (a sketch of the Cox-Stuart test appears after the list):


  • if the Cox-Stuart test output states the temporal frequency values are ‘increasing’ with a p-value < 0.05, we determine the temporal trend to be ‘increasing’.

  • if the Cox-Stuart test output states the temporal frequency values are ‘decreasing’ with a p-value < 0.05, we determine the temporal trend to be ‘decreasing’.

  • if we cannot determine the temporal trend to be ‘increasing’ or ‘decreasing’, then we determine the temporal trend to be ‘consistent’.
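
The Cox-Stuart test itself is simple enough to sketch: it pairs each point in the first half of the series with the corresponding point in the second half and applies a sign test to the differences. The sketch below is a minimal illustration, assuming an evenly spaced monthly series of temporal frequency values (the numbers are invented); it is not necessarily the exact implementation the authors used.

```python
from scipy.stats import binomtest

def cox_stuart(series, alpha=0.05):
    """Classify a time series as 'increasing', 'decreasing', or 'consistent'."""
    half = len(series) // 2
    # Pair earlier points with later points (middle point dropped for odd lengths).
    diffs = [late - early for early, late in zip(series[:half], series[-half:])]
    signs = [d for d in diffs if d != 0]  # ties carry no trend information
    if not signs:
        return "consistent"
    pos = sum(d > 0 for d in signs)
    p = binomtest(pos, n=len(signs), p=0.5).pvalue  # two-sided sign test
    if p < alpha:
        return "increasing" if pos > len(signs) / 2 else "decreasing"
    return "consistent"

# Invented monthly temporal frequency values exhibiting an upward trend.
print(cox_stuart([1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8]))  # prints 'increasing'
```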

4 Datasets

We construct our datasets using Puppet scripts from open source repositories maintained by four organizations: Mirantis, Mozilla, Openstack, and Wikimedia Commons. We select Puppet because it is considered one of the most popular tools to implement IaC (Jiang and Adams, 2015) (Shambaugh et al, 2016), and has been used by IT organizations since 2005 (McCune and Jeffrey, 2011). Mirantis is an organization that focuses on the development and support of cloud services such as OpenStack (https://www.mirantis.com/). Mozilla is an open source software community that develops, uses, and supports Mozilla products such as Mozilla Firefox (https://www.mozilla.org/en-US/). Openstack is an open-source software platform for cloud computing, in which virtual servers and other resources are made available to customers (https://www.openstack.org/). The Wikimedia Foundation is a non-profit organization that develops and distributes free educational content (https://wikimediafoundation.org/).

4.1 Repository Collection

We apply the three selection criteria presented in Section 3.1.1 to identify the repositories that we use for analysis. We describe how many of the repositories satisfied each of the three criteria as following:

  • Criteria-1: Altogether, 26, 1594, 1253, and 1638 repositories were publicly available to download for Mirantis, Mozilla, Openstack, and Wikimedia Commons, respectively. We download the repositories from their corresponding online project management systems [Mirantis (Mirantis, 2018), Mozilla (Mozilla, 2018), Openstack (Openstack, 2018), and Wikimedia (Commons, 2018)]. The Mozilla repositories were Mercurial-based, whereas, Mirantis, Openstack and Wikimedia repositories were Git-based.

  • Criteria-2: For Criteria-2, we stated that at least 11% of all the files belonging to the repository must be Puppet scripts. For Mirantis, 20 of the 26 repositories satisfied Criteria-2. For Mozilla, 2 of the 1,594 repositories satisfied Criteria-2. For Openstack, 61 of the 1,253 repositories satisfied Criteria-2. For Wikimedia Commons, 11 of the 1,638 repositories satisfied Criteria-2. Altogether, 74 of the 4,485 repositories satisfy Criteria-2, indicating that the amount of IaC scripts is small compared to the organizations’ overall codebase.

  • Criteria-3: As Criteria-3 we stated that the repository must have at least two commits per month. The 20, 2, 61, and 11 selected repositories respectively, for Mirantis, Mozilla, Openstack, and Wikimedia Commons that satisfy Criteria-2 also satisfy Criteria-3.

We perform our analysis on 20, 2, 61, and 11 repositories from Mirantis, Mozilla, Openstack, and Wikimedia Commons. We refer to the datasets from Mirantis, Mozilla, Openstack, and Wikimedia Commons as ‘Mirantis’, ‘Mozilla’, ‘Openstack’, and ‘Wikimedia’, respectively.

4.2 Commit Message Processing

For Mirantis, Mozilla, Openstack, and Wikimedia, we respectively collect 2749, 60992, 31460, and 14717 commits from 20, 2, 61, and 11 repositories. Of these 2749, 60992, 31460, and 14717 commits, respectively, 1021, 3074, 7808, and 972 commits modify at least one Puppet script. As shown in Table 3, for Mirantis we collect 165 Puppet scripts that map to 1,021 commits from the 20 repositories. For Mozilla we collect 580 Puppet scripts that map to 3,074 commits from the two repositories. For Openstack, we collect 1,383 Puppet scripts that map to 7,808 commits from the 61 repositories. For Wikimedia, we collect 296 Puppet scripts that map to 972 commits from the 11 repositories. Of the 3074, 7808, and 972 commits, respectively, 2764, 2252, and 210 commit messages included unique identifiers that map to issues in their respective issue tracking systems. Using these unique identifiers to issue reports, we construct the XCMs.

4.3 Determining Categories of Defects

Altogether, we had 89 raters who determined the categories of defect-related commits. Of the 89 raters, three are PhD students and co-authors of this paper. The remaining 86 raters are students recruited from a graduate course related to software engineering, titled ‘Software Security’. To recruit the 86 graduate students we follow the Institutional Review Board (IRB) Protocol (IRB#9521 and IRB#1230). We use the 89 raters to categorize the XCMs in the following phases:

  • Categorization Phase:

    • Mirantis: We recruit students in a graduate course related to software engineering titled ‘Software Security’ via e-mail. The number of students in the class was 58, and 32 students agreed to participate. We follow the Institutional Review Board (IRB) protocol, IRB#12130, in the recruitment of students and the assignment of defect categorization tasks. We randomly distribute the 1021 XCMs amongst the students such that each XCM is rated by at least two students. The average professional experience of the 32 students in software engineering is two years. On average, each student took 2.1 hours.

    • Mozilla: The second and third authors of the paper separately apply qualitative analysis on 3,074 XCMs. The second and third authors, respectively, have professional experience of three and two years in software engineering. The second and third authors respectively took 37.0 and 51.2 hours to complete the categorization.

    • Openstack: The second and fourth authors of the paper separately apply qualitative analysis on 7,808 XCMs from Openstack repositories. The second and fourth authors, respectively, have professional experience of two years and one year in software engineering. The second and fourth authors completed the categorization of the 7,808 XCMs in 80.0 and 130.0 hours, respectively.

    • Wikimedia: 54 graduate students recruited from the ‘Software Security’ course are the raters. We randomly distribute the 972 XCMs amongst the students such that each XCM is rated by at least two students. According to our distribution, 140 XCMs are assigned to each student. The average professional experience of the 54 students in software engineering is 2.3 years. On average, each student took 2.1 hours to categorize the 140 XCMs.

  • Resolution Phase:

    • Mirantis: Of the 1,021 XCMs, we observe agreement for 509 XCMs and disagreement for 512 XCMs, with a Cohen’s Kappa score of 0.21. Based on Cohen’s Kappa score, the agreement level is ‘fair’ (Landis and Koch, 1977).

    • Mozilla: Of the 3,074 XCMs, we observe agreement for 1,308 XCMs and disagreement for 1,766 XCMs, with a Cohen’s Kappa score of 0.22. Based on Cohen’s Kappa score, the agreement level is ‘fair’ (Landis and Koch, 1977).

    • Openstack: Of the 7,808 XCMs, we observe agreement for 3,188 XCMs, and disagreements for 4,620 XCMs. The Cohen’s Kappa score was 0.21. Based on Cohen’s Kappa score, the agreement level is ‘fair’ (Landis and Koch, 1977).

    • Wikimedia: Of the 972 XCMs, we observe agreement for 415 XCMs, and disagreements for 557 XCMs, with a Cohen’s Kappa score of 0.23. Based on Cohen’s Kappa score, the agreement level is ‘fair’ (Landis and Koch, 1977).

    The first author of the paper is the resolver and resolves disagreements for all four datasets. We observe the raters’ agreement level to be ‘fair’ for all four datasets. One possible explanation can be that the raters agreed on whether an XCM is defect-related, but disagreed on the category of the defect. However, fair or poor agreement amongst raters is not uncommon for defect categorization. Henningsson et al. (Henningsson and Wohlin, 2004) also reported a low agreement amongst raters.

    Practitioner Agreement: We report the agreement level between the raters’ and the practitioners’ categorization for the 50 randomly selected XCMs as follows:

    • Mirantis: We contact three programmers and all of them responded. We observe an 89.0% agreement with a Cohen’s Kappa score of 0.8. Based on the Cohen’s Kappa score, the agreement level is ‘substantial’ (Landis and Koch, 1977).

    • Mozilla: We contact six programmers and all of them responded. We observe a 94.0% agreement with a Cohen’s Kappa score of 0.9. Based on Cohen’s Kappa score, the agreement level is ‘almost perfect’ (Landis and Koch, 1977).

    • Openstack: We contact 10 programmers and all of them responded. We observe a 92.0% agreement with a Cohen’s Kappa score of 0.8. Based on Cohen’s Kappa score, the agreement level is ‘substantial’ (Landis and Koch, 1977).

    • Wikimedia: We contact seven programmers and all of them responded. We observe a 98.0% agreement with a Cohen’s Kappa score of 0.9. Based on Cohen’s Kappa score, the agreement level is ‘almost perfect’ (Landis and Koch, 1977).

We observe that the agreement between our categorization and the practitioners’ varies from 0.8 to 0.9, which is higher than the agreement between the raters in the Categorization Phase. One possible explanation can be related to how the resolver resolved the disagreements. The first author of the paper has industry experience in writing IaC scripts, which may help in determining categorizations that are consistent with practitioners, potentially leading to higher agreement. Another possible explanation can be related to the sample provided to the practitioners: the provided sample, even though randomly selected, may include commit messages whose categorizations are relatively easy to agree upon.

Finally, upon applying qualitative analysis we identify the category of each XCM. We mark XCMs that do not have the category ‘N’ as defect-related commits. The defect-related commits list the changed IaC scripts, which we use to identify the defective IaC scripts. We present the count of defect-related commits and defective IaC scripts in Table 3. We observe that for Mozilla, 18.1% of the commits are defect-related, even though 89.9% of the commits included identifiers to issues. According to our qualitative analysis, for Mozilla, issue reports exist that are not related to defects, such as support for new features (https://bugzilla.mozilla.org/show_bug.cgi?id=868974) and installation issues (https://bugzilla.mozilla.org/show_bug.cgi?id=773931). In the case of Openstack and Wikimedia, respectively, 28.8% and 16.4% of the Puppet-related commits include identifiers that map to issues.

Properties Dataset
Mirantis (MIR) Mozilla (MOZ) Openstack (OST) Wikimedia (WIK)
Time Period May 2010-Jul 2017 Aug 2011-Sep 2016 Mar 2011-Sep 2016 Apr 2005-Sep 2016
Puppet Commits 1021 of 2749, 37.1% 3074 of 60992, 5.0% 7808 of 31460, 24.8% 972 of 14717, 6.6%
Puppet Code Size (LOC) 17,564 30,272 122,083 17,439
Defect-related Commits 344 of 1021, 33.7% 558 of 3074, 18.1% 1987 of 7808, 25.4% 298 of 972, 30.6%
Defective Puppet Scripts 91 of 165, 55.1% 259 of 580, 44.6% 810 of 1383, 58.5% 161 of 296, 54.4%
Table 3: Defect Datasets Constructed for the Paper

4.4 Dataset Availability

The constructed datasets used for our empirical analysis are available online (https://doi.org/10.6084/m9.figshare.6465215).

5 Results

In this section, we provide empirical findings by answering the four research questions:

5.1 RQ1: What process improvement recommendations can we make for infrastructure as code development, using the defect type attribute of orthogonal defect classification?

We answer RQ1 by first presenting the values for defect count per category (DCC) that belong to each defect category mentioned in Table 1. In Figure 2, we report the DCC values for the four datasets.

Figure 2: Defect count per category (DCC) for each defect category: Algorithm (AL), Assignment (AS), Build/Package/Merge (B), Checking (C), Documentation (D), Function (F), Interface (I), Other (O), and Timing/Serialization (T). Defects categorized as ‘Assignment’ are the dominant category.

In Figure 2, the x-axis presents the nine defect categories, whereas the y-axis presents the DCC values for each category. As shown in Figure 2, assignment-related defects account for 49.3%, 36.5%, 57.6%, and 62.7% of the defects for Mirantis, Mozilla, Openstack, and Wikimedia, respectively. Together, the two categories assignment and checking account for 55.9%, 53.5%, 64.2%, and 74.8% of the defects for Mirantis, Mozilla, Openstack, and Wikimedia, respectively.

In short, we observe assignment-related defects (AS) to be the dominant defect category for all four datasets, followed by checking-related defects (C). One possible explanation can be related to how practitioners utilize IaC scripts. IaC scripts are used to manage configurations and deployment infrastructure automatically (Humble and Farley, 2010). For example, practitioners use IaC to provision cloud instances such as Amazon Web Services (AWS) (Cito et al, 2015), or to manage dependencies of software (Humble and Farley, 2010). When assigning configurations of library dependencies or provisioning cloud instances, programmers might inadvertently introduce defects that need fixing. Fixing these defects involves a few lines of code in IaC scripts, and such defects fall in the assignment category. Correcting syntax issues also involves a few lines of code, and also belongs to the assignment category. Examples of defect-related XCMs that belong to the assignment category are: ‘fix gdb version to be more precise’ and ‘bug 867593 use correct regex syntax’. Another possible explanation can be related to the declarative nature of Puppet scripts. Puppet provides syntax to declare and assign configurations for the system of interest. While assigning these configuration values, programmers may inadvertently introduce defects.

Our findings are consistent with prior research related to defect categorization. In the case of machine learning software such as Apache Mahout (http://mahout.apache.org/), Apache Lucene (https://lucene.apache.org/core/), and OpenNLP (https://opennlp.apache.org/), which are dedicated to executing machine learning algorithms, researchers have observed that the most frequently occurring defects are algorithm-related (Thung et al, 2012). For database software, Sullivan et al. (Sullivan and Chillarege, 1991) have observed that the defect distribution is dominated by assignment and checking-related defects. Chillarege et al. (Chillarege et al, 1992) found Sullivan et al. (Sullivan and Chillarege, 1991)’s observations ‘reasonable’, as a few lines of code for database software typically miss a condition or assign a wrong value.

As shown in Figure 2, for the category Other (O), the defect count per category (DCC) is 12.5%, 9.4%, 6.7%, and 0.8% for Mirantis, Mozilla, Openstack, and Wikimedia, respectively. This category includes XCMs that correspond to a defect, but for which the raters were not able to identify the category of the defect. Examples of such defect-related commit messages include ‘minor puppetagain fixes’ and ‘summary fix hg on osx; a=bustage’. One possible explanation can be attributed to the lack of information content provided in the messages. Programmers may not strictly adhere to the practice of writing good commit messages, which eventually leads to commit messages that do not provide enough information for defect categorization. For example, the commit message ‘minor puppetagain fixes’ implies that a programmer performed a fix-related action on an IaC script, but what category of defect was fixed remains unknown. We observe that organization-based guidelines on how to write better commit messages exist for Mozilla (http://tiny.cc/moz-commit), Openstack (https://wiki.openstack.org/wiki/GitCommitMessages), and Wikimedia (https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines). Our findings suggest that, despite the availability of commit message guidelines, programmers do not always adhere to these guidelines.

Another possible explanation can be attributed to the lack of context inherent in commit messages. The commit messages provide a summary of the changes being made, but that might not be enough to determine the defect category. Let us consider two examples in this regard, provided in Figures 3(a) and 3(b), which respectively present two XCMs categorized as ‘Other’, obtained from Mozilla and Openstack. In Figure 3(a), we observe that the commit adds a newline for printing purposes, which is not captured in the commit message ‘summary: bug 746824: minor fixes’. From Figure 3(b), we observe that the commit replaces ‘stdlib::safe_package’ with the ‘package’ syntax, which is not captured by the corresponding commit message ‘minor fixes’.

Figure 3: Code changes in commits categorized as ‘Other’. Panels (a) and (b) respectively present the code changes for two commits marked as ‘Other’, obtained from Mozilla and Openstack.

Recommendations for IaC Development: Chillarege et al. (Chillarege et al, 1992) have reported that each of the defect categories determined by the defect type attribute of ODC provides practitioners the opportunity to improve their software development process. According to Chillarege et al. (Chillarege et al, 1992), each defect category is related to one or multiple activities of software development. For example, function-related defects are related to the design and functional testing activities of software development. Prevalence of defects in a certain category indicates in which software development activities IT organizations can invest more effort. Our findings indicate that assignment-related defects are the most dominant category of defects. According to Chillarege et al. (Chillarege et al, 1992), if assignment-related defects are not discovered early through code inspection and unit tests, these defects can continue to grow at later stages of development. Based on our findings, IT organizations can reap process improvement opportunities if they allocate more code inspection and unit testing effort.

Our findings may have implications for prioritizing verification and validation (V&V) efforts as well. Ideally, we would expect IT organizations to allocate sufficient V&V effort to detect all categories of defects. But in the real world, V&V efforts are limited. If practitioners want to prioritize their V&V efforts, then they can first focus on assignment-related defects, as this category of defects is dominant for all four organizations.

Assignment is the most frequently occurring defect category for all four datasets: Mirantis, Mozilla, Openstack, and Wikimedia. For Mirantis, Mozilla, Openstack, and Wikimedia, respectively, 49.3%, 36.5%, 57.6%, and 62.7% of the defects belong to the assignment category. Based on our findings, we recommend that practitioners allocate more effort to code inspection and unit testing.

5.2 RQ2: What are the differences between infrastructure as code (IaC) and non-IaC software process improvement activities, as determined by their defect category distribution reported in prior literature?

We identify 26 software systems using the three steps outlined in Section 3.4:

  • Step-1: As of August 11, 2018, 818 publications indexed by ACM Digital Library or IEEE Xplore or SpringerLink or ScienceDirect, cited the original ODC publication (Chillarege et al, 1992). Of the 818 publications, 674 publications were published on or after 2000.

  • Step-2: Of these 674 publications, 16 applied the defect type attribute of ODC in its original form to categorize defects for software systems.

  • Step-3: Of these 16 publications, seven publications explicitly mentioned the total count of defects and provided a distribution of defect categories.

In Table 4, we present the categorization of defects for these 26 software systems. The ‘System’ column reports the studied software system, followed by the publication reference in which the findings were reported. The ‘Count’ column reports the total count of defects that were studied for the software system. The next eight columns respectively present the eight defect categories used in the defect type attribute of ODC: algorithm (AL), assignment (AS), build/package/merge (B), checking (C), documentation (D), function (F), interface (I), and timing (T). We do not report the ‘Other’ category, as this category is not included as part of the ODC defect type attribute. The ‘Lang.’ column presents the programming language in which the system is developed. The dominant defect category is highlighted in bold.

From Table 4, we observe that for four of the 26 software systems, 40% or more of the defects belong to the assignment category. For 15 of the 26 software systems, algorithm-related defects are dominant. We also observe documentation and timing-related defects to rarely occur in previously studied software systems. In contrast to IaC scripts, assignment-related defects are not prevalent: assignment-related defects were the dominant category for only 3 of the 26 previously studied non-IaC software systems. We observe IaC scripts to have a different defect category distribution than non-IaC software systems written in GPLs; the dominant defect category for IaC scripts is assignment.

The differences in defect category distribution may yield a different set of guidelines for software process improvement. For the 15 software systems where algorithm-related defects are dominant, based on ODC, software process improvement efforts can be focused on the following activities: coding, code inspection, unit testing, and functional testing. In the case of IaC scripts, as previously discussed, software process improvement efforts can be focused on code inspection and unit testing, which differs from the recommendation for non-IaC systems.

Defects categorized as assignment are more prevalent amongst IaC scripts than amongst previously studied non-IaC systems. Our findings suggest that process improvement activities for IaC development will differ from those for non-IaC software development.

System Lang. Count AL(%) AS(%) B(%) C(%) D(%) F(%) I(%) T(%)
Bourne Again Shell (BASH) (Cotroneo et al, 2013) C 2 0.0 100.0 0.0 0.0 0.0 0.0 0.0 0.0
ZSNES-Emulator for x86 (Cotroneo et al, 2013) C, C++ 3 33.3 66.7 0.0 0.0 0.0 0.0 0.0 0.0
Pdftohtml-Pdf to html converter (Cotroneo et al, 2013) Java 20 40.0 55.0 0.0 5.0 0.0 0.0 0.0 0.0
Firebird-Relational DBMS (Cotroneo et al, 2013) C++ 2 0.0 50.0 0.0 50.0 0.0 0.0 0.0 0.0
Air flight application (Lyu et al, 2003) C 426 19.0 31.9 0.0 14.0 0.0 33.8 1.1 0.0
Apache web server (Pecchia and Russo, 2012) C 1,101 47.6 26.4 0.0 12.8 0.0 0.0 12.9 0.0
Joe-Tex editor (Cotroneo et al, 2013) C 78 15.3 25.6 0.0 44.8 0.0 0.0 14.1 0.0
Middleware system for air traffic control (Cinque et al, 2014) C 3,159 58.9 24.5 0.0 1.7 0.0 0.0 14.8 0.0
ScummVM-Interpreter for adventure engines (Cotroneo et al, 2013) C++ 74 56.7 24.3 0.0 8.1 0.0 6.7 4.0 0.0
Linux kernel (Cotroneo et al, 2013) C 93 33.3 22.5 0.0 25.8 0.0 12.9 5.3 0.0
Vim-Linux editor (Cotroneo et al, 2013) C 249 44.5 21.2 0.0 22.4 0.0 5.2 6.4 0.0
MySQL DBMS (Pecchia and Russo, 2012) C, C++ 15,102 52.9 20.5 0.0 15.3 0.0 0.0 11.2 0.0
CDEX-CD digital audio data extractor (Cotroneo et al, 2013) C, C++, Python 11 9.0 18.1 0.0 18.1 0.0 0.0 54.5 0.0
Struts (Basso et al, 2009) Java 99 48.4 18.1 0.0 9.0 0.0 4.0 20.2 0.0
Safety critical system for NASA spacecraft (Lutz and Mikulski, 2004) Java 199 17.0 15.5 2.0 0.0 2.5 29.1 5.0 10.5
Azureus (Basso et al, 2009) Java 125 36.8 15.2 0.0 11.2 0.0 28.8 8.0 0.0
Phex (Basso et al, 2009) Java 20 60.0 15.0 0.0 5.0 0.0 10.0 10.0 0.0
TAO Open DDS (Pecchia and Russo, 2012) Java 1,184 61.4 14.4 0.0 11.7 0.0 0.0 12.4 0.0
JEdit (Basso et al, 2009) Java 71 36.6 14.0 0.0 11.2 0.0 25.3 12.6 0.0
Tomcat (Basso et al, 2009) Java 169 57.9 12.4 0.0 13.6 0.0 2.3 13.6 0.0
FreeCiv-Strategy game (Cotroneo et al, 2013) Java 53 52.8 11.3 0.0 13.2 0.0 15.0 7.5 0.0
FreeMind (Basso et al, 2009) Java 90 46.6 11.1 0.0 2.2 0.0 28.8 11.1 0.0
MinGW-Minimalist GNU for Windows (Cotroneo et al, 2013) C 60 46.6 10.0 0.0 38.3 0.0 0.0 5.0 0.0
Java Enterprise Framework (Gupta et al, 2009) Java 223 3.1 16.7 36.0 3.1 0.0 21.2 19.9 0.0
Digital Cargo Files (Gupta et al, 2009) Java 438 3.3 12.4 31.0 10.4 0.5 31.3 9.9 1.2
Shipment and Allocation (Gupta et al, 2009) Java 649 22.8 6.7 18.7 10.9 0.5 29.8 9.3 1.6
Mirantis [This paper] Puppet 344 6.5 49.3 6.7 1.9 7.5 6.4 12.5 2.6
Mozilla [This paper] Puppet 558 7.7 36.5 6.4 17.0 2.3 10.0 1.9 8.4
Openstack [This paper] Puppet 1987 5.9 57.5 8.6 6.7 2.6 2.4 2.9 6.5
Wikimedia [This paper] Puppet 298 3.3 62.7 4.7 12.0 4.3 4.3 2.6 5.1
Table 4: Defect Categories of Previously Studied Software Systems

5.3 RQ3: Can the size of infrastructure as code (IaC) scripts provide a signal for improvement of the IaC development process?

Figure 4: Size of defective and neutral IaC scripts, measured in LOC. Panels (a), (b), (c), and (d) respectively present the sizes of defective and neutral scripts for Mirantis, Mozilla, Openstack, and Wikimedia.

As shown in Figure 4, we answer RQ3 by first reporting the distribution of size, in LOC, for defective and neutral scripts. In Figure 4, the y-axis presents the size of scripts in LOC, whereas the x-axis presents the two groups: ‘Defective’ and ‘Neutral’. The median size of a defective script is respectively 90.0, 53.0, 77.0, and 57.0 LOC for Mirantis, Mozilla, Openstack, and Wikimedia. The median size of a defective script is respectively 2.3, 2.1, 1.6, and 2.8 times larger than that of a neutral script for Mirantis, Mozilla, Openstack, and Wikimedia. The size difference is statistically significant for all four datasets, with an effect size of 0.5, 0.5, 0.3, and 0.5, respectively, for Mirantis, Mozilla, Openstack, and Wikimedia. Based on the Cliff’s delta values, the size difference between defective and neutral scripts is respectively ‘large’, ‘large’, ‘small’, and ‘large’ for Mirantis, Mozilla, Openstack, and Wikimedia.
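
To make this comparison concrete, the following minimal sketch shows one way to compute a Mann-Whitney U test and Cliff's delta for two groups of script sizes. The LOC values are hypothetical placeholders, not our measured data, and the sketch is illustrative rather than the exact analysis pipeline used in this paper.

```python
# Sketch: compare sizes of defective vs. neutral scripts (hypothetical LOC values).
from scipy.stats import mannwhitneyu

def cliffs_delta(xs, ys):
    """Cliff's delta: (#(x > y) - #(x < y)) / (len(xs) * len(ys))."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

defective_loc = [90, 120, 75, 210, 64, 150]   # placeholder values
neutral_loc   = [40, 35, 52, 61, 28, 44]      # placeholder values

u_stat, p_value = mannwhitneyu(defective_loc, neutral_loc, alternative="two-sided")
delta = cliffs_delta(defective_loc, neutral_loc)
print(f"U={u_stat}, p={p_value:.4f}, Cliff's delta={delta:.2f}")
```

Conventional thresholds, such as those of Romano et al. (Romano et al, 2006), can then be used to label the magnitude of delta as ‘small’, ‘medium’, or ‘large’.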

Figure 5: The Scott-Knott (SK) ranks for each defect category and each dataset. For a dataset, the same rank across all defect categories indicates that, with respect to size, there is no significant difference between the defect categories. For Mirantis, the ‘Assignment’, ‘Build/Package/Merge’, and ‘Documentation’ categories are smaller in size than the other defect categories.

In Figure 5, we report the results of our SK test. As shown in Figure 5, we do not observe any significant difference in script size between the defect categories for three of the datasets: for Mozilla, Openstack, and Wikimedia, the rank is the same for all nine defect categories. For Mirantis, the rank is highest (1) for the categories ‘Algorithm’, ‘Checking’, ‘Function’, ‘Interface’, ‘Timing’, and ‘Other’, and lowest (2) for the categories ‘Assignment’, ‘Build/Package/Merge’, and ‘Documentation’. Our findings indicate no trend consistent across the four datasets with respect to the relationship between size and defect category. The observation that script size differs across defect categories for Mirantis, but not for the other datasets, suggests that the IaC development process differs from one organization to another.

Figure 6: Distribution of script size for each defect category. Panels (a), (b), (c), and (d) respectively present the distributions for Mirantis, Mozilla, Openstack, and Wikimedia.

As part of our methodology, we also present the distribution of size for each defect category in Figure 6. The y-axis in each subplot presents the size in LOC for each defect category. For Mozilla, considering median script size, the category ‘Interface (I)’ is the largest (87.0 LOC), and the category ‘Checking (C)’ is the smallest (53.5 LOC). For Openstack and Wikimedia, the categories ‘Documentation (D)’ and ‘Build/Package/Merge (B)’ are the largest, with a median size of 160.0 LOC and 118.0 LOC, respectively, and the categories ‘Function (F)’ and ‘Other (O)’ are the smallest, with a median size of 86.0 LOC and 46.5 LOC, respectively. However, the differences in script size across defect categories are not significant, as suggested by the Scott-Knott test for Mozilla, Openstack, and Wikimedia (Figure 5). In the case of Mirantis, the category ‘Build/Package/Merge’ is the smallest, with a median size of 91 LOC, and the category ‘Algorithm’ is the largest, with a median size of 221 LOC.
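
The per-category medians above can be derived with a simple group-by computation. The sketch below illustrates the idea on a toy table whose column names ('defect_category', 'loc') and values are hypothetical.

```python
# Sketch: median script size (LOC) per defect category, assuming a table of
# defective scripts with hypothetical columns 'defect_category' and 'loc'.
import pandas as pd

scripts = pd.DataFrame({
    "defect_category": ["Assignment", "Assignment", "Checking", "Interface", "Documentation"],
    "loc":             [88,            74,           54,         87,          160],  # placeholder values
})

median_size = scripts.groupby("defect_category")["loc"].median().sort_values(ascending=False)
print(median_size)
```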

Our findings on the relationship between size and defect categories, and on the defect distribution, highlight organizational process differences. Considering script size, IaC scripts are not all developed in the same manner, nor do they contain the same categories of defects. That said, regardless of defect category, defective scripts are significantly larger than neutral scripts. This finding can be useful for (i) practitioners: if their IaC codebase includes large scripts, then those scripts can be allocated additional V&V resources; and (ii) researchers: future research can identify defective IaC scripts through machine learning techniques that use the size of IaC scripts as a feature.

Defective IaC scripts contain significantly more lines of code than neutral scripts. We recommend that practitioners invest additional verification and validation resources in IaC scripts that are relatively large in size.

5.4 RQ4: How frequently do defects occur in infrastructure as code scripts?

The defect density is respectively 27.6, 18.4, 16.2, and 17.1 defects per 1,000 LOC (KLOC) for Mirantis, Mozilla, Openstack, and Wikimedia. Prior research shows that defect densities can vary from one software system to another, and from one organization to another. For a Fortran-based satellite planning software system, Basili and Perricone (Basili and Perricone, 1984) reported defect density to vary from 6.4 to 16.0 per 1,000 LOC. Mohagheghi et al. (Mohagheghi et al, 2004) reported defect density to vary between 0.7 and 3.7 per KLOC for a telecom software system written in C, Erlang, and Java. For a Java-based system, researchers (Maximilien and Williams, 2003) reported a defect density of 4.0 defects/KLOC.
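
For reference, defect density as used here is simply the defect count divided by the size in KLOC. The sketch below shows the arithmetic with a hypothetical total LOC value; it is not our measured script size.

```python
# Sketch: defect density in defects per 1,000 lines of code (KLOC).
def defect_density(defect_count, total_loc):
    return defect_count / (total_loc / 1000.0)

# Hypothetical total LOC for illustration only (not our measured total).
print(defect_density(defect_count=344, total_loc=12500))  # ~27.5 defects/KLOC
```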

We notice that the defect density of IaC scripts is higher than that of previously studied software systems. We use Basili and Perricone’s observations (Basili and Perricone, 1984) as a possible explanation for our findings: they suggested that if defects are spread evenly across artifacts, then the overall defect density of the system can be high. A possible explanation is therefore that, for IaC, defects are distributed relatively evenly across scripts.

We report the overall temporal frequency values for the four datasets in Figure 7, applying smoothing to obtain visual trends. The x-axis presents the months, and the y-axis presents the overall temporal frequency value for each month. For Mozilla and Openstack we observe constant trends. For Wikimedia, defects initially appear more frequently and decrease over time. According to our Cox-Stuart test results, as shown in Table 5, none of the visible trends is statistically significant. The p-values obtained from the Cox-Stuart test output for Mirantis, Mozilla, Openstack, and Wikimedia are respectively 0.22, 0.23, 0.42, and 0.13. Our findings indicate that, overall, the frequency of defects does not significantly change over time for IaC scripts. Findings from Figure 7 also highlight adoption time differences for IaC: Wikimedia adopted IaC 11 years ago, whereas Mirantis adopted IaC in 2010, and Mozilla and Openstack adopted IaC in 2011.
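
For readers who wish to reproduce this kind of trend check, the following sketch implements the Cox-Stuart sign test (Cox and Stuart, 1955) on a monthly series. The series values are hypothetical placeholders, and the sketch assumes a recent SciPy version that provides scipy.stats.binomtest.

```python
# Sketch of the Cox-Stuart sign test for trend, applied to a monthly series of
# temporal frequency values (the series here is a hypothetical placeholder).
from scipy.stats import binomtest

def cox_stuart(series):
    n = len(series)
    half = n // 2
    # Pair the first half with the second half, skipping the middle point if n is odd.
    first, second = series[:half], series[n - half:]
    diffs = [b - a for a, b in zip(first, second) if b != a]  # drop ties
    pos = sum(1 for d in diffs if d > 0)
    trend = "increasing" if pos > len(diffs) / 2 else "decreasing"
    p = binomtest(pos, n=len(diffs), p=0.5, alternative="two-sided").pvalue
    return trend, p

monthly_frequency = [0.4, 0.35, 0.5, 0.3, 0.45, 0.32, 0.38, 0.41]  # placeholder values
print(cox_stuart(monthly_frequency))
```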

Figure 7: Temporal frequency values for all four datasets. Panels (a), (b), (c), and (d) respectively present the smoothed temporal frequency values for Mirantis, Mozilla, Openstack, and Wikimedia. Overall, defect-related commits exhibit consistent trends over time.
Output Mirantis Mozilla Openstack Wikimedia
Trend Increasing Decreasing Decreasing Decreasing
p-value 0.22 0.23 0.42 0.13
Table 5: Cox-Stuart Test Results for Overall Temporal Frequency of Defect Count

We use our findings related to temporal frequency to draw parallels with hardware and non-IaC software systems. For hardware systems, researchers (Smith, 1993) have reported the ‘bathtub curve’ trend, which states that when hardware systems are initially put into service, the frequency of defects is high. As time progresses, defect frequency decreases and remains constant during the ‘adult period’. After the adult period, defect frequency becomes high again as hardware systems enter the ‘wear out period’ (Smith, 1993). For IaC defects we do not observe such a temporal trend.

In traditional, non-IaC software systems, defect frequency is initially high, gradually decreases, and eventually becomes low as time progresses (Hartz et al, 1996). Eventually, all software systems enter the ‘obsolescence period’, where defect frequency remains consistent, as no significant upgrades or changes to the software are made (Hartz et al, 1996). For Wikimedia we observe a trend similar to that of traditional software systems: defect frequency is initially high, but decreases as time progresses. However, this observation does not generalize to the other three datasets. Also, according to the Cox-Stuart test, the visible trends are not statistically significant. One possible explanation relates to how IT organizations resolve existing defects: while fixing one set of defects, programmers may inadvertently introduce a new set of defects, resulting in an overall constant trend.

The defect density is 27.6, 18.4, 16.2, and 17.1 defects per KLOC, respectively, for Mirantis, Mozilla, Openstack, and Wikimedia. For all four datasets, we observe IaC defects to follow a consistent temporal trend, i.e., defect frequency does not change significantly over time.

6 Discussion

In this section, we discuss our findings with possible implications:

6.1 Implications for Process Improvement

One finding of our paper is the prevalence of assignment-related defects in IaC scripts. Software teams can use this finding to improve their process in two possible ways. First, they can use the practice of code review for developing IaC scripts. Code reviews can be conducted using automated tools and/or team members’ manual reviews. For example, through code reviews software teams can pinpoint the correct value of configurations at the development stage. Automated code review tools, such as linters, can also help in detecting and fixing syntax issues in IaC scripts at the development stage. Typically, IaC scripts are used by IT organizations that have implemented CD, and for these organizations, Kim et al. (Kim et al, 2016) recommend manual peer review methods, such as pair programming, to improve code quality.
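
To illustrate the kind of automated check a linter performs, the sketch below implements a single toy rule that flags unquoted numeric file modes in a Puppet manifest. It is not part of our methodology, and production tools such as puppet-lint implement far more rules.

```python
# Sketch: a toy, regex-based check for one assignment-style mistake in Puppet
# manifests: an unquoted numeric file mode (e.g., mode => 0644 instead of
# mode => '0644'). Real linters cover many more rules.
import re

UNQUOTED_MODE = re.compile(r"mode\s*=>\s*(\d{3,4})")

def check_manifest(text):
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = UNQUOTED_MODE.search(line)
        if match:
            findings.append((lineno, f"unquoted file mode {match.group(1)}; use a quoted octal string"))
    return findings

manifest = """
file { '/etc/app.conf':
  ensure => file,
  mode   => 0644,
  owner  => 'app',
}
"""
for lineno, message in check_manifest(manifest):
    print(f"line {lineno}: {message}")
```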

Second, software teams might benefit from unit testing of IaC scripts to reduce defects related to configuration assignment. We have observed in Section 5.1 that IaC defects mostly belong to the assignment category, which includes improper assignment of configuration values and syntax errors. Programmers can test whether correct configuration values are assigned by writing unit tests for components of IaC scripts. In this manner, instead of catching defects at run-time, which might lead to real-world consequences (e.g., the outage reported by Wikimedia Commons: http://tiny.cc/wik-outage-jan17), software teams might be able to catch defects in IaC scripts at the development stage with the help of testing.
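
In practice, Puppet code is typically unit tested with dedicated frameworks such as rspec-puppet. Purely to illustrate the idea of asserting an expected configuration value at development time, the following Python sketch checks a hypothetical manifest with a standard unit test; the manifest content and variable name are assumptions for illustration only.

```python
# Sketch (illustration only): a unit-test style assertion that a hypothetical
# Puppet manifest assigns the expected configuration value. In practice, Puppet
# code is usually tested with dedicated frameworks such as rspec-puppet.
import re
import unittest

MANIFEST = """
class app::web {
  $listen_port = 8080
}
"""

def extract_value(manifest, variable):
    match = re.search(rf"\${variable}\s*=\s*(\S+)", manifest)
    return match.group(1) if match else None

class ListenPortTest(unittest.TestCase):
    def test_listen_port_is_8080(self):
        self.assertEqual(extract_value(MANIFEST, "listen_port"), "8080")

if __name__ == "__main__":
    unittest.main()
```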

6.2 Future Research

Our findings have the potential to facilitate further research in the area of IaC defects. In Section 5.3, we observed that the size of IaC scripts is correlated with defects. This finding can be helpful in building size-based prediction models to identify defective IaC scripts. In Sections 5.3 and 5.4, we observed process differences between IT organizations; future research can systematically investigate whether such process differences exist in IaC development, and why.
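
As a starting point for such research, the sketch below trains a size-only defect classifier. The LOC values and labels are synthetic placeholders, and a real study would use the mined scripts together with a proper validation setup (e.g., the model validation practices discussed by Tantithamthavorn et al. (Tantithamthavorn et al, 2017)).

```python
# Sketch: a size-based defect prediction model, using script size (LOC) as the
# only feature. The sizes and labels below are synthetic placeholders.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

loc       = [[35], [42], [60], [88], [120], [150], [48], [200], [30], [95]]
defective = [0,    0,    0,    1,    1,     1,     0,    1,     0,    1]

x_train, x_test, y_train, y_test = train_test_split(loc, defective, test_size=0.3, random_state=0)
model = LogisticRegression().fit(x_train, y_train)
print("held-out accuracy:", model.score(x_test, y_test))
```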

We have applied a qualitative process to categorize defects using the defect type attribute of ODC. We acknowledge that our process is manual and labor-intensive. We advocate future research that investigates how the ODC categorization process can be automated.
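
One possible direction, shown as a deliberately naive sketch below, is keyword matching over commit messages to suggest a candidate ODC category for a rater to confirm. The keyword lists are illustrative guesses, not a validated replacement for manual ODC categorization.

```python
# Sketch: a naive, keyword-based suggester that maps a commit message to a
# candidate ODC defect category. The keyword lists are illustrative guesses.
KEYWORDS = {
    "assignment":    ["typo", "wrong value", "config", "syntax", "default"],
    "checking":      ["validate", "check", "null", "missing condition"],
    "documentation": ["readme", "comment", "docs"],
    "timing":        ["race", "timeout", "order"],
}

def suggest_category(commit_message):
    text = commit_message.lower()
    scores = {cat: sum(word in text for word in words) for cat, words in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unclassified"

print(suggest_category("Fix wrong default value for ssl config"))  # -> assignment
```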

7 Threats To Validity

We describe the threats to validity of our paper as follows:

  • Conclusion Validity: Our findings are subject to conclusion validity, as the defect categorization process involves human judgment. Our approach is based on qualitative analysis, in which raters categorized XCMs and assigned defect categories. We acknowledge that the process is susceptible to human judgment, and that the raters’ experience can bias the assigned categories. The accompanying human subjectivity can influence the defect category distribution for the IaC scripts of interest.

    We mitigated this threat by assigning multiple raters for the same set of XCMs. Next, we used a resolver, who resolved the disagreements. Further, we cross-checked our categorization with practitioners who authored the XCMs, and observed ‘substantial’ to ‘almost perfect’ agreement.

  • Internal Validity: Our empirical study is based on OSS repositories maintained by four organizations: Mirantis, Mozilla, Openstack, and Wikimedia Commons. The selected repositories are subject to selection bias, as we extract the required IaC scripts only from the OSS domain. We acknowledge that our sample of IaC scripts can be limiting, and we hope to mitigate this limitation by collecting IaC scripts from proprietary sources.

    We have used a combination of commit messages and issue report descriptions to determine if an IaC script is associated with a defect. We acknowledge that these messages might not have given the full context for the raters. Other sources of information such as practitioner input, and code changes that take place in each commit could have provided the raters better context to categorize the XCMs.

    We acknowledge that we have not used the trigger attribute of ODC, as our data sources do not include any information from which we can infer trigger attributes of ODC such as, ‘design conformance’, ‘side effects’, and ‘concurrency’.

  • Construct validity: Our process of using human raters to determine defect categories can be limiting, as the process is susceptible to mono-method bias, where subjective judgment of raters can influence the findings. We mitigated this threat by using multiple raters.

    Also, for Mirantis and Wikimedia, we used graduate students who performed the categorization as part of their class work. Students who participated in the categorization process can be subject to evaluation apprehension, i.e. consciously or sub-consciously relating their performance with the grades they would achieve for the course. We mitigated this threat by clearly explaining to the students that their performance in the categorization process would not affect their grades.

    The raters involved in the categorization process had, on average, at least two years of professional experience in software engineering. Their experience in software engineering may make the raters curious about the expected outcomes of the categorization process, which may affect the resulting defect category distribution. Furthermore, the resolver also has professional experience in software engineering and IaC script development, which could influence the outcome of the defect category distribution.

  • External Validity: Our scripts are collected from the OSS domain, and not from proprietary sources. Our findings are subject to external validity, as they may not generalize beyond the studied organizations and scripts.

    We construct our datasets using Puppet, which is a declarative language. Our findings may not generalize to IaC scripts written in imperative languages. We hope to mitigate this limitation in future work by analyzing IaC scripts written in imperative languages.

8 Conclusion

Use of IaC scripts is crucial for the automated maintenance of software delivery and deployment infrastructure. Similar to software source code, IaC scripts churn frequently and can include defects that have serious real-world consequences. IaC defect categorization can help IT organizations identify opportunities to improve IaC development. We have conducted an empirical analysis using four datasets from four organizations, namely Mirantis, Mozilla, Openstack, and Wikimedia Commons. With 89 raters, we applied the defect type attribute of the orthogonal defect classification (ODC) methodology to categorize the defects. For all four datasets, we have observed that the majority of the defects are related to assignment, i.e., defects related to syntax errors and erroneous configuration assignments. We have observed that, compared to IaC scripts, assignment-related defects occur less frequently in non-IaC software. We have also observed that the defect density is 27.6, 18.4, 16.2, and 17.1 defects per KLOC, respectively, for Mirantis, Mozilla, Openstack, and Wikimedia, and that for all four datasets IaC defects follow a consistent temporal trend. We hope our findings will facilitate further research in IaC defect analysis.

Acknowledgements.
The Science of Security Lablet at the North Carolina State University supported this research study. We thank the Realsearch research group members for their useful feedback. We also thank the practitioners who answered our questions.

References

  • Alali et al (2008) Alali A, Kagdi H, Maletic JI (2008) What’s a typical commit? a characterization of open source software repositories. In: 2008 16th IEEE International Conference on Program Comprehension, pp 182–191, DOI 10.1109/ICPC.2008.24
  • Basili and Perricone (1984) Basili VR, Perricone BT (1984) Software errors and complexity: An empirical investigation. Commun ACM 27(1):42–52, DOI 10.1145/69605.2085, URL http://doi.acm.org/10.1145/69605.2085
  • Basso et al (2009) Basso T, Moraes R, Sanches BP, Jino M (2009) An investigation of java faults operators derived from a field data study on java software faults. In: Workshop de Testes e Tolerância a Falhas, pp 1–13
  • Battin et al (2001) Battin RD, Crocker R, Kreidler J, Subramanian K (2001) Leveraging resources in global software development. IEEE Software 18(2):70–77, DOI 10.1109/52.914750
  • van der Bent et al (2018) van der Bent E, Hage J, Visser J, Gousios G (2018) How good is your puppet? an empirically defined and validated quality model for puppet. In: 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp 164–174, DOI 10.1109/SANER.2018.8330206
  • Chillarege et al (1992) Chillarege R, Bhandari IS, Chaar JK, Halliday MJ, Moebus DS, Ray BK, Wong MY (1992) Orthogonal defect classification-a concept for in-process measurements. IEEE Transactions on Software Engineering 18(11):943–956, DOI 10.1109/32.177364
  • Christmansson and Chillarege (1996) Christmansson J, Chillarege R (1996) Generation of an error set that emulates software faults based on field data. In: Proceedings of Annual Symposium on Fault Tolerant Computing, pp 304–313, DOI 10.1109/FTCS.1996.534615
  • Cinque et al (2014) Cinque M, Cotroneo D, Corte RD, Pecchia A (2014) Assessing direct monitoring techniques to analyze failures of critical industrial systems. In: 2014 IEEE 25th International Symposium on Software Reliability Engineering, pp 212–222, DOI 10.1109/ISSRE.2014.30
  • Cito et al (2015) Cito J, Leitner P, Fritz T, Gall HC (2015) The making of cloud applications: An empirical study on software development for the cloud. In: Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ACM, New York, NY, USA, ESEC/FSE 2015, pp 393–403, DOI 10.1145/2786805.2786826, URL http://doi.acm.org/10.1145/2786805.2786826
  • Cliff (1993) Cliff N (1993) Dominance statistics: Ordinal analyses to answer ordinal questions. Psychological Bulletin 114(3):494–509
  • Cohen (1960) Cohen J (1960) A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20(1):37–46, DOI 10.1177/001316446002000104, URL http://dx.doi.org/10.1177/001316446002000104
  • Commons (2018) Commons W (2018) Projects: Gerrit Wikimedia. https://gerrit.wikimedia.org/r/#/admin/projects/, [Online; accessed 12-September-2018]
  • Cotroneo et al (2013) Cotroneo D, Pietrantuono R, Russo S (2013) Testing techniques selection based on odc fault types and software metrics. J Syst Softw 86(6):1613–1637, DOI 10.1016/j.jss.2013.02.020, URL http://dx.doi.org/10.1016/j.jss.2013.02.020
  • Cox and Stuart (1955) Cox DR, Stuart A (1955) Some quick sign tests for trend in location and dispersion. Biometrika 42(1/2):80–95, URL http://www.jstor.org/stable/2333424
  • Duraes and Madeira (2006) Duraes JA, Madeira HS (2006) Emulation of software faults: A field data study and a practical approach. IEEE Trans Softw Eng 32(11):849–867, DOI 10.1109/TSE.2006.113, URL http://dx.doi.org/10.1109/TSE.2006.113
  • Fenton and Ohlsson (2000) Fenton NE, Ohlsson N (2000) Quantitative analysis of faults and failures in a complex software system. IEEE Trans Softw Eng 26(8):797–814, DOI 10.1109/32.879815, URL http://dx.doi.org/10.1109/32.879815
  • Fonseca and Vieira (2008) Fonseca J, Vieira M (2008) Mapping software faults with web security vulnerabilities. In: 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN), pp 257–266, DOI 10.1109/DSN.2008.4630094
  • Fowler (2010) Fowler M (2010) Domain Specific Languages, 1st edn. Addison-Wesley Professional
  • Freimut et al (2005) Freimut B, Denger C, Ketterer M (2005) An industrial case study of implementing and validating defect classification for process improvement and quality management. In: 11th IEEE International Software Metrics Symposium (METRICS’05), pp 10 pp.–19, DOI 10.1109/METRICS.2005.10
  • Gupta et al (2009) Gupta A, Li J, Conradi R, Rønneberg H, Landre E (2009) A case study comparing defect profiles of a reused framework and of applications reusing it. Empirical Softw Engg 14(2):227–255, DOI 10.1007/s10664-008-9081-9, URL http://dx.doi.org/10.1007/s10664-008-9081-9
  • Hanappi et al (2016) Hanappi O, Hummer W, Dustdar S (2016) Asserting reliable convergence for configuration management scripts. SIGPLAN Not 51(10):328–343, DOI 10.1145/3022671.2984000, URL http://doi.acm.org/10.1145/3022671.2984000
  • Harlan (1987) Harlan D (1987) Cleanroom Software Engineering
  • Hartz et al (1996) Hartz MA, Walker EL, Mahar D (1996) Introduction to software reliability: A state of the art review. The Center
  • Hatton (1997) Hatton L (1997) Reexamining the fault density-component size connection. IEEE Softw 14(2):89–97, DOI 10.1109/52.582978, URL http://dx.doi.org/10.1109/52.582978
  • Henningsson and Wohlin (2004) Henningsson K, Wohlin C (2004) Assuring fault classification agreement - an empirical evaluation. In: Proceedings of the 2004 International Symposium on Empirical Software Engineering, IEEE Computer Society, Washington, DC, USA, ISESE ’04, pp 95–104, DOI 10.1109/ISESE.2004.13, URL http://dx.doi.org/10.1109/ISESE.2004.13
  • Herzig et al (2013) Herzig K, Just S, Zeller A (2013) It’s not a bug, it’s a feature: How misclassification impacts bug prediction. In: 2013 35th International Conference on Software Engineering (ICSE), pp 392–401, DOI 10.1109/ICSE.2013.6606585
  • Humble and Farley (2010) Humble J, Farley D (2010) Continuous Delivery: Reliable Software Releases Through Build, Test, and Deployment Automation, 1st edn. Addison-Wesley Professional
  • IEEE (2010) IEEE (2010) Ieee standard classification for software anomalies. IEEE Std 1044-2009 (Revision of IEEE Std 1044-1993) pp 1–23, DOI 10.1109/IEEESTD.2010.5399061
  • Jiang and Adams (2015) Jiang Y, Adams B (2015) Co-evolution of infrastructure and source code: An empirical study. In: Proceedings of the 12th Working Conference on Mining Software Repositories, IEEE Press, Piscataway, NJ, USA, MSR ’15, pp 45–55, URL http://dl.acm.org/citation.cfm?id=2820518.2820527
  • Kim et al (2016) Kim G, Debois P, Willis J, Humble J (2016) The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations. IT Revolution Press
  • Labs (2017) Labs P (2017) Puppet Documentation. https://docs.puppet.com/, [Online; accessed 19-July-2017]
  • Landis and Koch (1977) Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics 33(1):159–174, URL http://www.jstor.org/stable/2529310
  • Lutz and Mikulski (2004) Lutz RR, Mikulski IC (2004) Empirical analysis of safety-critical anomalies during operations. IEEE Transactions on Software Engineering 30(3):172–180, DOI 10.1109/TSE.2004.1271171
  • Lyu et al (2003) Lyu MR, Huang Z, Sze SKS, Cai X (2003) An empirical study on testing and fault tolerance for software reliability engineering. In: 14th International Symposium on Software Reliability Engineering, 2003. ISSRE 2003., pp 119–130, DOI 10.1109/ISSRE.2003.1251036
  • Mann and Whitney (1947) Mann HB, Whitney DR (1947) On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics 18(1):50–60, URL http://www.jstor.org/stable/2236101
  • Maximilien and Williams (2003) Maximilien EM, Williams L (2003) Assessing test-driven development at ibm. In: Proceedings of the 25th International Conference on Software Engineering, IEEE Computer Society, Washington, DC, USA, ICSE ’03, pp 564–569, URL http://dl.acm.org/citation.cfm?id=776816.776892
  • McConnell (2004) McConnell S (2004) Code complete - a practical handbook of software construction, 2nd Edition
  • McCune and Jeffrey (2011) McCune JT, Jeffrey (2011) Pro Puppet, 1st edn. Apress, DOI 10.1007/978-1-4302-3058-8, URL https://www.springer.com/gp/book/9781430230571
  • Mirantis (2018) Mirantis (2018) Mirantis. https://github.com/Mirantis, [Online; accessed 12-September-2018]
  • Mohagheghi et al (2004) Mohagheghi P, Conradi R, Killi OM, Schwarz H (2004) An empirical study of software reuse vs. defect-density and stability. In: Proceedings of the 26th International Conference on Software Engineering, IEEE Computer Society, Washington, DC, USA, ICSE ’04, pp 282–292, URL http://dl.acm.org/citation.cfm?id=998675.999433
  • Moller and Paulish (1993) Moller KH, Paulish DJ (1993) An empirical investigation of software fault distribution. In: [1993] Proceedings First International Software Metrics Symposium, pp 82–90, DOI 10.1109/METRIC.1993.263798
  • Mozilla (2018) Mozilla (2018) Mercurial repositories index. hg.mozilla.org, [Online; accessed 12-September-2018]
  • Munaiah et al (2017) Munaiah N, Kroh S, Cabrey C, Nagappan M (2017) Curating github for engineered software projects. Empirical Software Engineering pp 1–35, DOI 10.1007/s10664-017-9512-6, URL http://dx.doi.org/10.1007/s10664-017-9512-6
  • Openstack (2018) Openstack (2018) OpenStack git repository browser. http://git.openstack.org/cgit/, [Online; accessed 12-September-2018]
  • Parnin et al (2017) Parnin C, Helms E, Atlee C, Boughton H, Ghattas M, Glover A, Holman J, Micco J, Murphy B, Savor T, Stumm M, Whitaker S, Williams L (2017) The top 10 adages in continuous deployment. IEEE Software 34(3):86–95, DOI 10.1109/MS.2017.86
  • Pecchia and Russo (2012) Pecchia A, Russo S (2012) Detection of software failures through event logs: An experimental study. In: 2012 IEEE 23rd International Symposium on Software Reliability Engineering, pp 31–40, DOI 10.1109/ISSRE.2012.24
  • Puppet (2018) Puppet (2018) Ambit energy’s competitive advantage? it’s really a devops software company. Tech. rep., Puppet, URL https://puppet.com/resources/case-study/ambit-energy
  • Rahman and Williams (2018) Rahman A, Williams L (2018) Characterizing defective configuration scripts used for continuous deployment. In: 2018 IEEE 11th International Conference on Software Testing, Verification and Validation (ICST), pp 34–45, DOI 10.1109/ICST.2018.00014
  • Rahman et al (2017) Rahman A, Partho A, Meder D, Williams L (2017) Which factors influence practitioners’ usage of build automation tools? In: Proceedings of the 3rd International Workshop on Rapid Continuous Software Engineering, IEEE Press, Piscataway, NJ, USA, RCoSE ’17, pp 20–26, DOI 10.1109/RCoSE.2017.8, URL https://doi.org/10.1109/RCoSE.2017.8
  • Rahman et al (2018) Rahman A, Partho A, Morrison P, Williams L (2018) What questions do programmers ask about configuration as code? In: Proceedings of the 4th International Workshop on Rapid Continuous Software Engineering, ACM, New York, NY, USA, RCoSE ’18, pp 16–22, DOI 10.1145/3194760.3194769, URL http://doi.acm.org/10.1145/3194760.3194769
  • Rahman et al (2015) Rahman AAU, Helms E, Williams L, Parnin C (2015) Synthesizing continuous deployment practices used in software development. In: Proceedings of the 2015 Agile Conference, IEEE Computer Society, Washington, DC, USA, AGILE ’15, pp 1–10, DOI 10.1109/Agile.2015.12, URL http://dx.doi.org/10.1109/Agile.2015.12
  • Ray et al (2016) Ray B, Hellendoorn V, Godhane S, Tu Z, Bacchelli A, Devanbu P (2016) On the ”naturalness” of buggy code. In: Proceedings of the 38th International Conference on Software Engineering, ACM, New York, NY, USA, ICSE ’16, pp 428–439, DOI 10.1145/2884781.2884848, URL http://doi.acm.org/10.1145/2884781.2884848
  • Romano et al (2006) Romano J, Kromrey J, Coraggio J, Skowronek J (2006) Appropriate statistics for ordinal level data: Should we really be using t-test and Cohen’s d for evaluating group differences on the NSSE and other surveys? In: Annual meeting of the Florida Association of Institutional Research, pp 1–3

  • Shambaugh et al (2016) Shambaugh R, Weiss A, Guha A (2016) Rehearsal: A configuration verification tool for puppet. SIGPLAN Not 51(6):416–430, DOI 10.1145/2980983.2908083, URL http://doi.acm.org/10.1145/2980983.2908083
  • Sharma et al (2016) Sharma T, Fragkoulis M, Spinellis D (2016) Does your configuration code smell? In: Proceedings of the 13th International Conference on Mining Software Repositories, ACM, New York, NY, USA, MSR ’16, pp 189–200, DOI 10.1145/2901739.2901761, URL http://doi.acm.org/10.1145/2901739.2901761
  • Smith (1993) Smith AM (1993) Reliability-centered maintenance, vol 83. McGraw-Hill New York
  • Sullivan and Chillarege (1991) Sullivan M, Chillarege R (1991) Software defects and their impact on system availability-a study of field failures in operating systems. In: [1991] Digest of Papers. Fault-Tolerant Computing: The Twenty-First International Symposium, pp 2–9, DOI 10.1109/FTCS.1991.146625
  • Tantithamthavorn et al (2017) Tantithamthavorn C, McIntosh S, Hassan AE, Matsumoto K (2017) An empirical comparison of model validation techniques for defect prediction models. IEEE Trans Softw Eng 43(1):1–18, DOI 10.1109/TSE.2016.2584050, URL https://doi.org/10.1109/TSE.2016.2584050
  • Thung et al (2012) Thung F, Wang S, Lo D, Jiang L (2012) An empirical study of bugs in machine learning systems. In: 2012 IEEE 23rd International Symposium on Software Reliability Engineering, pp 271–280, DOI 10.1109/ISSRE.2012.22
  • Voelter (2013) Voelter M (2013) DSL Engineering: Designing, Implementing and Using Domain-Specific Languages. CreateSpace Independent Publishing Platform, USA
  • Zhang et al (2016) Zhang F, Mockus A, Keivanloo I, Zou Y (2016) Towards building a universal defect prediction model with rank transformed predictors. Empirical Softw Engg 21(5):2107–2145, DOI 10.1007/s10664-015-9396-2, URL http://dx.doi.org/10.1007/s10664-015-9396-2
  • Zhang et al (2017) Zhang F, Hassan AE, McIntosh S, Zou Y (2017) The use of summation to aggregate software metrics hinders the performance of defect prediction models. IEEE Transactions on Software Engineering 43(5):476–491, DOI 10.1109/TSE.2016.2599161
  • Zheng et al (2006) Zheng J, Williams L, Nagappan N, Snipes W, Hudepohl JP, Vouk MA (2006) On the value of static analysis for fault detection in software. IEEE Transactions on Software Engineering 32(4):240–253, DOI 10.1109/TSE.2006.38