Toward Understanding Deep Learning Framework Bugs

DL frameworks are the basis of constructing all DL programs and models, and thus their bugs could lead to unexpected behaviors in any DL program or model relying on them. Such a wide impact demonstrates the necessity and importance of guaranteeing DL frameworks' quality. Understanding the characteristics of DL framework bugs is a fundamental step for this quality assurance task, as it facilitates the design of effective bug detection and debugging approaches. Hence, in this work we conduct the largest study to date on 800 bugs from four popular and diverse DL frameworks (i.e., TensorFlow, PyTorch, MXNet, and DL4J). By analyzing the root causes and symptoms of DL framework bugs associated with 5 components decomposed from DL frameworks, as well as measuring the test coverage achieved by three state-of-the-art testing techniques and developers' efforts on fixing those bugs, we obtain 14 major findings for the comprehensive understanding of DL framework bugs and the current status of existing DL framework testing and debugging practice, and then provide a series of actionable guidelines for better DL framework bug detection and debugging.


1. Introduction

In recent years, Deep Learning (DL) systems have become one of the most popular types of software systems and have been widely used in many domains, such as autonomous driving (Chen et al., 2015), aircraft collision avoidance (Julian et al., 2016), and software engineering (Ferreira et al., 2019). However, like traditional software, DL systems also contain bugs, which could lead to huge economic losses and even threaten human lives. For example, in 2018, an Uber autonomous driving car killed a pedestrian in Arizona (32) and a Tesla Model S in autopilot mode crashed into a fire truck parked with lights flashing on a California freeway (33). Therefore, guaranteeing the quality of DL systems is critical.

A DL system typically involves three levels (Yan et al., 2021b): the production level (i.e., DL models), the program level (i.e., DL programs used for training DL models), and the framework level (i.e., DL frameworks used by developers for implementing DL programs). Bugs at any level could affect the overall quality of the DL system. Thus, it is necessary to ensure DL systems' quality at all of these levels. Over the years, a lot of research has focused on the production level, by designing various DL model testing metrics (Kim et al., 2019; Ma et al., 2018a, c) or proposing various adversarial input generation methods (Goodfellow et al., 2015; Kurakin et al., 2017), as well as on the program level, by studying the characteristics of DL program bugs (Zhang et al., 2018b; Islam et al., 2019; Humbatova et al., 2020) or designing bug detection and diagnosis methods (Yan et al., 2021a; Zhang et al., 2021a; Wardat et al., 2021). However, little attention has been paid to the framework level. In fact, DL frameworks are the basis of constructing all DL programs and models, and thus their bugs could produce a much wider effect than the bugs in a specific DL program or model. Therefore, it is essential to put more effort into ensuring the quality of DL frameworks, and this work focuses on the framework level.

Indeed, DL frameworks' quality has begun to receive attention recently, and some DL framework testing techniques have been proposed (Pham et al., 2019; Wang et al., 2020b; Guo et al., 2020; Zhang et al., 2021b). Although they have been demonstrated to be effective in detecting some new bugs in their experiments, they tend to treat the DL framework under test as a black box and lack a comprehensive understanding of DL framework bug characteristics (such as root causes and bug distribution). As a result, it remains unknown whether these techniques are good enough and how to design more effective bug detection techniques. Moreover, this lack of understanding could limit the development of DL framework bug diagnosis techniques, since such tasks require a much deeper understanding of the detected bugs. That is, comprehensively understanding the characteristics of DL framework bugs is the fundamental task in the area of DL framework quality assurance, and it is also the goal of our work.

In the literature, some studies investigating DL bug characteristics have been conducted (Zhang et al., 2018b; Islam et al., 2019; Humbatova et al., 2020), but almost all of them target DL program bugs rather than DL framework bugs. Due to the differences between DL programs and DL frameworks, their bug characteristics also differ. Specifically, a DL program invokes the code of a DL framework to implement the corresponding functionality, and thus DL program bugs actually refer to those caused by incorrect usage of the DL framework rather than bugs inside DL framework code. Regarding DL framework bug characteristics, Jia et al. (Jia et al., 2020) made the only attempt so far, but it is still not enough to comprehensively understand bugs across the family of DL frameworks due to its small scale and limited study points (e.g., studying only one DL framework from three aspects). More details about the differences between our work and these existing studies can be found in Section 6. Hence, in this work we conduct a comprehensive study to facilitate a sufficient understanding of DL framework bugs.

Specifically, we used four popular DL frameworks in the study, including TensorFlow (41) from Google, PyTorch (36) from Facebook, MXNet (30) from Apache, and Deeplearning4j (DL4J) (5) from Eclipse, as experimental subjects. These subjects have great diversity, such as involving both static and dynamic computational graphs, various programming languages for implementations, and different development organizations. In total, we studied 800 real bugs that were collected from their bug repositories and manually labeled according to a systematic process (to be presented in Section 3). Based on the 800 bugs from four DL frameworks, our study aims to address the following research questions (RQs):


  • RQ1: What are the root causes of DL framework bugs and their distribution?

    The root causes are helpful to understand the nature of DL framework bugs. We not only classify the root causes of DL framework bugs, but also analyze their distribution on each component of DL frameworks (we will introduce the components of DL frameworks in Section 2). The distribution results can make the testing and debugging of each kind of DL framework bug more targeted.

  • RQ2: What are the symptoms of DL framework bugs and their distribution? The symptoms are helpful to understand the effect of DL framework bugs. We not only classify the symptoms of DL framework bugs, but also analyze their distribution and investigate in which stage we observe these symptoms. The results can guide the improvement of test oracles for effective testing of DL frameworks.

  • RQ3: What is the relationship between root causes and symptoms of DL framework bugs? After studying the root causes and symptoms of DL framework bugs individually, we can obtain more comprehensive information about the bugs by investigating which root cause is more likely to produce a specific bug symptom.

  • RQ4: Do the bugs of different DL frameworks have commonality? We investigate whether there is some relationship among the bugs of different DL frameworks. This is helpful for guiding the design of more general testing and debugging techniques for DL frameworks. Also, it may improve the testing and debugging practice of a DL framework by drawing on the experience gained from the testing and debugging practice of other DL frameworks.

  • RQ5: What is the current status of existing DL framework testing techniques? Some automated testing techniques have been proposed recently to detect DL framework bugs. Although they have been demonstrated to be effective in detecting some new bugs, their testing capability has not been deeply investigated. To better understand the current status of these techniques, we study them in terms of the widely-used test coverage on each DL-framework component.

  • RQ6: What is the status of the current DL framework debugging practice? To the best of our knowledge, no automated debugging technique specific to DL framework bugs has been proposed, and manual debugging is still the mainstream method. Based on our identified root causes, we analyze developers' fixing effort for each kind of bug in terms of both fixing time and patch size in order to deeply understand the current debugging practice.

RQs 1-4 aim to understand the characteristics of DL framework bugs from several aspects (including individual aspects and their correlations), while RQs 5-6 aim to investigate the current status of existing testing and debugging practice for DL framework bugs associated with the identified characteristics. Systematically answering these RQs helps comprehensively understand DL framework bugs and the limitations of the current testing and debugging practice, which could guide the direction of further improving testing and debugging performance by incorporating the identified bug characteristics.

In our study, we decomposed a DL framework into 5 components, identified 13 root causes and 6 symptoms of DL framework bugs, investigated three state-of-the-art DL framework testing techniques in terms of test coverage, and examined the efforts involved in the manual debugging practice. Through studying each of these aspects individually and associating different aspects together, we obtained 14 major findings, and further provide a series of actionable guidelines for future DL framework bug detection and debugging.

To sum up, we make the following major contributions:


  • We conduct a comprehensive study on DL framework bugs based on 800 real bugs from four popular and diverse DL frameworks.

  • We provide a classification of root causes and symptoms of DL framework bugs, and associate them with each component of DL frameworks.

  • We investigate the current DL framework testing and debugging practice regarding test coverage and developers’ debugging effort.

  • We provide a series of guidelines for future DL framework testing and debugging according to our findings.

2. Deep Learning Frameworks

DL frameworks are the basis of implementing DL programs and building DL models. To complete a prediction task, developers have to implement a DL program by invoking the APIs provided by a DL framework, and then a model can be built by executing the DL program on training data. The core functionalities of a DL program include determining the structure of a neural network (e.g., selecting proper layers and setting their order) and configuring the training process (e.g., setting the optimizer and loss function). All the specific implementations for these DL functionalities are inside the used DL framework. Besides implementing various DL functionalities, DL frameworks are the bridge between DL functionalities and various hardware, and thus they also implement strategies to support DL functionalities on different hardware. Indeed, DL frameworks are important for DL development and very complex, especially compared with the widely studied DL programs.
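To make this division of labor concrete, the following minimal sketch (assuming PyTorch as the framework; the tiny model and random data are purely illustrative) shows a DL program that only selects layers and configures training, while all underlying functionality is provided by the framework.

```python
# Minimal illustrative DL program built on a DL framework (PyTorch here);
# the tiny model and random data are placeholders, not a real task.
import torch
import torch.nn as nn

# Determine the structure of the neural network (layers and their order).
model = nn.Sequential(
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

# Configure the training process (optimizer and loss function).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# One training step on random data; the framework performs all underlying
# graph handling, operation execution, and gradient computation.
inputs = torch.randn(64, 28 * 28)
labels = torch.randint(0, 10, (64,))
optimizer.zero_grad()
loss = loss_fn(model(inputs), labels)
loss.backward()
optimizer.step()
```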

Figure 1. Architecture of DL frameworks

According to their functionality, DL frameworks are decomposed into a five-level architecture in our work, as shown in Figure 1, for a better understanding of bugs. It consists of five major components: User-Level API, Graph-Level Implementation, Operation Implementation, General Utility, and Environment-Dependent Processing, where User-Level API is the highest-level component that can be directly accessed by users to implement their DL programs, while Environment-Dependent Processing is the lowest-level one that is related to the underlying infrastructure.

1⃝ User-Level API. This component contains a large number of high-level APIs, which aim to provide convenience for users to conduct their DL tasks with DL frameworks. According to the DL workflow, this component can be further divided into four sub-components: 1) Data-Processing API aims to process the input data (e.g., images and text) to make them meet the corresponding requirements of a DL model, such as image resizing and text tokenization. 2) Model-Building API aims to construct a model structure and search for a group of optimal parameters for the model so that it fits the training data well via a given optimization target (e.g., a loss function). For example, APIs for various layers and loss functions, as well as various optimizers (e.g., SGD and Adam), belong to this sub-component. 3) Model-Deployment API aims to integrate a trained DL model into an existing production environment to make practical predictions. Typically, it involves the processing (such as model quantization) that makes a DL model work in a specific environment. 4) Utility API. There are many utility APIs across the whole DL workflow, which provide auxiliary functionalities to facilitate the DL process, such as model visualization and checkpointing.

2⃝ Graph-Level Implementation. After implementing a DL program based on these user-level APIs, the following process is mainly based on a static or dynamic computational graph. A computational graph is a directed graph, in which each node represents an operation (such as a convolution operation). An operation can feed its outputs to another operation through an edge, and the values that flow along an edge are tensors. This component contains all the computational-graph-level implementations in DL frameworks.

According to the functionalities on a computational graph, this component can be divided into three sub-components: 1) Graph Construction, which aims to create a computational graph and obtain subgraphs via partitioning a graph for distributed execution (especially for a static graph). 2) Graph Transformation, which is responsible for graph optimization (e.g., common subexpression elimination and operation fusion) to improve computation performance, and graph conversion (e.g., converting to the ONNX format). 3) Graph Execution, which aims to execute the graph in a runtime environment, including local and distributed execution. For example, the execution process involves data propagation and gradient computation. Note that the functionalities of these sub-components are conducted in order for static computational graphs, but are mostly intertwined for dynamic computational graphs.
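To make the node/edge/tensor terminology concrete, the following plain-Python sketch models a toy computational graph; the OpNode class is our own illustrative construct and does not correspond to any framework's real graph API.

```python
# Illustrative-only sketch of a computational graph; the class and function
# names are hypothetical and not taken from any DL framework.
import numpy as np

class OpNode:
    """A graph node: an operation consuming and producing tensors."""
    def __init__(self, name, fn, inputs):
        self.name = name      # e.g., "matmul" or "relu"
        self.fn = fn          # the operation implementation
        self.inputs = inputs  # incoming edges (other OpNodes or constant tensors)

    def run(self):
        # Tensors flow along the edges from input nodes into this node.
        values = [i.run() if isinstance(i, OpNode) else i for i in self.inputs]
        return self.fn(*values)

# Build a tiny graph computing y = relu(x @ w) and execute it.
x = np.random.randn(2, 3)
w = np.random.randn(3, 4)
matmul = OpNode("matmul", lambda a, b: a @ b, [x, w])
relu = OpNode("relu", lambda a: np.maximum(a, 0), [matmul])

print(relu.run().shape)  # graph execution yields a (2, 4) tensor
```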

3⃝ Operation Implementation. As presented above, each node in a graph is an operation. This component contains all the specific implementations of these operations. An operation takes zero or more tensors as input and produces zero or more tensors as output. There are a large number of operations implemented in DL frameworks, such as convolution operations, pooling operations, batch normalization operations, mathematical operations (e.g., log), and array manipulation operations (e.g., shuffle).

4⃝ General Utility. To facilitate the implementations of the above components, there are many general utilities in DL frameworks, including common data structures and common functions (such as type conversion and padding functions). This component includes all these general utilities.

5⃝ Environment-Dependent Processing. This component is the lowest-level one, which aims to make the functionalities of DL frameworks work well in different environments. A typical example in this component is the memory allocation strategies on different devices, which aim to achieve high efficiency on each device by considering its corresponding characteristics. That is, this component contains all the implementations establishing connections between the functionalities of DL frameworks and environments, including both hardware environments (e.g., GPU) and software environments (e.g., different operating systems).

We select four popular DL frameworks, i.e., TensorFlow (41) from Google, PyTorch (36) from Facebook, MXNet (30) from Apache, and DL4J (5) from Eclipse, as subjects. All of them are built with the above architecture, but they are also diverse, e.g., involving different programming languages for implementations, different development organizations, and different types of computational graphs.

3. Methodology

3.1. Data Collection

In the study, we aim to investigate the characteristics of DL framework bugs, and thus we collected closed and merged pull requests responsible for fixing bugs from the corresponding GitHub repositories, following existing work (Islam et al., 2019; Garcia et al., 2020; Shen et al., 2021). On the one hand, bugs involved in these pull requests have been confirmed and fixed by developers; on the other hand, these pull requests tend to contain more comprehensive information, e.g., code changes, links to related issues, and discussions among developers, which is helpful for understanding the bugs. However, not all such pull requests fix bugs; some add new features or update documents. Hence, we further identified bug-fixing pull requests through keyword searching in the tags and titles of pull requests. Following existing work (Islam et al., 2019; Garcia et al., 2020; Shen et al., 2021), we adopted several bug-relevant keywords, including fix, defect, error, bug, issue, mistake, correct, fault, and flaw.
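The following minimal sketch illustrates this keyword-based filtering step; the pull-request records and their fields are hypothetical examples rather than the scripts actually used in the study.

```python
# Sketch of keyword-based identification of bug-fixing pull requests;
# the `pull_requests` records and their fields are made-up examples.
BUG_KEYWORDS = {"fix", "defect", "error", "bug", "issue",
                "mistake", "correct", "fault", "flaw"}

def is_bug_fixing(pr):
    """Return True if any bug-relevant keyword appears in the PR title or tags."""
    text = " ".join([pr["title"]] + pr.get("tags", [])).lower()
    return any(keyword in text for keyword in BUG_KEYWORDS)

pull_requests = [
    {"title": "Fix shape inference for Conv2D", "tags": ["bug"]},
    {"title": "Add a new optimizer", "tags": ["feature"]},
]
candidates = [pr for pr in pull_requests if is_bug_fixing(pr)]  # keeps only the first PR
```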

Since manually analyzing bugs is very time-consuming, it is unaffordable for us to analyze all the historical bugs of these DL frameworks. Therefore, to balance study scale and cost, we collected 200 bugs for each studied DL framework for manual inspection, which is, to the best of our knowledge, the largest scale in this area. Specifically, for each studied DL framework, we collected its bugs by manually analyzing bug-fixing pull requests in reverse chronological order starting from June 2021. During the collection process, we discarded pull requests that turned out to be irrelevant to bug fixing, and labeled the bugs following the process presented in Section 3.2. Once we had collected 200 bugs for a DL framework, we stopped the collection process for it. In total, we collected 800 bugs across the four DL frameworks by analyzing 954 bug-fixing pull requests.

3.2. Classification and Labeling Process

In the study, for each studied DL framework bug, we labeled its root cause, the symptom that the bug exhibits, and the component in which the bug occurs. In this work, we adopted the taxonomies of root causes and symptoms from existing work (Garcia et al., 2020; Islam et al., 2019; Thung et al., 2012a; Tan et al., 2014; Humbatova et al., 2020) as the initial taxonomies and then adapted them to DL framework bugs by adding DL-framework-specific categories and removing irrelevant categories via an open-coding scheme following existing studies (Islam et al., 2019; Shen et al., 2021). Regarding the components of DL frameworks, we have introduced them in Section 2.

During the labeling process, two authors labeled each bug individually. Following existing work (Islam et al., 2019; Shen et al., 2021), we adopted the Cohen's Kappa coefficient (Vieira et al., 2010) to measure the inter-rater agreement between them. After obtaining the first 5% of labeling results, the Cohen's Kappa coefficient was only about 40%, so the two authors discussed all the inconsistent results with the third author and were thereby further trained for better labeling. Then, after obtaining the first 10% of labeling results (including the first 5%), the Cohen's Kappa coefficient reached 80%. Through further discussion with the third author on inconsistencies, the Cohen's Kappa coefficients were over 95% in all the subsequent labeling rounds (i.e., labeling 20%–100% of bugs at intervals of 10%). For all the inconsistent results in each round, the two authors discussed them with the third author until all the bugs were labeled consistently. Please note that some pull requests fix more than one bug; we treated each of them as an individual bug following existing work (Garcia et al., 2020; Shen et al., 2021).
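As an illustration of the agreement measurement, the following sketch computes the Cohen's Kappa coefficient with scikit-learn; the two label sequences are made-up examples, not the study's actual labels.

```python
# Computing Cohen's Kappa for two raters' labels; the label sequences below
# are made-up examples rather than the study's real data.
from sklearn.metrics import cohen_kappa_score

rater_1 = ["Type Confusion", "API Misuse", "Misconfiguration", "API Misuse"]
rater_2 = ["Type Confusion", "API Misuse", "Misconfiguration", "Numerical Issue"]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's Kappa: {kappa:.2f}")
```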

4. Results and Analysis

4.1. RQ1: Root Causes

4.1.1. Root Cause Classification Results

Based on the above classification and labeling process, we identified the following 13 root causes of DL framework bugs. The first four root causes involve the characteristics of DL frameworks, while the others are common categories. In the study, we not only discuss the root causes highly relevant to DL frameworks, but also investigate whether the common root causes have different distributions and characteristics between DL frameworks and traditional software.

1⃝ Type Confusion. This kind of bugs involves type-related problems, such as type conversion and type checking. Besides traditional variables, tensors are widely used in the development of a DL framework, and thus we divide this root cause into two sub-categories. 1) Tensor type confusion, which refers to type-confusion bugs caused by the types of tensors. Specifically, a tensor is a multi-dimensional matrix consisting of elements with a single data type. 2) Traditional data type confusion, which refers to type-confusion bugs caused by the types of traditional variables, as in traditional software.

2⃝ Tensor Shape Misalignment. This kind of bugs is caused by tensor shape mismatches in shape-related operations, e.g., tensor shape inference and transformation. Specifically, the tensor shape describes the number of elements in each dimension. (A small illustration of these two tensor-related root causes, from a framework user's perspective, follows below.)
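The following PyTorch snippet illustrates the tensor properties these two root causes revolve around: implicit type promotion between tensor dtypes and shape checking in a shape-related operation. It is an illustration only, not a reproduction of any studied bug.

```python
# Illustrative triggers for the two tensor-related root causes above (PyTorch);
# how the framework handles these cases is exactly what such bugs concern.
import torch

# Tensor type confusion: mixing dtypes relies on the framework's implicit
# type-promotion logic, a frequent source of type-confusion bugs.
a = torch.ones(3, dtype=torch.int64)
b = torch.ones(3, dtype=torch.float16)
print((a + b).dtype)  # the promoted dtype is decided inside the framework

# Tensor shape misalignment: shape inference/checking must reject this call.
x = torch.randn(2, 3)
y = torch.randn(4, 5)
try:
    torch.matmul(x, y)  # inner dimensions (3 vs. 4) do not match
except RuntimeError as e:
    print("shape check raised:", e)
```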

3⃝ Incorrect Algorithm Implementation. This kind of bugs is caused by the problematic implementation logic of an algorithm, which tends to involve a number of statements or blocks. According to the functionality of the algorithm, we divide this root cause into two sub-categories. 1) Incorrect DL-specific algorithm implementation: there are a large number of algorithms with DL-specific functionalities in a DL framework (such as operation fusion and gradient computation algorithms), and this sub-category of bugs occurs due to the incorrect implementation logic of these DL-specific algorithms. 2) Incorrect DL-irrelevant algorithm implementation: this sub-category of bugs occurs due to the incorrect implementation logic of algorithms with DL-irrelevant functionalities (such as memory allocation algorithms).

4⃝ Environment Incompatibility. This kind of bugs is caused by neglecting some characteristics (e.g., the endianness of an architecture) of a specific environment (e.g., hardware or operating system). This root cause is common in DL frameworks since DL frameworks are required to work in various environments, such as CPUs and GPUs with various architectures.

5⃝ API Incompatibility. This kind of bugs contains two sub-categories. 1) Internal incompatibility refers to API compatibility issues within a DL framework caused by API evolution; 2) External incompatibility refers to API compatibility issues between a DL framework and third-party libraries (such as NumPy and HIPIFY) caused by updates of the latter.

6⃝ API Misuse. This kind of bugs is caused by misunderstanding of APIs and contains three sub-categories in DL framework bugs. 1) Condition missing/redundancy means that a condition check for an API is missing (or redundantly used); 2) API missing/redundancy means that an API call is missing (or redundant); 3) Wrong API means that a wrong API name, argument, or receiver is used. Taking an API call a.b(x,y) as an example, a is the receiver, b is the API name, and x and y are the arguments.

7⃝ Incorrect Assignment. This kind of bugs occurs when a variable is incorrectly assigned or lacks initialization.

8⃝ Incorrect Exception Handling. This root cause refers to incorrect handling of exceptions, which includes three scenarios: 1) Missing exception, i.e., a DL framework fails to throw an exception when it should; 2) Spurious exception, i.e., a DL framework throws an exception when it should not; 3) Wrong exception message, i.e., a DL framework produces an incorrect/imprecise exception message for an exception.

9⃝ Misconfiguration. This kind of bugs is caused by incorrect configurations in a DL framework, such as configurations in Bazel files and various shell configuration scripts.

⑩ Numerical Issue. This kind of bugs is caused by incorrect numerical computations, such as dividing by zero, using wrong operators or operands, and missing operands.

⑪ Concurrency Issue. This kind of bugs is caused by incorrect operations on concurrency-oriented structures (such as threads and shared memory), e.g., race conditions.

⑫ Dependent Module Issue. This kind of bugs is caused by failing to import dependent modules or importing wrong modules.

⑬ Others. Each bug in this category is unusual and cannot be assigned to any of the other root causes.

4.1.2. Root Cause Distribution

Figure 2. Bug Distribution by Root Causes

Figure 2 shows the bug distribution by the identified root causes. From this figure, the four root causes involving the characteristics of DL frameworks (i.e., Incorrect Algorithm Implementation, Type Confusion, Tensor Shape Misalignment, and Environment Incompatibility) are indeed common: all of them rank within the top 6 (among 13 root causes) and account for 50.5% of bugs in total. Among all these root causes, Incorrect Algorithm Implementation is the most common one. It accounts for 120 bugs in total, including 22 in TensorFlow, 31 in PyTorch, 33 in MXNet, and 34 in DL4J. The reason mainly lies in that deep learning is a fast-growing area and thus DL frameworks have to be frequently updated to incorporate the rapid advancement of DL algorithms. Moreover, hardware (especially DL-related hardware) is also developing rapidly, and thus DL frameworks are required to provide the corresponding implementations to support new hardware features. Regardless of whether they support advanced DL algorithms or new hardware features, the corresponding implementations in DL frameworks tend to involve complicated code logic, and thus they are very likely to incur various technical debts. Through further analysis, we found that about 81.67% of this kind of bugs (98 out of 120) occur in the implementations of DL-specific algorithms, significantly outnumbering the bugs in the implementations of DL-irrelevant algorithms.

Finding 1: Regarding the root causes involving DL framework characteristics, all of them are common, accounting for 50.5% of bugs in total. Among them, the most common root cause is Incorrect Algorithm Implementation (especially in DL-specific algorithms).

Type Confusion is the second most common root cause. It accounts for 119 bugs in total, including 34 in TensorFlow, 25 in PyTorch, 25 in MXNet, and 35 in DL4J. Through further investigation, about 69.75% of this kind of bugs (83 out of 119) are caused by tensor types rather than traditional data types. This is because all DL operations depend on tensors, and the tensor type is an important property of a tensor that is involved in various operations. In particular, type conversion, especially implicit type conversion, tends to incur Type Confusion bugs in DL frameworks, which deserves more attention in practice.

Finding 2: Type Confusion is the second most common root cause, which accounts for 14.88% of DL framework bugs and mainly occurs on tensor types.

Besides, there are root-cause categories common to DL frameworks and traditional software, and some of them are also notable. Apart from the four root causes involving DL framework characteristics, the remaining two root causes ranked within the top 6 are Misconfiguration and API Misuse. In particular, Misconfiguration is the third most common one among all the root causes, accounting for 111 bugs in total. This phenomenon differs from the existing studies on traditional software bugs, since Misconfiguration bugs either are ignored by the latter (Tan et al., 2014; Sun et al., 2016b) or account for only a small percentage of all the studied bugs (Ocariza et al., 2013; Thung et al., 2012b). For example, as shown in the existing study (Thung et al., 2012b), only 5.7% of bugs are caused by Misconfiguration in traditional machine learning systems, which ranks low among their identified 11 root causes. The reason why DL frameworks contain many Misconfiguration bugs may lie in that there are a large number of configuration files/options for compilation, installation, and ensuring compatibility of DL frameworks, due to their complex implementations involving multiple programming languages as well as the large number of dependent third-party libraries and hardware/software environments.

Framework    API M/R   Condition M/R   Wrong API (Receiver / Name / Args)   Wrong API Total
TensorFlow   4         3               2 / 5 / 7                            14
PyTorch      3         2               0 / 8 / 5                            13
MXNet        2         1               6 / 4 / 6                            16
DL4J         8         2               4 / 11 / 8                           23
Total        17        8               12 / 28 / 26                         66

  • M/R is short for Missing/Redundancy.

Table 1. Distribution of API Misuse Bugs

API Misuse is another common root cause of DL framework bugs. Indeed, this root cause is also common in traditional software systems, but it is unknown whether it manifests in the same way. To investigate this further, we divide this kind of bugs into three subcategories, as shown in Table 1, following existing work (Amann et al., 2018; Zhang et al., 2018a). From Table 1, 72.53% of API Misuse bugs (66 out of 91) are due to using wrong APIs. However, as demonstrated by the existing studies on MuBench (Amann et al., 2016, 2018) (one of the most widely studied benchmarks in the area of API misuse, including 90 API misuses from Java projects), API Missing/Redundancy is the most common subcategory there. That is, while API Misuse is a common root cause for both DL frameworks and traditional software, it manifests in different ways. The result indicates that DL framework developers may often confuse different API usage scenarios, especially among sets of similar APIs, calling for new API misuse detection methods that can clearly distinguish such APIs.

Finding 3: Regarding the categories common to DL frameworks and traditional software, Misconfiguration and API Misuse are the two most notable root causes, but these bugs in DL frameworks have different characteristics and distributions from those in traditional software, indicating the necessity of specially studying DL framework bugs.

We further analyzed the distribution of each kind of bugs over the components of DL frameworks, as shown in Table 2. Here, we excluded Misconfiguration bugs and bugs caused by external configuration files, since our five-level architecture of DL frameworks does not contain configuration files and almost all Misconfiguration bugs occur in configuration files. From this table, the number of bugs occurring in the Operation Implementation component (i.e., 202) is the largest, because it involves hundreds or even thousands of operation implementations that tend to contain complicated code logic, and the number in the User-Level API component (i.e., 161) is the second largest. In particular, the bugs caused by the four root causes involving DL framework characteristics are chiefly distributed in the Operation Implementation component, while the bugs of common root-cause categories are chiefly distributed in the User-Level API component. The results indicate that the User-Level API component is more similar to traditional software, while the Operation Implementation component is more specific to DL frameworks. Therefore, existing testing and debugging techniques are likely applicable to the former component, which may help ensure its quality to a large degree, while new testing and debugging techniques targeting DL framework characteristics are desirable for the latter component. Furthermore, regarding the Graph-Level Implementation component, Incorrect Algorithm Implementation and Tensor Shape Misalignment are the two major causes, since this component involves many DL-specific algorithms (such as various graph transformation algorithms) and takes tensors as the basic elements of computational graphs. In particular, 47.92% (69 out of 144) of the bugs in this component occur in Graph Transformation. As expected, Environment Incompatibility is the major cause of the bugs in the Environment-Dependent Processing component.

Finding 4: Different components of DL frameworks have different bug distribution characteristics, calling for different testing and debugging techniques. The component of Operation Implementation contains the most bugs and DL-specific bugs are mainly distributed in this component. The component of User-Level API takes second place, but traditional categories of bugs are mainly distributed in it.

Root Cause                           UA    GI    OI    GU    EP
Type Confusion                       28    18    33    35    5
Tensor Shape Misalignment            11    25    36    20    0
Incorrect Algorithm Implementation   28    35    40    14    2
Environment Incompatibility          7     14    17    6     29
API Incompatibility                  6     3     2     2     0
API Misuse                           32    18    21    16    4
Incorrect Assignment                 15    6     12    4     3
Incorrect Exception Handling         12    8     11    11    1
Numerical Issue                      3     3     13    6     0
Concurrency Issue                    5     7     3     4     0
Dependent Module Issue               6     3     3     0     0
Others                               8     4     11    4     3
Total                                161   144   202   122   47

  • UA: User-Level API; GI: Graph-Level Implementation; OI: Operation Implementation; GU: General Utility; EP: Environment-Dependent Processing

Table 2. Bug Distribution by Root Causes in Components

4.2. RQ2: Symptoms

4.2.1. Symptom Classification Results

We identified the following 6 symptoms of DL framework bugs.

1⃝ Crash. This symptom refers to a DL framework terminating unexpectedly during running, such as terminating with an error message like “out of memory” or “null pointer”.

2⃝ Incorrect Functionality. This symptom refers to a DL framework behaving incorrectly without crashing, such as producing unexpected prediction results, unexpected model structures, or incorrect intermediate states.

3⃝ Build Failure. This symptom refers to a DL framework failing to be installed.

4⃝ Poor Performance. This symptom refers to the time spent or resources consumed (such as memory) being much larger than expected during the usage of a DL framework.

5⃝ Hang. This symptom refers to a DL program written on top of a DL framework failing to terminate within a long period of time (due to a DL framework bug).

6⃝ Unreported. We cannot identify the symptoms of some bugs even after carefully reading the corresponding pull requests, including the related issues, discussions, and code changes.

Figure 3. Bug Distribution by Symptoms

4.2.2. Symptom Distribution

Figure 3 shows the bug distribution by the identified symptoms. We found that Crash is the most common symptom. The number of bugs exhibiting this symptom is 93, 83, 104, and 106 for TensorFlow, PyTorch, MXNet, and DL4J, respectively, and the total number is 386. The detection of this kind of bugs has an explicit test oracle, and thus automated test input generation (which tends to suffer from the test oracle problem, but not here) has great potential to facilitate the detection of this large percentage of Crash bugs. Also, Crash bugs come with error messages, which can provide hints about the bugs, and thus designing effective debugging techniques based on these informative messages would benefit such a large percentage of Crash bugs.

Finding 5: Crash is the most common symptom for DL framework bugs, which accounts for 48.25% of bugs.

Incorrect Functionality takes second place, accounting for 206 bugs in total, including 34 in TensorFlow, 66 in PyTorch, 47 in MXNet, and 59 in DL4J. The major challenge in detecting this kind of bugs lies in the test oracle problem. Specifically, a DL framework is used by developers to implement a DL program and then build a DL model, but it is difficult to determine whether the DL program/model behaves as expected due to its complexity. Through analyzing the large number of studied Incorrect Functionality bugs, we found that they were often detected by checking whether the prediction results of the built model, the model structure, or some intermediate states (e.g., the calculation results of some operations) are as expected. That is, deciding which information to observe and how to determine its expected value is important yet challenging for detecting this kind of bugs.

Finding 6: Incorrect Functionality is the second most common symptom for DL framework bugs, accounting for 25.75% of bugs. Defining effective test oracles deserves much more attention for the detection of this kind of bugs.

Symptom                   Install   Preprocess   Train   Deploy   Utility
Crash                     2         15           260     61       48
Incorrect Functionality   2         8            132     28       36
Build Failure             144       0            7       5        3
Poor Performance          1         1            14      2        2
Hang                      0         0            1       1        0
Unreported                0         1            16      8        2
Total                     149       25           430     105      91

Table 3. Bug Distribution by Symptoms in Each Stage.

Based on the symptoms, we then analyzed when these bugs can be observed. From the view of DL framework users, we classified the DL pipeline into five stages following existing work (Islam et al., 2019; Du et al., 2020; Hapke and Nelson, 2020): 1⃝ Installation: the stage of installing the DL framework to be used; 2⃝ Preprocessing: the stage of preprocessing the dataset used for model building; 3⃝ Training: the stage of training and validating a model; 4⃝ Deployment: the stage of deploying the built model to a device; 5⃝ Utility Operation: the stage of conducting auxiliary operations, e.g., model visualization. Table 3 shows the bug distribution according to the stage in which the bugs with each symptom were observed. From the view of DL framework users, DL framework bugs are mainly observed at the Training stage (53.75%). The Training stage tends to be time-consuming due to heavy numerical computation over a large amount of training data. As presented in the existing study (Yan et al., 2021b), typical training time ranges from a few minutes to several days. Hence, the bugs observed at this stage, especially the Incorrect Functionality bugs (accounting for 30.7% of the bugs observed at this stage), may only manifest hours or even days into the training process. This is very harmful to the efficiency of both testing and debugging. In particular, the training process for exposing a bug has to be repeated several times to validate whether a fix is correct. Also, the training process involves randomness, aggravating the difficulty of bug reproduction during debugging. Hence, the large percentage of bugs observed at the Training stage suggests an urgent need to speed up the process of exposing bugs at this stage.

Finding 7: 53.75% of DL framework bugs are observed at the Training stage. This could lead to lengthy testing and debugging, especially for the large number of Incorrect Functionality bugs without halfway crashes, due to the costly and non-deterministic training process.

4.3. RQ3: Relationship between Root Causes and Symptoms

Table 4 presents the number of bugs from each root cause exhibiting each symptom. Here, Crash and Incorrect Functionality are the most common symptoms for all the root causes (except Misconfiguration for both, and Environment Incompatibility, API Incompatibility, and Dependent Module Issue for the latter). The result indicates that designing effective test oracles targeting these two symptoms helps detect a wide variety of DL framework bugs. As presented before, Crash has an explicit test oracle, while the test oracle problem is the major challenge for detecting Incorrect Functionality bugs. Currently, differential testing has been adopted as the test oracle for the latter (Pham et al., 2019; Wang et al., 2020b), but it could lead to false positives and false negatives due to the randomness in DL (which differs from traditional software). Hence, more precise test oracles are still desirable.

Finding 8: Crash and Incorrect Functionality can be exhibited by various root causes of DL framework bugs.

Regarding the symptoms of Build Failure and Poor Performance, they are mostly produced by a few specific root causes. Specifically, among the 159 bugs exhibiting the symptom of Build Failure, 66.67% are produced by Misconfiguration and 13.84% by Environment Incompatibility. Among the 20 bugs exhibiting the symptom of Poor Performance, 60% are produced by Incorrect Algorithm Implementation or API Misuse. Therefore, when a bug exhibits one of these two symptoms, developers can first check these highly relevant root causes to speed up the debugging process.

Finding 9: The symptom of Build Failure is highly relevant to the root causes of Misconfiguration and Environment Incompatibility, while the symptom of Poor Performance is highly relevant to the root causes of Incorrect Algorithm Implementation and API Misuse.

Root Cause                           Crash   IF    BF    PP   Hang   Unreported
Type Confusion                       75      31    3     3    1      6
Tensor Shape Misalignment            58      32    0     0    0      2
Incorrect Algorithm Implementation   68      35    3     6    0      8
Environment Incompatibility          40      7     22    1    1      2
API Incompatibility                  11      2     6     1    0      0
API Misuse                           44      29    5     6    0      7
Incorrect Assignment                 20      19    1     0    0      0
Incorrect Exception Handling         25      18    0     0    0      0
Misconfiguration                     3       2     106   0    0      0
Numerical Issue                      6       18    1     0    0      0
Concurrency Issue                    12      4     0     2    0      1
Dependent Module Issue               10      0     5     0    0      0
Others                               14      9     7     1    0      1
Total                                386     206   159   20   2      27

  • IF: Incorrect Functionality; BF: Build Failure; PP: Poor Performance

Table 4. Bug Distribution by Root Causes for Each Symptom

4.4. RQ4: Bug Commonality

We calculated the Spearman correlation between each pair of DL frameworks in terms of root cause distribution and symptom distribution, respectively, in order to measure the bug commonality across different DL frameworks. Spearman's correlation coefficient is a statistical measure of the strength of a monotonic relationship between two paired variables (Zar, 2005). Figure 4 shows the correlation results. From this figure, all the correlation coefficients are larger than 0.7 in terms of root cause distribution and larger than 0.94 in terms of symptom distribution, indicating strong to very strong correlations. The results show that, regardless of root causes or symptoms, the four DL frameworks share a high degree of commonality, demonstrating the generality of our findings and the potential of developing general testing and debugging techniques for various DL frameworks.
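As an illustration of this measurement, the following sketch computes the Spearman correlation between the root-cause distributions of two frameworks with SciPy; the two count vectors are made-up placeholders, not the paper's actual numbers.

```python
# Computing the Spearman correlation between two frameworks' root-cause
# distributions; the 13-element count vectors below are made-up placeholders.
from scipy.stats import spearmanr

framework_a = [30, 28, 25, 18, 15, 12, 10, 9, 7, 6, 5, 3, 2]  # bugs per root cause
framework_b = [27, 30, 22, 20, 12, 14, 9, 10, 8, 5, 6, 2, 3]  # same root-cause order

rho, p_value = spearmanr(framework_a, framework_b)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```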

Figure 4. Spearman Correlation across DL Frameworks

Finding 10: There is significant commonality among the four DL frameworks in both root causes and symptoms.

4.5. RQ5: Status of Existing Testing Techniques

To investigate the current status of existing testing techniques, we analyzed them in terms of test coverage on each DL-framework component. Here, we studied three DL framework testing techniques, i.e., CRADLE (Pham et al., 2019), LEMON (Wang et al., 2020b), and Audee (Guo et al., 2020). All of them adopt differential testing as the test oracle. Their main difference lies in the used test inputs: CRADLE is the earliest of these techniques and takes real-world pre-trained DL models as test inputs; LEMON and Audee adopt different search-based mutation strategies to generate mutated models from pre-trained models as test inputs, where the former mutates the layers, neurons, and weights of pre-trained models and the latter mutates the parameters of layers, weights, and inputs (e.g., images).

Here, we used 8 pre-trained models widely used in existing studies (Wang et al., 2020b; Guo et al., 2020), involving different model structures and different sets of input data: LeNet-5 trained on MNIST, LeNet-5 trained on Fashion-MNIST, AlexNet trained on CIFAR-10, MobileNetV2, ResNet-50, and VGG-16 trained on ImageNet, and two LSTM models trained on Sinewave and Price. CRADLE uses the 8 models as test inputs directly, while LEMON and Audee each produced 100 mutated models based on each pre-trained model, and the latter also produced mutated input data for each mutated model. In total, there are 800 mutated models as test inputs for LEMON and Audee, respectively. We measured the test coverage (i.e., line, branch, and function coverage) achieved by the three techniques via Gcov (for C code coverage collection) (10) and Coverage.py (for Python code coverage collection) (4). Since collecting DL framework coverage is costly and different DL frameworks share significant bug commonality as presented in Section 4.4, we used MXNet as the representative in this experiment. In particular, we also ran the test suite shipped with MXNet (version 1.9.0) and collected its coverage to facilitate the analysis.
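For the Python side, the following sketch shows one way such coverage could be collected with Coverage.py's API; run_technique() is a hypothetical stand-in for executing a testing technique on the prepared models, and the C side would be handled separately by Gcov.

```python
# Sketch of collecting Python-side coverage while a testing technique runs;
# run_technique() is a hypothetical placeholder for driving CRADLE/LEMON/Audee.
import coverage

def run_technique():
    # Hypothetical placeholder: import the framework and run the prepared
    # models through it so that framework code is executed.
    import mxnet  # noqa: F401  (assumes MXNet is installed)

cov = coverage.Coverage(source=["mxnet"])  # restrict measurement to MXNet's Python code
cov.start()
run_technique()
cov.stop()
cov.save()
cov.report()  # prints a line-coverage summary per measured file
```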

We first measured the coverage achieved by the three testing techniques together and compared it with the coverage achieved by the equipped test suite; the results are shown in Table 5. We found that the line, branch, and function coverage achieved by these testing techniques are only 24.19%, 4.58%, and 23.38%, respectively, which is significantly lower than that achieved by the equipped test suite (i.e., 70.51%, 11.88%, and 38.99%). That is, the current testing techniques suffer from low test coverage. This is very harmful to testing performance, since test coverage is the first condition of bug detection according to the PIE theory (Voas, 1992). Hence, improving test coverage is an important direction for designing new DL framework testing techniques. Also, from Table 5, the current testing techniques achieve relatively high test coverage on the User-Level API and Graph-Level Implementation components (especially function coverage on the User-Level API component), but low test coverage on the remaining three components. The results suggest that focusing on the remaining three components could be more helpful for improving test coverage.

Finding 11: The current DL framework testing techniques suffer from the low test coverage issue, especially on the components of Operation Implementation, General Utility, and Environment-Dependent Processing.

Technique             Coverage   UA       GI       OI       GU       EP       Overall
Test Suite            line       72.78%   68.75%   72.56%   65.58%   39.42%   70.51%
                      branch     59.07%   29.72%   10.57%   17.11%   19.11%   11.88%
                      function   93.87%   62.23%   34.47%   51.54%   46.09%   38.99%
CRADLE+LEMON+Audee    line       30.22%   38.37%   20.29%   17.79%   13.00%   24.19%
                      branch     8.65%    18.00%   4.05%    3.81%    7.57%    4.58%
                      function   91.51%   37.54%   18.65%   20.56%   15.33%   23.38%

  • UA: User-Level API; GI: Graph-Level Implementation; OI: Operation Implementation; GU: General Utility; EP: Environment-Dependent Processing

Table 5. Coverage of Testing Techniques and Test Suite

We then compared the test coverage achieved by each of the three testing techniques. Figure 5 shows Venn diagrams of the overlaps among their covered lines, branches, and functions. We found that the number of unique lines, branches, and functions covered by each technique is small, especially compared with those covered by all of them. The results indicate that these techniques have significant commonality in terms of test coverage. In particular, both LEMON and Audee build on CRADLE, and according to Figure 5 their achieved test coverage mainly depends on the used pre-trained models, while the coverage increments achieved by the models mutated by LEMON and Audee are small. This suggests that using more pre-trained models could help improve the test coverage of the current testing techniques, and meanwhile it is necessary to design new techniques that differ substantially from the existing ones.

Finding 12: The current DL framework testing techniques share significant commonality in terms of test coverage and their coverage mainly depends on the used pre-trained models rather than mutated models by LEMON and Audee.

Figure 5. Number of Unique Covered Elements

4.6. RQ6: Status of Debugging Practice

Since there are no automated debugging techniques specific to DL framework bugs, manual debugging is still the mainstream method. Hence, we analyzed developers' fixing effort for each kind of bug in terms of both fixing time and patch size to facilitate the understanding of the current debugging practice. Specifically, we selected all the bugs with related issue reports from our collected dataset, obtaining 299 bugs in total. Following existing work (Sun et al., 2016a), for each issue report we extracted its opening time and the closing time of the corresponding pull request that fixes the issue, and regarded the interval between them as an approximation of the bug-fixing time. Also, we measured the patch size (in terms of the number of modified lines and functions in a patch) of each bug to complement the estimated bug-fixing time.
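The following sketch shows how such an approximation could be computed from GitHub metadata; the repository path and the issue/pull-request numbers are placeholders, and a real run would additionally need authentication to avoid rate limits.

```python
# Sketch of approximating bug-fixing time from GitHub metadata; the repository
# and the issue/PR numbers below are placeholders, not real study data.
from datetime import datetime
import requests

API = "https://api.github.com/repos/apache/incubator-mxnet"  # placeholder repo

def timestamp(url, field):
    data = requests.get(url).json()
    return datetime.strptime(data[field], "%Y-%m-%dT%H:%M:%SZ")

opened = timestamp(f"{API}/issues/12345", "created_at")  # issue opening time (placeholder number)
closed = timestamp(f"{API}/pulls/12400", "closed_at")    # fixing PR close time (placeholder number)

fixing_days = (closed - opened).days  # approximation of bug-fixing time
```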

Table 6 shows the results of the patch size and the estimated bug-fixing time, where "Mean", "Med", and "Std" refer to the mean, median, and standard deviation, respectively. From this table, the mean and median numbers of lines in a patch are 68.3 and 13, and those of functions are 4 and 1. These bug-fixing efforts are even larger than those on the large-scale and complex C compiler GCC, whose mean and median numbers of lines in a patch are 43 and 10 and those of functions are 2.7 and 1, as presented in the existing study (Sun et al., 2016b). The results indicate that the bug-fixing efforts for DL framework bugs are indeed significant, calling for effective automated debugging techniques.

Finding 13: The bug-fixing efforts of DL framework developers are significant in terms of the patch size and estimated bug-fixing time.

From Table 6, regarding the patch size in terms of modified lines and functions, fixing DL framework bugs due to DL-involving root causes is more costly than fixing those due to common root causes. The reason may be that the cause-effect chain of the former is longer than that of the latter. Specifically, the former tends to involve the propagation of tensors due to the inherent characteristics of deep learning, while the latter usually focuses on a local code region, such as dividing by zero or using a wrong API. In particular, the bugs caused by Incorrect Algorithm Implementation and Environment Incompatibility require the most significant bug-fixing efforts, since they either involve very complicated code logic or require understanding various complex hardware features.

Finding 14: The bug-fixing efforts spent on bugs caused by DL-involving root causes, especially Incorrect Algorithm Implementation and Environment Incompatibility, are more significant than those spent on bugs caused by common root causes. This suggests that effective automated debugging techniques specific to the former are in more urgent need.

Root Cause                           #Line (Mean / Med / Std)   #Function (Mean / Med / Std)   Fixing Time in days (Mean / Med / Std)
Type Confusion                       42.8 / 13.0 / 73.2         4.6 / 2.0 / 8.0                58.4 / 15.0 / 120.5
Tensor Shape Misalignment            44.4 / 11.5 / 94.5         3.6 / 2.0 / 6.1                22.0 / 6.0 / 49.0
Incorrect Algorithm Implementation   179.8 / 66.5 / 374.5       9.8 / 3.0 / 16.1               33.2 / 16.0 / 49.4
Environment Incompatibility          43.5 / 15.0 / 76.0         2.6 / 1.0 / 3.4                253.4 / 28.0 / 566.9
API Incompatibility                  152.3 / 8.5 / 465.4        2.7 / 1.0 / 4.3                –
API Misuse                           42.6 / 6.0 / 197.0         3.4 / 1.0 / 7.3                22.2 / 9.5 / 28.7
Incorrect Assignment                 17.0 / 5.5 / 30.2          2.6 / 1.0 / 6.4                34.8 / 1.0 / 84.9
Incorrect Exception Handling         26.4 / 6.0 / 78.2          4.3 / 1.0 / 14.5               28.9 / 6.0 / 44.2
Misconfiguration                     58.7 / 11.0 / 156.3        0.2 / 0.0 / 0.4                25.3 / 6.0 / 45.4
Numerical Issue                      38.0 / 13.0 / 89.6         3.0 / 2.0 / 2.9                –
Concurrency Issue                    44.5 / 15.0 / 62.2         4.4 / 2.0 / 6.9                –
Dependent Module Issue               18.1 / 7.0 / 27.1          0.9 / 0.5 / 1.1                –
Others                               28.6 / 7.5 / 44.5          3.6 / 2.0 / 4.2                –
Total                                68.3 / 13.0 / 235.4        4.0 / 1.0 / 9.2                56.2 / 10.0 / 193.6

  • “–” indicates that we did not report the bug-fixing time for the root causes for which the number of bugs with related issue reports is smaller than 15, which lacks statistical significance.

Table 6. Statistics of Bug-Fixing Efforts

5. Discussion

5.1. Implications

5.1.1. Implications on Bug Detection

New mutation operators. Based on Findings 1 and 2, DL-involving root causes result in a large percentage of bugs, and thus defining new mutation operators specific to their characteristics helps efficiently explore whether DL frameworks handle various cases involving them correctly. New mutation operators can include: 1) type mutation: many Type Confusion bugs are caused by type conversion, especially implicit type conversion, so we can add typecasts for tensors such that implicit type conversion may be triggered in tensor computation (a sketch of such an operator follows this paragraph); 2) shape mutation: we can create scenarios in which various tensor shapes are involved, and check whether they match, by inserting new layers with diverse shapes into different contexts; 3) environment mutation: we can put a DL program into various environments for model building, to test whether the used DL framework can stably support the training process.
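The following is a minimal sketch of a type-mutation operator under the assumption that models are PyTorch nn.Sequential modules; CastWrapper and type_mutation are our illustrative names, not an implementation from the paper or from any existing tool.

```python
# Minimal sketch of a type-mutation operator over PyTorch modules: casting an
# intermediate tensor so that the framework's dtype handling is exercised
# downstream. Illustration of the guideline only, not the paper's tooling.
import random
import torch
import torch.nn as nn

class CastWrapper(nn.Module):
    """Wrap a layer and cast its output to a different dtype."""
    def __init__(self, layer, dtype):
        super().__init__()
        self.layer, self.dtype = layer, dtype

    def forward(self, x):
        return self.layer(x).to(self.dtype)

def type_mutation(model: nn.Sequential) -> nn.Sequential:
    """Insert a downcast after a randomly chosen layer of a sequential model."""
    layers = list(model)
    idx = random.randrange(len(layers))
    layers[idx] = CastWrapper(layers[idx], torch.float16)
    return nn.Sequential(*layers)

mutant = type_mutation(nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2)))
try:
    mutant(torch.randn(4, 8))
except RuntimeError as e:
    # A clear, precise error is acceptable behavior; a silent wrong result or a
    # crash deep inside the framework would hint at a type-handling bug.
    print("dtype handling rejected the mutated model:", e)
```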

Test oracle improvement. Based on Findings 5, 6, and 8, Crash and Incorrect Functionality are the two most common symptoms of DL framework bugs, and thus designing effective test oracles for them can cover a large percentage of bugs. Regarding Crash, it has an explicit test oracle with error messages, but we still found many bug reporters complaining that the error messages are ambiguous, which could hamper the follow-up debugging process. For example, among the 43 bugs caused by Incorrect Exception Handling, 58.14% are due to wrong exception messages. Hence, it is necessary for developers to refine error messages to make them precise and informative. Regarding Incorrect Functionality, although differential testing over multiple DL frameworks has been adopted in the existing DL framework testing techniques by pre-defining a threshold for determining an inconsistency, it still cannot precisely identify Incorrect Functionality bugs due to the inherent non-determinism in DL. To reduce false positives and false negatives, a voting mechanism can be incorporated by integrating several test oracles, including differential testing across multiple versions of one DL framework as well as multiple environments, and metamorphic testing by constructing a group of equivalent tests (a sketch of such a voting oracle follows this paragraph). Although integrating various test oracles may relieve the test oracle problem to some degree, new test oracles specific to such non-determinism definitely deserve more attention from the software engineering community.
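The following sketch shows one possible shape of such a voting oracle; predict_fns is a hypothetical list of functions that run the same model on the same input under different frameworks, versions, or environments, and the tolerance value is an arbitrary placeholder.

```python
# Sketch of a voting-based oracle over several prediction backends; the list of
# backends and the tolerance are hypothetical placeholders.
import numpy as np

def voting_oracle(predict_fns, x, atol=1e-3):
    """Flag backends whose output deviates from the element-wise median vote."""
    outputs = [np.asarray(fn(x)) for fn in predict_fns]
    consensus = np.median(outputs, axis=0)  # majority-style reference output
    suspicious = [i for i, out in enumerate(outputs)
                  if not np.allclose(out, consensus, atol=atol)]
    return suspicious  # indices of backends that disagree with the vote
```

Using the median of several backends as the reference, rather than a single pairwise comparison, is one way to make the pre-defined inconsistency threshold less sensitive to the non-determinism of any individual run.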

Component-targeted testing. In general, it is challenging to design a general testing technique that can effectively detect bugs across all components, as demonstrated by Findings 11 and 12 to some degree (i.e., all these general testing techniques suffer from low test coverage, especially on some components). Hence, conducting component-targeted testing could be more practical. According to Findings 4 and 11, the Operation Implementation component should be assigned the highest priority for designing targeted testing techniques, because this component involves the largest number of bugs but receives little test coverage, regardless of whether existing testing techniques or the equipped test suite are used. To achieve targeted testing for the Operation Implementation component, it is better to construct tests at the computational-graph level, since this can more directly invoke and exercise various operations to achieve high coverage, compared with the model level widely used by the existing testing techniques (a sketch of directly exercising operations follows this paragraph).
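As a rough illustration of exercising operation implementations directly rather than through full models, the following sketch fuzzes a single PyTorch operation with randomized tensors; the operation choice and input ranges are arbitrary and only indicate the general idea.

```python
# Sketch of component-targeted testing at the operation level: invoke an
# operation implementation directly with randomized tensors instead of going
# through a full model. Illustration of the guideline only.
import torch
import torch.nn.functional as F

def fuzz_conv2d(trials=100):
    for _ in range(trials):
        n, c, h, w = torch.randint(1, 5, (4,)).tolist()  # random input shape
        k = torch.randint(1, 4, (1,)).item()             # random kernel size
        x = torch.randn(n, c, h, w)
        weight = torch.randn(8, c, k, k)
        try:
            F.conv2d(x, weight, padding=1)  # exercise the operation directly
        except RuntimeError as e:
            print("conv2d rejected input", (n, c, h, w, k), ":", e)

fuzz_conv2d()
```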

5.1.2. Implications on Debugging

Efficient reproduction. Based on Finding 7, over 53.75% of bugs occur during the training process. Since the training process involves heavy numerical computation over a large amount of training data as well as inherent non-determinism, bug reproduction is unstable and time-consuming, leading to a costly debugging process (as demonstrated to some degree by Findings 13 and 14). Therefore, efficient bug reproduction deserves much more attention. One promising direction is to shorten the training process by simplifying the model structure and reducing the amount of training data, such that the bug is still triggered but with higher efficiency. Adapting the idea of delta debugging (Zeller and Hildebrandt, 2002) to both the model and the training data may be effective for achieving this goal.
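
The sketch below illustrates a simplified, delta-debugging-style reduction of the training data; triggers_bug is a hypothetical predicate that re-runs the shortened training script and reports whether the failure still occurs, and the same loop could equally be applied to the layers of the model.

    def reduce_training_data(dataset, triggers_bug):
        # Greedy binary reduction inspired by delta debugging: repeatedly keep
        # only one half of the training data as long as the bug still reproduces.
        data = list(dataset)
        while len(data) > 1:
            first, second = data[:len(data) // 2], data[len(data) // 2:]
            if triggers_bug(first):
                data = first
            elif triggers_bug(second):
                data = second
            else:
                break  # neither half alone reproduces the bug; stop reducing
        return data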

Build failure fixing. Based on Findings 3 and 9, Build Failure is common and has two highly relevant root causes, i.e., Misconfiguration and Environment Incompatibility. These root causes provide hints for fixing build failures, making the design of automated methods feasible. Indeed, some automated build-failure fixing methods have been proposed for traditional software (Hassan and Wang, 2018; Lou et al., 2019), but they tend to target the Gradle build framework (12), which differs from the build systems that DL frameworks depend on. Moreover, as shown in Finding 3, Misconfiguration bugs in DL frameworks have different characteristics from those in traditional software. Hence, it is not only valuable to investigate whether the existing methods still work on DL framework build failures, but also necessary to design new methods specific to the characteristics of DL frameworks.
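
As a rough illustration of why such hints make automation feasible, the sketch below maps common build-log patterns to candidate fixes; the patterns and suggestions are illustrative assumptions rather than rules mined from our data.

    import re

    # Illustrative rules linking Misconfiguration / Environment Incompatibility
    # symptoms in build logs to candidate fixes; a real tool would mine such
    # rules from historical bug-fixing pull requests.
    RULES = [
        (re.compile(r"unsupported compute capability", re.I),
         "pin CUDA/cuDNN versions compatible with the target GPU"),
        (re.compile(r"No module named '([\w.]+)'"),
         "declare the missing Python dependency in the build configuration"),
    ]

    def suggest_fixes(build_log):
        return [hint for pattern, hint in RULES if pattern.search(build_log)]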

5.2. Threats to Validity

The external threat to validity lies in the data we used. We systematically collected 800 bugs from four DL frameworks as our study data by collecting closed and merged pull requests, identifying bug-fixing pull requests via keyword searching, and conducting manual investigation following existing work (Shen et al., 2021; Zhang et al., 2018b; Islam et al., 2019). Thus, we have high confidence in the quality of our data. As the largest-scale study on DL framework bugs to date, our study offers a reasonable degree of generalizability.

The internal threat to validity lies in our manual labeling process. To mitigate the inaccuracy and subjectivity of individual developers, two authors with over four years of development experience conducted the labeling process independently, and we used the Cohen’s Kappa coefficient to measure the inter-rater agreement between them; a coefficient of 95% was reached, indicating high agreement. Besides, we involved a third senior developer to discuss discrepancies and thereby further improve the reliability of the labeling results.
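
For reference, inter-rater agreement can be computed as in the minimal sketch below, assuming scikit-learn is available; the labels shown are illustrative examples, not our actual data.

    from sklearn.metrics import cohen_kappa_score

    # Root-cause labels assigned independently by the two authors to the same
    # bugs (illustrative values only).
    labels_author1 = ["Type Confusion", "Misconfiguration",
                      "Environment Incompatibility", "Type Confusion"]
    labels_author2 = ["Type Confusion", "Misconfiguration",
                      "Type Confusion", "Type Confusion"]

    kappa = cohen_kappa_score(labels_author1, labels_author2)
    print(f"Cohen's kappa: {kappa:.2f}")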

6. Related Work

The most related work to ours is the empirical study on TensorFlow bugs (Jia et al., 2020, 2021). It is the only prior study investigating DL framework bugs, but it is not sufficient for comprehensively understanding bugs across the family of DL frameworks: 1) It investigates bugs in only one DL framework (i.e., TensorFlow), whereas our study analyzed 800 bugs from four popular and diverse DL frameworks, which makes our study more large-scale and general (e.g., yielding more general root-cause and symptom taxonomies) and facilitates understanding bugs across different DL frameworks. 2) It directly uses the folders organizing the TensorFlow code as its component categories, which cannot be generalized to other DL frameworks, whereas our work proposes a general top-down five-level architecture for DL frameworks and analyzes the root-cause distribution on each component, facilitating a more fine-grained understanding of DL framework bugs. 3) Our study covers more study points, including studying bugs from individual aspects (e.g., root causes) as well as associating different aspects for comprehensive analysis (e.g., associating root causes with DL framework components). In particular, our study further associates the identified bug characteristics with existing testing and debugging practice for DL framework bugs, in order to dissect the current status of testing and debugging DL frameworks and to guide their improvement. Therefore, we believe that our study makes significant, novel contributions to comprehensively understanding DL framework bugs and further ensuring DL frameworks’ quality.

There are also several studies investigating DL program bugs (Zhang et al., 2018b; Islam et al., 2019; Humbatova et al., 2020; Xie et al., 2019; Ma et al., 2018b; Tian et al., 2018). As explained in Sections 1 and 2, DL program bugs stem from incorrect usage of DL frameworks rather than from bugs in DL framework code; the latter is the target of our work. Furthermore, Garcia et al. (Garcia et al., 2020) studied the bugs of autonomous vehicles, which are a kind of DL-based application at the production level. Shen et al. (Shen et al., 2021) conducted an empirical study on DL compilers (e.g., TVM). Nejadgholi and Yang (Nejadgholi and Yang, 2019) studied the oracle approximation assertions implemented in DL libraries. Different from them, our work conducted a comprehensive study on DL framework bugs by investigating 800 bugs from four DL frameworks.

Besides the studies investigating DL bugs, there are also many studies in the literature focusing on traditional software bugs (Lu et al., 2008; Di Franco et al., 2017; Han and Yu, 2016; Ocariza et al., 2013; Wang et al., 2020a; Thung et al., 2012b; Hoang et al., 2019). For example, Ocariza et al. (Ocariza et al., 2013) conducted a study on client-side JavaScript bugs, and Lu et al. (Lu et al., 2008) investigated the characteristics of concurrency bugs. Different from them, our work targets DL framework bugs: it not only investigates the bug characteristics specific to DL frameworks, but also analyzes the differences in common bug characteristics (such as shared root causes) between DL frameworks and traditional software.

7. Conclusion

In this work, we conducted the largest-scale study to date on the characteristics (e.g., root causes, symptoms, and their correlations with DL framework components) of DL framework bugs, manually analyzing 800 bugs from four popular DL frameworks and studying the current status of existing DL framework testing and debugging practice associated with those bugs. Through this comprehensive study, we summarized 14 major findings and provided a series of actionable guidelines for future work on the detection and debugging of DL framework bugs.

References

  • S. Amann, S. Nadi, H. A. Nguyen, T. N. Nguyen, and M. Mezini (2016) MUBench: a benchmark for api-misuse detectors. In Proceedings of the 13th International Conference on Mining Software Repositories, pp. 464–467. Cited by: §4.1.2.
  • S. Amann, H. A. Nguyen, S. Nadi, T. N. Nguyen, and M. Mezini (2018) A systematic evaluation of static api-misuse detectors. IEEE Transactions on Software Engineering 45 (12), pp. 1170–1188. Cited by: §4.1.2.
  • C. Chen, A. Seff, A. Kornhauser, and J. Xiao (2015) Deepdriving: learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2722–2730. Cited by: §1.
  • [4] (Accessed: 2021) Coverage.py. Note: https://coverage.readthedocs.io/ Cited by: §4.5.
  • [5] (Accessed: 2021) Deeplearning4J. Note: https://deeplearning4j.org/ Cited by: §1, §2.
  • A. Di Franco, H. Guo, and C. Rubio-González (2017) A comprehensive study of real-world numerical bug characteristics. In Proceedings of 32nd IEEE/ACM International Conference on Automated Software Engineering, pp. 509–519. Cited by: §6.
  • M. Du, F. Yang, N. Zou, and X. Hu (2020) Fairness in deep learning: a computational perspective. IEEE Intelligent Systems. Cited by: §4.2.2.
  • F. Ferreira, L. Lourdes Silva, and M. Tulio Valente (2019) Software engineering meets deep learning: a literature review. arXiv e-prints, pp. arXiv–1909. Cited by: §1.
  • J. Garcia, Y. Feng, J. Shen, S. Almanee, Y. Xia, and Q. A. Chen (2020) A comprehensive study of autonomous vehicle bugs. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, pp. 385–396. Cited by: §3.1, §3.2, §3.2, §6.
  • [10] (Accessed: 2021) Gcov. Note: https://gcc.gnu.org/onlinedocs/gcc/Gcov.html Cited by: §4.5.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations, Cited by: §1.
  • [12] (Accessed: 2021) Gradle. Note: https://gradle.org/ Cited by: §5.1.2.
  • Q. Guo, X. Xie, Y. Li, X. Zhang, Y. Liu, X. Li, and C. Shen (2020) Audee: automated testing for deep learning frameworks. In 2020 35th IEEE/ACM International Conference on Automated Software Engineering, pp. 486–498. Cited by: §1, §4.5, §4.5.
  • X. Han and T. Yu (2016) An empirical study on performance bugs for highly configurable software systems. In Proceedings of the 10th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, pp. 23:1–23:10. Cited by: §6.
  • H. Hapke and C. Nelson (2020) Building machine learning pipelines. O’Reilly Media. Cited by: §4.2.2.
  • F. Hassan and X. Wang (2018) Hirebuild: an automatic approach to history-driven repair of build scripts. In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE), pp. 1078–1089. Cited by: §5.1.2.
  • T. Hoang, H. K. Dam, Y. Kamei, D. Lo, and N. Ubayashi (2019) DeepJIT: an end-to-end deep learning framework for just-in-time defect prediction. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), pp. 34–45. Cited by: §6.
  • N. Humbatova, G. Jahangirova, G. Bavota, V. Riccio, A. Stocco, and P. Tonella (2020) Taxonomy of real faults in deep learning systems. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, pp. 1110–1121. Cited by: §1, §1, §3.2, §6.
  • M. J. Islam, G. Nguyen, R. Pan, and H. Rajan (2019) A comprehensive study on deep learning bug characteristics. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 510–520. Cited by: §1, §1, §3.1, §3.2, §3.2, §4.2.2, §5.2, §6.
  • L. Jia, H. Zhong, X. Wang, L. Huang, and X. Lu (2020) An empirical study on bugs inside tensorflow. In International Conference on Database Systems for Advanced Applications, pp. 604–620. Cited by: §1, §6.
  • L. Jia, H. Zhong, X. Wang, L. Huang, and X. Lu (2021) The symptoms, causes, and repairs of bugs inside a deep learning library. Journal of Systems and Software 177, pp. 110935. Cited by: §6.
  • K. D. Julian, J. Lopez, J. S. Brush, M. P. Owen, and M. J. Kochenderfer (2016) Policy compression for aircraft collision avoidance systems. In 2016 IEEE/AIAA 35th Digital Avionics Systems Conference, pp. 1–10. Cited by: §1.
  • J. Kim, R. Feldt, and S. Yoo (2019) Guiding deep learning system testing using surprise adequacy. In Proceedings of the 41st International Conference on Software Engineering, pp. 1039–1049. Cited by: §1.
  • A. Kurakin, I. J. Goodfellow, and S. Bengio (2017) Adversarial examples in the physical world. In 5th International Conference on Learning Representations, Cited by: §1.
  • Y. Lou, J. Chen, L. Zhang, D. Hao, and L. Zhang (2019) History-driven build failure fixing: how far are we?. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 43–54. Cited by: §5.1.2.
  • S. Lu, S. Park, E. Seo, and Y. Zhou (2008) Learning from mistakes: a comprehensive study on real world concurrency bug characteristics. In Proceedings of the 13th international conference on Architectural support for programming languages and operating systems, pp. 329–339. Cited by: §6.
  • L. Ma, F. Juefei-Xu, F. Zhang, J. Sun, M. Xue, B. Li, C. Chen, T. Su, L. Li, Y. Liu, J. Zhao, and Y. Wang (2018a) DeepGauge: multi-granularity testing criteria for deep learning systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, pp. 120–131. Cited by: §1.
  • L. Ma, F. Zhang, J. Sun, M. Xue, B. Li, F. Juefei-Xu, C. Xie, L. Li, Y. Liu, J. Zhao, et al. (2018b) Deepmutation: mutation testing of deep learning systems. In 2018 IEEE 29th International Symposium on Software Reliability Engineering (ISSRE), pp. 100–111. Cited by: §6.
  • L. Ma, F. Zhang, M. Xue, B. Li, Y. Liu, J. Zhao, and Y. Wang (2018c) Combinatorial testing for deep learning systems. External Links: 1806.07723 Cited by: §1.
  • [30] (Accessed: 2021) MXNet. Note: https://mxnet.apache.org Cited by: §1, §2.
  • M. Nejadgholi and J. Yang (2019) A study of oracle approximations in testing deep learning libraries. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 785–796. External Links: Document Cited by: §6.
  • [32] (Accessed: 2021) News. Note: https://www.vice.com/en_us/article/9kga85/uber-is-giving-up-on-self-driving-cars-in-california-after-deadly-crash Cited by: §1.
  • [33] (Accessed: 2021) News. Note: https://www.newsweek.com/autonomous-tesla-crashes-parked-fire-truck-california-freeway-789177 Cited by: §1.
  • F. Ocariza, K. Bajaj, K. Pattabiraman, and A. Mesbah (2013) An empirical study of client-side javascript bugs. In 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, pp. 55–64. Cited by: §4.1.2, §6.
  • H. V. Pham, T. Lutellier, W. Qi, and L. Tan (2019) CRADLE: cross-backend validation to detect and localize bugs in deep learning libraries. In 2019 IEEE/ACM 41st International Conference on Software Engineering, pp. 1027–1038. Cited by: §1, §4.3, §4.5.
  • [36] (Accessed: 2021) PyTorch. Note: https://pytorch.org Cited by: §1, §2.
  • Q. Shen, H. Ma, J. Chen, Y. Tian, S. Cheung, and X. Chen (2021) A comprehensive study of deep learning compiler bugs. In Proceedings of the 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Note: to appear Cited by: §3.1, §3.2, §3.2, §5.2, §6.
  • C. Sun, V. Le, Q. Zhang, and Z. Su (2016a) Toward understanding compiler bugs in gcc and llvm. In Proceedings of the 25th International Symposium on Software Testing and Analysis, ISSTA 2016, New York, NY, USA, pp. 294–305. External Links: ISBN 9781450343909, Link, Document Cited by: §4.6.
  • C. Sun, V. Le, Q. Zhang, and Z. Su (2016b) Toward understanding compiler bugs in gcc and llvm. In Proceedings of the 25th International Symposium on Software Testing and Analysis, pp. 294–305. Cited by: §4.1.2, §4.6.
  • L. Tan, C. Liu, Z. Li, X. Wang, Y. Zhou, and C. Zhai (2014) Bug characteristics in open source software. Empirical Software Engineering 19 (6), pp. 1665–1705. Cited by: §3.2, §4.1.2.
  • [41] (Accessed: 2021) TensorFlow. Note: https://www.tensorflow.org Cited by: §1, §2.
  • F. Thung, S. Wang, D. Lo, and L. Jiang (2012a) An empirical study of bugs in machine learning systems. In Proceedings of 23rd International Symposium on Software Reliability Engineering, pp. 271–280. Cited by: §3.2.
  • F. Thung, S. Wang, D. Lo, and L. Jiang (2012b) An empirical study of bugs in machine learning systems. In 2012 IEEE 23rd International Symposium on Software Reliability Engineering, pp. 271–280. Cited by: §4.1.2, §6.
  • Y. Tian, K. Pei, S. Jana, and B. Ray (2018) Deeptest: automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th international conference on software engineering, pp. 303–314. Cited by: §6.
  • S. M. Vieira, U. Kaymak, and J. M. Sousa (2010) Cohen’s kappa coefficient as a performance measure for feature selection. In Proceedings of International Conference on Fuzzy Systems, pp. 1–8. Cited by: §3.2.
  • J.M. Voas (1992) PIE: a dynamic failure-based technique. IEEE Transactions on Software Engineering 18 (8), pp. 717–727. External Links: Document Cited by: §4.5.
  • P. Wang, C. Brown, J. A. Jennings, and K. T. Stolee (2020a) An empirical study on regular expression bugs. In Proceedings of the 17th International Conference on Mining Software Repositories, pp. 103–113. Cited by: §6.
  • Z. Wang, M. Yan, J. Chen, S. Liu, and D. Zhang (2020b) Deep learning library testing via effective model generation. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 788–799. Cited by: §1, §4.3, §4.5, §4.5.
  • M. Wardat, W. Le, and H. Rajan (2021) DeepLocalize: fault localization for deep neural networks. In 2021 IEEE/ACM 43rd International Conference on Software Engineering, pp. 251–262. Cited by: §1.
  • X. Xie, L. Ma, F. Juefei-Xu, M. Xue, H. Chen, Y. Liu, J. Zhao, B. Li, J. Yin, and S. See (2019) Deephunter: a coverage-guided fuzz testing framework for deep neural networks. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 146–157. Cited by: §6.
  • M. Yan, J. Chen, X. Zhang, L. Tan, G. Wang, and Z. Wang (2021a) Exposing numerical bugs in deep learning via gradient back-propagation. In Proceedings of the 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Note: to appear Cited by: §1.
  • M. Yan, J. Chen, X. Zhang, L. Tan, G. Wang, and Z. Wang (2021b) Exposing numerical bugs in deep learning via gradient back-propagation. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 627–638. Cited by: §1, §4.2.2.
  • J. H. Zar (2005) Spearman rank correlation. Encyclopedia of biostatistics 7. Cited by: §4.4.
  • A. Zeller and R. Hildebrandt (2002) Simplifying and isolating failure-inducing input. IEEE Transactions on Software Engineering 28 (2), pp. 183–200. Cited by: §5.1.2.
  • T. Zhang, G. Upadhyaya, A. Reinhardt, H. Rajan, and M. Kim (2018a) Are code examples on an online q&a forum reliable?: a study of api misuse on stack overflow. In Proceedings of 40th IEEE/ACM International Conference on Software Engineering, pp. 886–896. Cited by: §4.1.2.
  • X. Zhang, J. Zhai, S. Ma, and C. Shen (2021a) AUTOTRAINER: an automatic DNN training problem detection and repair system. In 43rd IEEE/ACM International Conference on Software Engineering, pp. 359–371. Cited by: §1.
  • X. Zhang, N. Sun, C. Fang, J. Liu, J. Liu, D. Chai, J. Wang, and Z. Chen (2021b) Predoo: precision testing of deep learning operators. In Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 400–412. Cited by: §1.
  • Y. Zhang, Y. Chen, S. Cheung, Y. Xiong, and L. Zhang (2018b) An empirical study on tensorflow program bugs. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 129–140. Cited by: §1, §1, §5.2, §6.