1 Introduction
This work is prompted by an investigation into output discrepancies between two large supercomputers running the Community Earth System Model (CESM™). Determining the reason for the statistically distinct model output in more than a million lines of code required equal measures of data analysis, climate science knowledge, experience with the code base, and intuition. The process took the combined expertise of many scientists and engineers and lasted months [milroy2016]. In this work we make significant progress toward automating root cause analysis for sources of error and discrepancy in CESM.
The CESM is a commonly used application for simulating the Earth system, and its influence extends from science to policy. The model’s Fortran code base is modular, which facilitates its evolutionary and community development. The CESM has grown to approximately 1.5 million lines of code, which contain expressions of modern coding techniques together with code written in its earliest versions (decades ago). CESM’s size, complexity, and continuous development make finding errors difficult. Furthermore, there are few tools designed for debugging large models written in Fortran. We focus on the CESM in this work, though our debugging methods may be applicable to other large Fortran models or, with a different parser, to models written in other languages.
The first step to finding sources of inconsistency is to identify abnormal output. A simple test like bit-for-bit equivalence is not useful because legitimate changes or optimizations to the model can result in bitwise differences between outputs. The works [baker2015, baker2016]
establish statistical testing for consistency with an ensemble of “accepted” output from the Community Atmospheric Model (CAM) and Parallel Ocean Program (POP) component models of CESM. These Ensemble Consistency Tests (ECTs) quantify the natural climate model variability present in an ensemble of the respective component models’ outputs. The ECT can then evaluate new, experimental outputs in the context of the ensemble to determine whether the new outputs are statistically consistent. While this test has been shown to work very well for correctly classifying new outputs, in the case of a failure it provides no information as to the causes. In this work we attempt to develop a path to providing this crucial information on root causes of errors.
This work is organized as follows: in Section 2, we overview our strategy and contributions and discuss related work. Section 3 describes identifying output variables most affected by inconsistencies. In Section 4, we detail transforming approximately 660,000 lines of code into a directed graph (digraph). In Section 5 we define our method of iterative convergence to locate sources of discrepancy, and in Section 6 we present examples of our method.
2 Overview and related work
In this section we provide a summary of the methods we develop and describe our principal contributions. We summarize related work on program slicing and runtime sampling.
2.1 Method and contributions
Each step of our method is motivated by reducing the search space of possible causes of discrepancy. We wish to identify differences between the ensemble and experimental outputs as early as possible, so we examine the model in its first time steps and run consistency testing at time step nine using UF-CAM-ECT [milroy2018]. Using an early time step is an advantage for several reasons: bugs or discrepancies may not propagate changes through the entire model, climate feedback mechanisms may not yet take effect, and less of the source code is executed. Since CESM can be compiled in numerous configurations, we begin by eliminating modules not built into the final executable. We use an existing code coverage tool to discard modules not yet executed by the second time step, and remove subprograms that are unused. Next, we focus on variables written to file that are most affected by the discrepancy, allowing us to disregard locations that compute other variables. These initial steps reduce the potential lines to search from about 1.5 million to 660,000, which is still substantial. From this reduced code base, we construct a digraph of variable dependencies expressed through assignment statements. We then extract from this graph a subgraph that computes the variables identified as affected by the discrepancy. To facilitate parallelism and runtime sampling (among other benefits), we use clustering to partition the subgraph. For each cluster, we rank nodes based on their centrality to determine which code variables to sample at runtime. We plan to further narrow the search space based on value differences between an ensemble and an experimental run, followed by clustering and sampling by centrality to converge iteratively on the sources of discrepancy (currently performed in simulation). Figure 1 is a schematic of our process.
We make the following contributions in this work: we create a pipeline to convert the CESM source code into a digraph with extensive metadata that represents variable assignment paths. We develop a hybrid static program slicing approach that efficiently returns large slices. We devise an iterative refinement procedure based on community detection, centrality, and runtime sampling to contract the slice to a useful size. We perform experiments based on CESM output that demonstrate finding the causes of model discrepancy. Finally, we provide evidence that our methods accurately characterize information flow at runtime.
2.2 Related work
Program slicing is a common technique in debugging and in software development and maintenance that extracts sections of a program that can affect a particular region of code [weiser1981, weiser1984]. In a broad sense, program slicing can be divided into two methods: static slicing, which considers all possible executions of a program, and dynamic slicing, which accounts for only one execution given a set of criteria (e.g., [tip1994, silva2012]). Static slicing is generally less expensive but can return slices that contain too many extraneous statements to be useful [bent2001]. Dynamic slicing can be far more precise but correspondingly expensive due to the inclusion of algorithms needed to evaluate the satisfiability of sections of the slice (such as SAT or Satisfiability Modulo Theory solvers [harris2010]). So-called backward slicing considers subsets of code that affect a target location by backward traversal; it can be performed via static or dynamic slicing [jaffar2012]. We are not aware of any dynamic slicing methods that scale to models consisting of over a million lines of code. We adopt the strategy of hybrid slicing [gupta1995], which uses dynamic information about program execution to refine static slices. In our case, the dynamic information is provided by a code coverage tool.
Program sampling or instrumentation provides detailed analysis of program states by monitoring variable values at runtime. This type of monitoring can be used to detect divergent values of individual variables but can be extremely expensive (both in space and time) depending on the sampling frequency and the number of variables monitored. Many debuggers and profiling toolkits can perform sampling of large, distributed-memory applications (Allinea MAP and DDT [allinea2018], TotalView [totalview2018], and Tau [shende2006], to name a few), and tools such as FLiT [sawaya2017] and KGen [kim2016, kim2017] can detect divergent values at runtime. We seek to reduce the search space of CESM to the point that such tools (or those of future design) can identify specific variables that cause model divergence.
3 Identifying affected output variables
After UF-CAM-ECT returns a failure, we identify CAM output variables that are affected (or most affected) by the cause of the failure. Doing so allows us to make a connection between the model outputs and the code itself. Ideally, we perform a normalized comparison of floating point values at the first model time step, selecting only those variables that exhibit a difference between a single ensemble member and a single experimental run. This approach is the most direct measure of difference, and we recommend using it first due to its simplicity. However, comparing floating point values is seldom useful for narrowing down the number of variables, since in most cases all CAM output variables are different at model time step zero. For these cases, we instead examine properties of the variables’ distributions with two variable selection methods to identify those most affected by the discrepancy.
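As an illustration of the direct comparison, the following Python sketch flags variables whose values differ between one ensemble member and one experimental run. This is not the paper's code: the variable names, array values, and tolerance are hypothetical.

```python
# Illustrative sketch: flag output variables whose values differ between
# one ensemble member and one experimental run, using a normalized
# (relative) comparison. Names, values, and tol are hypothetical.
import numpy as np

def differing_variables(ensemble, experiment, tol=0.0):
    """ensemble/experiment map variable name -> array of values."""
    flagged = []
    for name, ens_vals in ensemble.items():
        exp_vals = experiment[name]
        denom = np.maximum(np.abs(ens_vals), np.finfo(float).tiny)
        rel_diff = np.abs(exp_vals - ens_vals) / denom  # normalized difference
        if np.max(rel_diff) > tol:
            flagged.append(name)
    return flagged

ens = {"T": np.array([280.0, 281.5]), "U": np.array([10.0, 12.0])}
exp = {"T": np.array([280.0, 281.5]), "U": np.array([10.0, 12.1])}
print(differing_variables(ens, exp))  # only "U" differs
```

In practice this direct test rarely narrows the search, which motivates the two distribution-based selection methods described next.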
The first method measures distances between the distribution medians of the ensemble and experimental runs for each variable. To make meaningful distance comparisons across variables, we standardize each variable’s distribution by its ensemble mean and standard deviation. Then we identify variables whose interquartile ranges (IQRs) of ensemble and experimental distributions do not overlap. We then rank these variables in descending order of distance between their medians. Although this provides a straightforward ordering of variables, the disadvantage of this approach is that often many variables are identified. Our second method employs logistic regression with regularization via a penalized l1 norm (known as the lasso). We generate a set of experimental runs and use it in conjunction with our ensemble set to identify the variables that best classify the members of each set. We tune the regularization parameter to select about five variables, as that yields a subset of CESM and CAM that, in our experiments, contains the known source of statistical inconsistency while still being small. The variables selected by the lasso (and their order) mostly coincide with the order produced by computing the distance between standardized medians. Variable selection for smaller or simpler models may present less of a challenge.
4 From source code to digraph
Finding lines of code that modify a particular CAM output variable seems a straightforward task: use a textbased search to select code that modifies the variable in question. However, many internal variables may alter values that eventually propagate to the affected output values, and the data dependencies are likely to be complicated. To describe the relationships between CESM variables accurately, we convert each source code file into an Abstract Syntax Tree (AST), which represents code syntax as structural elements of a tree. From the ASTs we create a digraph which represents variable dependencies. Figure 2 provides a simple example of the transformation of source code assignments to a digraph.
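The transformation illustrated in Figure 2 can be approximated in a few lines. The toy sketch below skips the AST entirely and handles only trivial "lhs = rhs" assignments, which the real pipeline cannot do; it is meant only to make the edge direction concrete (right-hand-side variables point at the left-hand side).

```python
# Toy sketch of turning assignment statements into a variable-dependency
# digraph: each right-hand-side variable gets a directed edge to the
# left-hand-side variable. Real CESM parsing goes through an AST; this
# version only splits simple "lhs = rhs" strings.
import networkx as nx

assignments = [
    "c = a + b",
    "d = c * a",
]

G = nx.DiGraph()
for stmt in assignments:
    lhs, rhs = [s.strip() for s in stmt.split("=")]
    for token in rhs.replace("*", " ").replace("+", " ").split():
        if token.isidentifier():
            G.add_edge(token, lhs)  # rhs variable influences lhs variable

print(sorted(G.edges()))  # [('a', 'c'), ('a', 'd'), ('b', 'c'), ('c', 'd')]
```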
4.1 Generating the AST
To construct the AST for CESM, we need to parse the source code. We use the same CESM version as in [kay2015], and our experimental setup (FC5) consists of a subset of all available component models. Before parsing, we perform several preprocessing steps to exclude code that is not executed. Unfortunately, the CESM build system obfuscates which components’ Fortran modules are compiled into the specified model. Therefore, we employ KGen [kim2016, kim2017], a tool to extract and run code kernels as standalone executables, to identify the files compiled into the executable model, reducing the number of modules from approximately 2400 to the nearly 820 used by our experimental setup. KGen also replaces preprocessor directives with their compile-time values, enabling conversion of Fortran code to a Python AST via fparser (based on F2PY [peterson2009]). Fparser is the only tool we are aware of that parses Fortran into Python data structures.
We further limit the scope of code considered by examining coverage, which identifies the code lines, subprograms, and modules executed in a given application. Since our objective is to identify critical code sections as early as possible in the CESM runtime, we can ignore many subsections of code that are not yet run. To find such code, we use Intel’s code coverage tool [intelcodecov2017], which writes profiling files that indicate coverage down to individual lines. In our experience, the tool returns accurate evaluations at the level of subprograms, but its behavior at the line level is inconsistent. Nevertheless, finding entire unused modules and uncalled subprograms is useful and reduces the number of modules and subprograms to be parsed by about 30% and 60%, respectively. We develop software to parse the codecov HTML output, using it to remove unnecessary modules and comment out unused subprograms.
4.2 From AST to digraph
After converting each Fortran module file into an AST, we extract data dependencies to form a digraph. See Figure 3 for a visual overview. We need to resolve all assignments, as directed paths of assignments define dependencies between variables. Tracing dependencies between subprograms (similar to interprocedural program slicing [weiser1984]) requires processing subroutine and function calls, interfaces, use statements, etc. Assignments without functions or arrays are processed immediately. To allow correct mappings between call and subprogram arguments, parsing statements with calls must be done after all source files are read. Furthermore, Fortran syntax does not always distinguish function calls from arrays, so correct associations must be made after creating a hash table of function names.
Transforming the source code into a digraph presents several challenges. Fparser sometimes fails to convert a Fortran file into an AST due to bugs and statements that exceed fparser’s capabilities (e.g., one CESM statement consists of over 3500 characters). In fact, CESM contains thousands of expressions that are highly complex, with deep function and subroutine calls. Because existing Fortran parsing tools are inadequate for CESM, we employ three different parsers for each assignment (some are subjected to multiple passes of these parsers): fparser, KGen helper functions, and our custom string parsing tool based on regular expressions and Python string manipulations.
Processing the ASTs results in a metagraph Python class that contains a digraph of internal variables, subprograms, and methods to analyze these structures. CESM internal variables are nodes with metadata, such as location (module, subprogram, and line) and “canonical name” (the variable name before being entered into the digraph, which requires unique node names). The digraph component of the metagraph is a NetworkX digraph [schult2008]. NetworkX is a Python graph library that provides an extensive collection of ready-to-use graph algorithms and analysis tools.
With static analysis it is not always possible to determine which function a Fortran interface call truly executes at runtime. We adopt the conservative approach of mapping all possible connections. We map the targets of use statements to their local names to establish correct local symbols for remote procedures, resolving Fortran renames. If the use statement does not specify an “only list,” we map all public variables in the source module to their target module variables. We do not consider chained use statements (i.e., where module A uses B, which uses C), since accurate dependency paths can be created by connecting the statements independently. With these associations defined, we iterate through statements containing subroutine calls and possible functions or arrays. We process subroutine and function calls by treating each argument as a tree, and we successively map outputs of lower levels to corresponding inputs above. Each output gets an edge to the above layer’s input, which injects the call’s graph structure into the CESM digraph. The top-level argument output is connected to the subroutine’s corresponding argument in its definition. Discerning functions from arrays is addressed by hash table lookups in the metagraph. Ultimately, the expression’s right-hand-side variables, arrays, and function (or subroutine argument) outputs are given edges to the left-hand side.
We adopt a conservative approach for handling composite and complex Fortran data structures. Arrays are considered atomic in that we ignore indices. Pointers are treated as normal variables. Fortran derived types are challenging, as they can be chained into deep composite data structures. We define the indexed element of the derived type as the metagraph canonical name; e.g., elem(ie)%derived%omega_p has a canonical name of “omega_p.” In effect, we are compiling the CESM Fortran source code into node relationships in a digraph. Note that our parsing handles all but 10 assignment statements of the 660,000 lines of code in the coverage-filtered source.
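The canonical-name convention can be made concrete with a short sketch. The regular expression below is an illustrative simplification, not our actual parser: it keeps only the final derived-type component and strips index expressions.

```python
# Sketch of deriving a "canonical name" from a Fortran derived-type
# reference: keep the final component, drop array indices. The regex is
# an illustrative simplification of the real parsing.
import re

def canonical_name(ref):
    last = ref.split("%")[-1].strip()           # final derived-type component
    return re.sub(r"\(.*\)", "", last).strip()  # drop index expressions

print(canonical_name("elem(ie) % derived % omega_p"))  # omega_p
print(canonical_name("state%omega(i,k)"))              # omega
```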
5 Analyzing the CESM graph
We have transformed the CESM code into a digraph that is composed of nodes, which are variables present in assignment expressions, and directed edges that indicate the directionality of the effect of one variable upon another. Now we narrow the scope of our search for bugs by analyzing the graph, usually accomplished by program slicing. Static slicing often produces slices that are too large to locate error sources, and dynamic slicing, while more precise, is too expensive to apply to the CESM graph (about 100,000 nodes and 170,000 edges). Therefore, to make locating internal CESM variables or nodes that influence the values of the affected output variables more tractable, we examine static data dependency paths that terminate on these variables. We mitigate the imprecision of static backward slicing by integrating graph analysis algorithms to refine our slices. In this section, we discuss these methods and propose an iterative subgraph refinement procedure that involves runtime sampling of CESM graph nodes.
5.1 Tracing affected internal variables in the graph
Since variable relationships in assignment statements are represented as directed edges in the graph, we are interested in directed paths through CESM. These paths ignore control flow such as “if” statements or “do loops,” so this approach is akin to backward static slicing. A key difference between our approach and typical program slicing is that nodes in the graph are single variables rather than expressions of multiple variables. Slicing criteria are thus single variables. When used in conjunction with runtime information in the form of code coverage, our method can be considered hybrid slicing (e.g., [gupta1995]).
In NetworkX, the fastest way to determine dependencies is by computing shortest paths. In particular, we seek the shortest paths that terminate on output variables. Finding such output variables is a challenge in its own right. Ideally, we would find the locations where I/O calls are made with the output variables as arguments, and find all shortest paths in the graph that end on those calls. Considering these paths does not work well in practice because CESM subprograms that write derived types, e.g., state%omega, usually take the base derived type (state) as an argument, rather than the derived type element (omega). This means that there are few paths that terminate on state%omega at the call location. We address this problem by searching for paths that terminate on nodes with the canonical name (see Section 4) of omega. This approach increases the size of our static slice, but with the attendant advantage that the bug source will very likely be contained in the slice.
CESM I/O statements use temporary variables extensively and include character-type variables in the output name argument, so uncovering the exact variable output for a given I/O call must be done with custom instrumentation. Of the nearly 1200 CAM I/O calls that write output variables, many include variables to label the output. To resolve these variables, we instrument the code to print the corresponding string label, permitting a mapping between internal variable names and the names written to file. For example, we do not search for paths that end on the CAM output flds, but on variables whose canonical name is the internal name flwds.
Given a set of output variables that are affected by a certain change, we compute the shortest directed paths that terminate on these variables with Breadth First Search (BFS). After finding these paths, we form the union of the node sets of all such paths. We are interested in the union rather than the intersection because multiple disjoint code sections can be involved in the computation of an affected variable. Such a scenario can arise when conditionals dictate whether I/O calls are executed. Using the union of all shortest paths terminating on the internal canonical names of affected output variables, we induce a subgraph on CESM, yielding the subgraph that contains the causes of discrepancy.
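In NetworkX terms, this backward slice can be sketched as a BFS on the reversed digraph: the nodes reached from a target in the reversed graph are exactly the nodes that can influence it in the original graph. The toy graph and node names below are hypothetical.

```python
# Hedged sketch of the backward slice: BFS on the reversed digraph
# collects every node with a directed path into a target variable; the
# union over all targets induces the subgraph to analyze.
import networkx as nx

G = nx.DiGraph([("a", "b"), ("b", "omega_p"), ("c", "omega_p"),
                ("x", "y")])  # x, y are unrelated to the target

targets = ["omega_p"]
nodes = set(targets)
R = G.reverse(copy=False)  # reversed view, no copy
for t in targets:
    # BFS from t in the reversed graph reaches exactly the nodes that
    # can influence t in the original graph
    nodes |= set(nx.single_source_shortest_path_length(R, t))

slice_graph = G.subgraph(nodes)
print(sorted(slice_graph.nodes()))  # ['a', 'b', 'c', 'omega_p']
```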
5.2 Community structure and node centrality
Since CESM and its component models are modular, it is reasonable to conclude that its graph should exhibit clusters corresponding to the modules or related processes. Induced subgraphs of CESM may contain cluster or community structure that can be exploited to improve our search for bug sources, which ends with sampling affected variables. Since sampling can be an expensive process, only a limited number of nodes in the subgraph should be instrumented. By partitioning the subgraph through community detection, we can choose a small number of highly connected nodes in each community to sample and perform the instrumentation of these nodes independently (in parallel). This process can be performed iteratively to reduce the search space.
CAM contains two main processes: physics (subgrid scale) and dynamics, which taken together feature a set of highly connected modules (the “core”). These CAM modules are involved in the computation of many of the output variables, and bugs are likely to affect multiple output variables. An examination of node connectivity in the core reveals clustering of highly connected nodes in different communities. Although sampling the whole core’s most connected nodes may detect floating point differences between ensemble and experimental runs, instrumenting highly connected nodes in each community instead can reduce the distance between instrumented variables and bug locations (reducing the number of iterations needed to refine the search space).
Centrality is a fundamental way to distinguish nodes in a graph. Two simple examples of centrality are degree centrality, which counts the number of edges connected to a given node, and betweenness centrality, which counts the number of BFS or Dijkstra shortest paths (for weighted graphs) that traverse a node (or edge). Graph analysis via centralities proves useful in many diverse areas of research, e.g., [freeman1978, salathe2010, shah2010, clauset2015]. A study of the relationship between brain regions’ centralities and physical and cognitive function [vandenheuvel2013] is particularly relevant to our work. Its authors conclude that such analysis consistently identifies structural hubs (high-centrality regions) in the cerebral cortex, and that “high centrality makes hubs susceptible to disconnection and dysfunction.”
The Girvan-Newman algorithm (GN) [girvan2002, newman2004] is a popular method for identifying communities in undirected graphs. The algorithm is based on edge betweenness centrality, which ranks edges by the number of shortest paths (computed via BFS) that traverse them. The algorithm successively removes the edge with highest centrality in each connected component, which breaks the graph into ever smaller communities. GN identifies communities via the following steps [girvan2002]: 1. calculate the betweenness for all edges in the network; 2. remove the edge with the highest betweenness; 3. recalculate the betweenness for all edges affected by the removal; 4. repeat from step 2 until no edges remain. In practice, each iteration involves removing the edge with the highest betweenness until the number of communities increases [newman2004]. Note that GN was formulated to identify communities in undirected graphs. In our case, we convert the directed subgraph into an undirected subgraph for purposes of community detection. This conversion is desirable for our work, as it is equivalent to forming the weakly connected graph of the directed subgraph. A digraph is weakly connected if any node can be reached from any other node when edge directions are ignored. Bug locations may be anywhere in the subgraph, so we cannot impose assumptions about whether instrumented nodes are reachable from bug sources in the digraph (even between communities) in either direction. However, in our experiments we know where the bug locations are, so we can simulate how our sampling procedure detects floating point differences between the ensemble and experiment. Given our knowledge of directed paths’ connectivity from known bug sources to central nodes, we can deduce whether a difference can be detected. For our method to be useful in situations where bug locations are unknown, we cannot assume such knowledge when we identify communities.
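Using NetworkX's implementation, one GN iteration on the undirected version of a small digraph looks like the following; the toy graph (two triangles joined by a bridge) is an assumption for illustration.

```python
# One Girvan-Newman iteration on the undirected version of a toy
# variable-dependency digraph: the bridge edge c-d has the highest edge
# betweenness, so removing it yields the first community split.
import networkx as nx
from networkx.algorithms.community import girvan_newman

D = nx.DiGraph([("a", "b"), ("b", "c"), ("a", "c"),
                ("c", "d"),               # bridge between two clusters
                ("d", "e"), ("e", "f"), ("d", "f")])

U = D.to_undirected()                     # GN is defined for undirected graphs
communities = next(girvan_newman(U))      # first split only
print(sorted(sorted(c) for c in communities))  # [['a', 'b', 'c'], ['d', 'e', 'f']]
```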
5.3 Finding important nodes with centrality
Given a modification that alters the values of a set of output variables, we seek locations in CESM that influence their computation. The CESM digraph lacks any information about the nature of connections between variables, so, for example, linear and exponential relationships are expressed identically in the graph. Indeed, the connectivity of the CESM graph is the only information we have to identify important locations in the code for sampling.
We use centrality to select nodes whose values are likely to be affected by the causes of statistical distinguishability. We can then sample the variables’ runtime values to detect differences between an experimental and a control (or ensemble) run. Eigenvector centrality is a promising choice, as it considers not only the degree of each node but also the degrees of its neighbors and their neighbors, and it is related to information flow in a graph. In fact, eigenvector centrality is related to PageRank, which is used to rank web pages in search results [page1999]. In this work we focus on in-centrality, as we seek nodes that are likely to be affected by the bug sources. From the perspective of sampling, we are looking for information sinks rather than sources. Eigenvector centrality has the disadvantage of favoring hubs (highly connected nodes), which “causes most of the weight of the centrality to concentrate on a small number of nodes in the network” for power-law graphs [martin2014]. The degree distribution of the total CESM graph approximately follows a power law, as can be seen in Figure 4. Induced subgraphs of the CESM graph are also plausibly scale-free. A natural question is whether the concentration of centrality on graph hubs has undesirable effects on the ranking of nodes. We found that applying non-backtracking centrality (based on the Hashimoto matrix [hashimoto1989]) provides no advantage over standard eigenvector centrality for the CESM graph, its subgraphs, or communities. However, it may prove beneficial for models whose graphs follow a power law that produces more pronounced localization [martin2014].
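The in-centrality ranking can be sketched on a small hypothetical digraph. Note that NetworkX's eigenvector_centrality scores the nodes of a directed graph by their in-edges, which matches the information-sink view described above.

```python
# Toy digraph: three variables feed "hub", which feeds two of them back.
# Eigenvector centrality on a NetworkX DiGraph is computed from in-edges,
# so "hub", the strongest information sink, ranks first. Names are
# hypothetical.
import networkx as nx

D = nx.DiGraph([("a", "hub"), ("b", "hub"), ("c", "hub"),
                ("hub", "a"), ("hub", "b")])

cent = nx.eigenvector_centrality(D, max_iter=500)
ranked = sorted(cent, key=cent.get, reverse=True)
print(ranked[0])  # hub
```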
5.4 Iterative refinement procedure
Once communities are detected in the subgraph, we compute each community’s eigenvector in-centralities and choose the top nodes to sample. The number of nodes to sample is dictated by computational resources. Based on whether a value difference can be found between the nodes sampled in the ensemble run and the experimental run, we can iteratively reduce the size of the subgraph to converge on the sources of statistical inconsistency. This iterative approach is similar to a k-ary search, which is a generalization of binary search. In binary search, the search space is halved and a single determination is made at each iteration, whereas in k-ary search the space is partitioned into k sections and k - 1 evaluations are made at each iteration. In our case, k varies by iteration depending on the number of communities identified. The following algorithm summarizes our overall approach:
Algorithm 5.4
1. Perform the variable selection detailed in Section 3.
2. Map the set of affected CAM output variables from step 1 to their internal CAM variables.
3. For each affected internal variable v, use BFS to find the set of nodes in all shortest paths that terminate on variables with canonical names equal to v in the CESM digraph.
4. Form the induced subgraph S via the union of the nodes in the paths from step 3.
5. Use GN to identify the communities C_i of undirected S (omitting communities smaller than 3 nodes).
6. Compute the eigenvector in-centrality for each C_i and select the set T_i of nodes with largest centrality.
7. Instrument the T_i for all i in parallel for an ensemble run and an experimental run, noting the set D of nodes that take different values.
8. a. If D is empty (i.e., no different values are detected), form the induced subgraph on all nodes in S that are not in BFS shortest paths that terminate on the sampled nodes in the sets T_i.
   b. Else, form the induced subgraph of S generated by the nodes in S that belong to BFS shortest paths that terminate on D.
9. Repeat steps 5-8 until the subgraph is small enough for manual analysis or the bug locations are instrumented.
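A simulated version of one pass of this loop can be sketched as follows. This is an illustrative simplification, not our implementation: the differs() predicate stands in for runtime instrumentation, in-degree stands in for eigenvector in-centrality to keep the sketch short, and the graph and node names are hypothetical.

```python
# Simulated sketch of one refinement pass (steps 5-8). differs() stands
# in for runtime instrumentation; in-degree stands in for eigenvector
# in-centrality. The graph and the known "bug" node are hypothetical.
import networkx as nx
from networkx.algorithms.community import girvan_newman

def backward_reach(S, targets):
    """All nodes with a directed path into any target (targets included)."""
    R = S.reverse(copy=False)
    reach = set(targets)
    for t in targets:
        reach |= set(nx.single_source_shortest_path_length(R, t))
    return reach

def refine(S, differs, n=1, min_size=3):
    """One pass of steps 5-8; returns the next subgraph."""
    comms = [c for c in next(girvan_newman(S.to_undirected()))
             if len(c) >= min_size]                       # step 5
    top = set()
    for c in comms:                                       # step 6 (simplified)
        top |= set(sorted(c, key=S.in_degree, reverse=True)[:n])
    D = {v for v in top if differs(v)}                    # step 7 (simulated)
    if D:                                                 # step 8b
        return S.subgraph(backward_reach(S, D))
    return S.subgraph(set(S) - backward_reach(S, top))    # step 8a

# Two clusters joined by the bridge p3 -> q3; the bug affects bug -> q1 -> h2.
S = nx.DiGraph([("p1", "h1"), ("p2", "h1"), ("p3", "h1"), ("p3", "q3"),
                ("bug", "q1"), ("q1", "h2"), ("q2", "h2"), ("q3", "h2")])
affected = {"bug", "q1", "h2"}                            # differ at runtime
S2 = refine(S, lambda v: v in affected)
print(sorted(S2.nodes()))  # the bug node survives; p1, p2, h1 are discarded
```

Here sampling the central node of the bug-free cluster detects no difference, while the central node of the other cluster does, so step 8b keeps only the nodes feeding the differing node and the subgraph shrinks while retaining the bug.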
There are three issues in the process above that merit discussion. First, it is possible that steps 5-8b of algorithm 5.4 do not refine the subgraph of the previous iteration, i.e., if the subgraph connectivity is such that all nodes are connected to all central nodes that take different values between the ensemble and experimental runs. In this case, we can select a subset of the most central nodes “most affected” by the bugs. The second issue is that it is possible that the bug sources are not contained in any community, e.g., if a bug is in an output variable that has only one neighbor. In this case, no different values will be detected in step 7, and the new induced subgraph will still meet the condition of step 8a. Subsequent iterations will not detect differences, and the successive subgraphs will become increasingly disconnected. Eventually GN will not identify any communities, and the resulting nodes will need to be analyzed manually. The third issue is an artifact of static slicing: since the paths do not take into account, e.g., conditional branches, some of the paths may never be traversed. We need to develop a method to track edge traversal and remove invalid paths; algorithm 5.4 must only remove nodes that can actually influence the instrumented nodes T_i in step 8a.
Unless otherwise noted, we perform only one iteration of GN in algorithm 5.4 step 5. We could use a larger number to further subdivide the induced subgraph in each iteration (possibly enabling more parallelism), but we adopt a conservative approach to avoid clustering the subgraphs far beyond the natural structure present in the code. Note that excessive GN iterations would not prevent algorithm 5.4 from locating bug sources, but it may slow the process.
6 Experiments
We apply the overall method discussed in Section 5.4 to several experiments. For all but one experiment, we introduce a bug into the source code so that the correct location is known. We then verify that our method can be used to identify the bug location in CESM by demonstrating how it would converge on the location given instrumentation. First, we show that our method can correctly identify straightforward single-line bugs before proceeding to more complicated sources of output discrepancy, such as the identification of variables most affected by certain CPU instructions. We make the following choices and assumptions in our experiments (unless otherwise noted): we restrict our subgraphs to nodes in CAM modules, perform a single GN iteration, choose the top 10 nodes by in-centrality to sample, and assume all paths are traversed at runtime. Our method can iteratively locate bug sources if our restriction to variables in CAM modules is lifted, but the resulting search may require more iterations than restricting variables to CAM. In the figures that follow, subfigures (a) are the outputs of algorithm 5.4 step 4, 8a, or 8b, depending on the iteration or whether simulated sampling detects differences. Subfigures (b) color members of each community discovered by step 5, and subfigures (c) represent the output of step 7
for the community containing the discrepancy sources. Note that we use vector graphics for plots to encourage electronic copy readers to zoom in on features in each figure. Each following subsection describes a different experiment.
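The per-iteration selection of sampling targets can be sketched in plain Python. The Katz-style recursion below is a stand-in for the paper's eigenvector in-centrality (it ranks nodes similarly while also converging on acyclic subgraphs); the edge-list graph encoding and variable names are assumptions for illustration:

```python
from collections import defaultdict

def in_centrality(edges, alpha=0.5, iters=100, tol=1e-12):
    """Katz-style in-centrality by fixed-point iteration on the reversed
    digraph: a node is central when many assignment paths terminate on it.
    (A stand-in for eigenvector in-centrality that converges on DAGs.)"""
    nodes, preds = set(), defaultdict(set)
    for u, v in edges:
        nodes |= {u, v}
        preds[v].add(u)          # reversed adjacency: who feeds into v
    x = {n: 1.0 for n in nodes}
    for _ in range(iters):
        nxt = {n: 1.0 + alpha * sum(x[p] for p in preds[n]) for n in nodes}
        if max(abs(nxt[n] - x[n]) for n in nodes) < tol:
            return nxt
        x = nxt
    return x

def top_k(centrality, k=10):
    """The k nodes to instrument (algorithm 5.4 samples the top 10)."""
    return sorted(centrality, key=centrality.get, reverse=True)[:k]
```

For example, a node fed by three assignment paths outranks its own downstream consumer, so instrumentation concentrates where information flow converges.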
6.1 WSUBBUG
We begin our testing with a bug in an isolated CAM output variable: wsub. By isolated, we mean disconnected from the CAM core (see Section 5.2) and highly localized. Such a bug has minimal effect and scope, which makes it a good sanity check for our method. The bug consists of a plausible typo (transposing 0.20 to 2.00) in one assignment of wsub in microp_aero.F90. The variable is written to file in the next line, so this bug affects only the single output variable. This small change produces a UF-CAM-ECT failure. In this case the median-distance method clearly indicates that the wsub variable is distinct; the distance between the experimental and ensemble medians for this variable is more than 1,000 times greater than for the variable ranked second. The induced subgraph contains only 14 internal variables, all of which are related to wsub, with one being the bug itself.
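The median-distance screen described above can be sketched as follows; the paper does not specify a per-variable normalization, so this minimal version compares raw medians (the dictionary layout is an assumption):

```python
from statistics import median

def median_distance_ranking(ensemble, experiment):
    """Rank output variables by the distance between the experimental and
    ensemble medians. Inputs map variable name -> list of sampled values."""
    scores = {v: abs(median(experiment[v]) - median(ensemble[v]))
              for v in ensemble}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A transposed-digit bug like the wsub typo shows up as a top-ranked variable whose distance dwarfs the runner-up's.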
6.2 RANDMT
This example, RANDMT, involves replacing the CESM default pseudo-random number generator (PRNG) with the Mersenne Twister. This experiment appears in [milroy2018] as an example that results in a UF-CAM-ECT failure. The random number generator is used to calculate distributions of cloud-related CAM variables, and this experiment is interesting because it is not a bug and is not localized to a single line. We identify the variables immediately influenced or defined by the numbers returned from the PRNG and consider them to be the bug locations. The lasso variable selection method identifies the five output variables most affected by the PRNG substitution. From these variables, we extract a subgraph of 5,121 nodes and 9,755 edges. Given the size of this induced subgraph, we must use our iterative technique on subgraph communities to reduce the scope of our search. GN identifies two main communities (blue and green in Fig. 4(b)) in the CAM core. The smaller, green community contains the nodes computed using output from the PRNG. Instrumenting the top 10 most central variables in this community would not detect a difference, as there are no paths from the variables in the bug location to these nodes (see Fig. 4(c)). Creating the induced subgraph of all nodes not in shortest paths terminating on the most central nodes (algorithm 5.4 step 8a) yields a dramatic reduction in the search space (Figure 5(a)), which includes disconnected nodes and those with a single neighbor along the perimeter. Instrumenting the most central, orange nodes in Figure 5(c) would indicate a difference, as there are multiple paths from the discrepancy sources. This subgraph is small, and the sources are sufficiently near the sampling sites that the cause could be found at this stage.
It is noteworthy that the induced subgraph does not contain all the source locations of the statistical distinguishability. The PRNG in CAM is called in two modules: one that computes cloud cover given longwave radiation, and a second given shortwave radiation. The combination of flwds (downwelling longwave flux at surface) and qrl (longwave heating rate) causes the longwave module to be present in the induced subgraph. However, the two variables needed to include the shortwave module in the induced subgraph (fsds and qrs) are not among the first five variables returned by lasso.
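The two refinement branches of algorithm 5.4 (step 8a when sampling detects no difference, step 8b when it does) can be sketched as one reverse-BFS pass over the subgraph; the edge-list encoding is an assumption:

```python
from collections import defaultdict, deque

def refine(edges, sampled, difference_detected):
    """One refinement step (sketch): compute all ancestors of the sampled
    central nodes, i.e., every node with a path terminating on a sampled
    node. If a difference was detected (step 8b), the bug lies on such a
    path, so keep the ancestors; if not (step 8a), discard them."""
    nodes, preds = set(), defaultdict(set)
    for u, v in edges:
        nodes |= {u, v}
        preds[v].add(u)
    seen, q = set(sampled), deque(sampled)
    while q:                      # reverse BFS from the sampled nodes
        for p in preds[q.popleft()]:
            if p not in seen:
                seen.add(p)
                q.append(p)
    keep = seen if difference_detected else nodes - seen
    return [(u, v) for u, v in edges if u in keep and v in keep]
```

Either branch strictly shrinks the candidate edge set, which is what drives convergence of the iterative search.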
6.3 GOFFGRATCH
Our third experiment is a modification of the Goff and Gratch saturation vapor pressure elemental function. We change a coefficient of the water boiling temperature from 8.1328e3 to 8.1828e3. This easy-to-miss typo results in a UF-CAM-ECT failure. The output of the Goff and Gratch function is used extensively in the CAM core, so its effects are not localized. The lasso variable selection method selects 10 variables; due to experiment-specific conditions, tuning the regularization parameter to select only five variables would require a more sophisticated approach. Inducing a subgraph on locations that compute these variables results in a graph of 5,162 nodes and 9,873 edges (Figure 6(a)). The largest community (blue in Figure 6(b)) contains the nodes affected by the incorrect coefficient. Instrumenting the top 10 most central variables in this community would detect a difference, as there are paths from the variables in the bug location to these nodes (Figure 6(c)). Creating the induced subgraph of all shortest paths terminating on these central nodes (algorithm 5.4 step 8b) returns a subgraph that includes part of the green community from the first iteration. Subsequent community detection reveals the remnants of the first iteration's green community, which are then excluded by sampling. However, in this case no further simulated iterative refinement can be performed by inducing a subgraph on nodes connected to the instrumented variables, as this subgraph is so highly connected that the induced subgraph equals the community subgraph. In this case, we can rank the differences obtained by sampling and further refine the subgraph based on the nodes with the greatest differences.
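The fallback used when structural refinement stalls (the induced subgraph equals its own community) can be sketched as a simple ranking of observed runtime differences; the per-node summary values are placeholders for whatever statistic the sampler records:

```python
def rank_sampled_differences(sampled_ens, sampled_exp):
    """When the subgraph can no longer be shrunk structurally, rank the
    instrumented nodes by the magnitude of the ensemble-vs-experiment
    difference observed at runtime, so the next refinement can focus on
    the nodes with the greatest differences."""
    diffs = {n: abs(sampled_exp[n] - sampled_ens[n]) for n in sampled_ens}
    return sorted(diffs, key=diffs.get, reverse=True)
```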
6.4 AVX2
As noted in Section 1, this work is motivated by the lengthy, manual investigative process used to find the source of statistical distinguishability between CESM outputs generated on the Mira [mira2018] and Yellowstone [yellowstone2016] supercomputers (described in [milroy2016]). The discrepancy was determined to be caused by fused multiply-add (FMA) instructions used by Mira, and it sparked interest in developing an automated process.
Using a method that measured each CAM output variable's contribution to the CAM-ECT failure rate, affected variables were identified and located in the Morrison-Gettelman microphysics module (MG1) [milroy2016]. KGen was used to convert this module into a kernel and to find variables with substantially different RMS values between Yellowstone and Mira. One of these variables was nctend, which is modified by a frequently used temporary variable, dum. Nctend also exhibits significantly different values between Yellowstone and Haswell-generation (FMA-capable) Intel CPUs.
Here we demonstrate that results culminating from months of work by CESM experts could be obtained by our automated method. Because we are unable to use Mira and Yellowstone, we evaluate FMA on Cheyenne [cheyenne2017]. Cheyenne contains Intel Broadwell CPUs, which support the Intel AVX2 instruction set, including FMA. For our work, we compare an ensemble generated with AVX2 disabled (thus disabling FMA) to an experimental set generated with AVX2 (and FMA) instructions enabled. We verify that enabling AVX2 and FMA causes a UF-CAM-ECT failure (see Table 1). Since FMA instructions can be generated from many different lines of source code (distributed sources of discrepancy), we employ KGen to identify a small number of variables affected by AVX2 and FMA to designate as bugs. We extract the Morrison-Gettelman microphysics kernel identified in [milroy2016] and compare the normalized root mean squared (RMS) values computed by the kernel with AVX2 disabled to those with AVX2 enabled. KGen flags 42 variables as exhibiting normalized RMS value differences exceeding the threshold. Here, we determine whether our iterative refinement procedure can find some of these variables given the CAM outputs most affected by AVX2 instructions.
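The kernel-level screen can be sketched as below. The exact normalization KGen applies is not stated here, so this version normalizes the RMS difference by the reference run's maximum magnitude (an assumption), and the variable names are illustrative:

```python
import math

def normalized_rms_diff(ref, test):
    """Normalized RMS difference between two runs of a kernel variable,
    in the spirit of KGen's comparison (normalization assumed)."""
    rms = math.sqrt(sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref))
    scale = max(abs(x) for x in ref) or 1.0
    return rms / scale

def flag_variables(ref_runs, test_runs, threshold):
    """Variables whose normalized RMS difference exceeds the threshold."""
    return sorted(v for v in ref_runs
                  if normalized_rms_diff(ref_runs[v], test_runs[v]) > threshold)
```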
Inducing a subgraph on assignment paths that compute the CAM output variables affected by enabling AVX2 instructions (selected by lasso) results in the graph in Figure 7(a) (4,801 nodes and 9,329 edges). Five of the 42 variables identified by KGen are present in this subgraph, all of which are in the blue community of Figure 7(b). This community contains the CAM core physics processes, of which MG1 forms a central part. The node with the largest eigenvector in-centrality is the temporary, dummy variable dum in Figure 7(c). Four of the five variables with normalized RMS values exceeding our threshold are in the top 15 nodes with the greatest in-centrality. These variables are nctend, qvlat, tlat, and nitend. The fifth variable, qsout, is modified by qniic (in the top 15 most central nodes) in an assignment statement. All five variables have paths that terminate on all 15 most central nodes. That our iterative refinement procedure would sample and identify the locations of nodes known to be most affected by AVX2 instructions on the first iteration is a testament to the potential utility of our method, particularly in the challenging case where hardware or CPU instructions cause statistical distinguishability.
6.5 AVX2 in the CESM graph
Here, we deviate slightly to discuss how centrality can be used to identify Fortran modules crucial to information flow in the overall CESM graph. While the MG1 module and its constituent variables are causes of ECT failure with AVX2 and FMA enabled, these instructions can be generated in many CESM modules. This suggests computing the (in- and out-) centrality of the modules themselves (rather than individual variables) to rank them by their potential to propagate FMA-caused differences within CESM. This viewpoint applies to other machine instructions or hardware errors as well.
To calculate the centrality, we collapse the graph of variables into a graph of modules by taking the graph minor of the CESM graph formed by the quotient over Fortran modules. A graph minor of a graph G is obtained by contracting edges of G. This contraction identifies (collapses) nodes under an equivalence relation: if two nodes are equivalent, they are replaced by a single node. Edges between equivalent nodes are deleted, and edges between remaining nodes and the new node are preserved. In this case the equivalence relation is "u and v are in the same CESM module" (modules become equivalence classes). Applying this equivalence relation to the CESM graph yields a digraph of 564 nodes and 4,263 edges. Selectively disabling AVX2 on the top 50 modules ranked by centrality results in a substantial reduction in the UF-CAM-ECT failure rate compared with AVX2 enabled on all modules. Furthermore, this approach exhibits a substantially lower failure rate than disabling AVX2 on 50 random modules, or even on the top 50 modules by lines of code (see Table 1 for the failure rates). These results indicate that eigenvector centrality accurately captures the information flow between CESM modules and provides a useful ordering. Selective disabling of instructions such as AVX2 balances optimization against statistical distinguishability and leads to more efficient CPU usage.
Experiment                         | ECT failure rate
-----------------------------------|-----------------
AVX2 enabled, all modules          | 92%
AVX2 disabled, 50 random modules   | 85%
AVX2 disabled, 50 largest modules  | 76%
AVX2 disabled, 50 central modules  | 13%
AVX2 disabled, all modules         | 2%
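The module collapse described above can be sketched in a few lines; the edge-list encoding and the module map are assumptions, and within-module edges simply vanish under the quotient:

```python
def module_quotient(edges, module_of):
    """Collapse the variable-level digraph into a module-level digraph
    (the graph minor formed by the quotient over Fortran modules):
    variables in the same module are identified with one node, and
    edges between equivalent nodes are deleted."""
    minor = set()
    for u, v in edges:
        mu, mv = module_of[u], module_of[v]
        if mu != mv:
            minor.add((mu, mv))
    return minor
```

Module-level centrality can then be computed on the resulting (much smaller) digraph exactly as for variables.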
7 Conclusions and future work
The goal of this study is to make root cause analysis of CESM possible. To this end, we create a toolkit to convert the CESM source code into a digraph, together with metadata that represents variable assignment paths and their properties. We develop an efficient hybrid static program slicing approach that combines code coverage with BFS. We combine the Girvan-Newman algorithm with eigenvector in-centrality in series to enable runtime sampling of critical nodes. We perform experiments based on CESM output to demonstrate in simulation how our process can find causes of model discrepancy. Finally, we provide evidence that our methods accurately characterize information flow at runtime by successfully finding variables determined to be susceptible to FMA instructions. Creating a method to identify which variables to sample to refine the root cause search space is a significant accomplishment. However, developing and implementing a sampling procedure for the running model is a challenging undertaking that remains to be done. We note that creating a Python interface for LLVM [lattner2004] with Flang [flang2017] would allow our parsing to succeed on any compilable Fortran code, and that integrating C/C++ capability through Clang [clang] is also desirable.
Acknowledgements
We acknowledge Dong Ahn, Ganesh Gopalakrishnan, Michael Bentley, John Dennis, and Sriram Sankaranarayanan for their helpful advice. This research used computing resources provided by the Climate Simulation Laboratory at NCAR’s Computational and Information Systems Laboratory (CISL), sponsored by the National Science Foundation and other agencies.
8 Supplementary material
8.1 Parsing
Subprogram call arguments can be as deeply nested (in terms of, e.g., function composition) as the stack permits, and composed of functions, arrays, strings, derived types, etc. We process such expressions by treating each argument as a tree and successively mapping outputs of lower levels to corresponding inputs above. Each output gets an edge to the input of the layer above, which injects the call's graph structure into the CESM directed graph. The top-level argument output is connected to the subprogram's corresponding argument in its definition. The complexity of discerning functions from arrays is addressed by constant-time lookups in the metagraph class function hash table. An additional complexity of assignment statements containing functions is that the expression right-hand side (RHS) can contain a large (compiler-determined) number of functions, so processing expressions with many deeply nested functions on the RHS can be expensive. Ultimately, the RHS variables, arrays, and function outputs are given edges to the left-hand side. Within a composite function, each function's internal variables form a path connecting its inputs to its outputs, in order of depth, and the output of the right-hand side receives a directed edge to the left-hand side.
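The argument-tree mapping can be sketched as a small recursion. The expression encoding as nested tuples and all names here are illustrative, not CESM identifiers (internal paths inside each function body are elided to a single node per call):

```python
def wire_expression(expr, lhs, edges):
    """Inject an assignment such as  lhs = f(g(x), y)  into the digraph:
    each argument's output receives an edge to its enclosing call, and
    the outermost output receives a directed edge to the left-hand side."""
    def output_of(node):
        if isinstance(node, tuple):         # a subprogram call: (name, args...)
            fname, *args = node
            for a in args:                  # lower-level outputs feed inputs above
                edges.append((output_of(a), fname))
            return fname
        return node                         # a plain variable
    edges.append((output_of(expr), lhs))
    return edges
```

For a = f(g(x), y), the recursion emits x → g, g → f, y → f, and finally f → a, reproducing the depth-ordered wiring described above.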
8.2 Hashimoto non-backtracking centrality
Scale-free, or power law, graphs whose degree distributions follow a power law with exponent magnitude greater than 2.5 are identified as causing localization in [martin2014]. The degree distribution of the total CESM graph approximately follows a power law, as can be seen in Figure 9. Induced subgraphs of the CESM graph are also approximately scale-free, consistent with the properties of such graphs (see Figure 10 for the GOFFGRATCH experiment subgraph). A natural question is whether the concentration of centrality on graph hubs has undesirable effects on the ranking of nodes. The application of non-backtracking, or Hashimoto, centrality [hashimoto1989] as a substitute for eigenvector centrality in power law graphs is discussed in [martin2014]. We compare the two centralities in Figure 11 for the GOFFGRATCH experiment. The Hashimoto non-backtracking centrality indeed redistributes centrality from the hubs to other nodes, but the effect is subtle until well down the ranking. Also note that the Hashimoto centrality does not provide a rank for all nodes in the subgraph, as seen in the sharp drop at the end of its curve in Figure 11. This is because the Hashimoto centrality operates on an adjacency matrix over the subgraph's edges (akin to the line graph), which excludes nodes with no neighbors. Although we determine that the non-backtracking centrality provides at best marginal improvement over eigenvector centrality for our graph, we provide a derivation based on the one in [martin2014]. Hashimoto centrality may prove beneficial for models whose graphs follow power laws that produce more pronounced localization [martin2014].
8.2.1 Centrality derivation
This section is a reformulation of the derivation in [martin2014], which we have reworked in the interest of clarity.

Let $G$ be a graph with adjacency matrix $A$, set of nodes $V$, and set of edges $E$. The graph order is the number of nodes, $n = |V|$, while the graph size is the number of edges, $m = |E|$. Note that $m$ is the number of nonzero entries in $A$ if $G$ is directed, and the number of nonzero entries in the upper or lower triangle of $A$ if $G$ is undirected.

Let an edge $e \in E$ be represented as $e = (u, v)$. If $G$ is directed, the order represents direction: $u \to v$. If $G$ is undirected, $(u, v) = (v, u)$. Let $N(u)$ be the set of neighbors of node $u$.

The Hashimoto, or non-backtracking, matrix [hashimoto1989] of graph $G$ is denoted $B$ and is an adjacency matrix on $E$:
$$B_{(u \to v),\,(w \to x)} = \delta_{vw}\,(1 - \delta_{ux}),$$
where $\delta$ is the Kronecker delta. For an undirected graph, each edge $\{u, v\}$ becomes two ordered pairs $(u \to v)$ and $(v \to u)$. Thus $B$ is $m \times m$ if $G$ is directed, and $2m \times 2m$ for undirected $G$. $B$ is closely related to the line graph $L$, which is also an adjacency matrix on $E$:
$$L_{(u \to v),\,(w \to x)} = \delta_{vw}.$$
Instead of computing the eigenvector centrality on $A$, we use $B$. Let $\lambda$ be the Perron-Frobenius (leading) eigenvalue of $B$, and let $\mathbf{v}$ be the corresponding eigenvector. Then the out-centrality (corresponding to out-edges) for an edge $u \to v$ can be derived by starting from the eigenvector equation $\lambda \mathbf{v} = B \mathbf{v}$, which componentwise reads
$$\lambda\, v_{u \to v} = \sum_{w \in N(v) \setminus \{u\}} v_{v \to w}.$$
To compute the in-centrality used in this work, we reverse the directed edges of $G$ (via the transpose $A^{T}$). The full non-backtracking centrality of node $v$ is then
$$x_v = C \sum_{u \in N(v)} v_{u \to v},$$
where we are free to choose a constant $C$ to normalize the centrality.
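To make the definition of $B$ concrete, the following sketch builds the non-backtracking matrix for a small undirected graph as a sparse dictionary over pairs of arcs (the encoding is an assumption; each undirected edge contributes two arcs):

```python
from itertools import product

def hashimoto_matrix(undirected_edges):
    """Build the non-backtracking (Hashimoto) matrix B as a sparse dict:
    B[(u->v), (w->x)] = 1 iff v == w and x != u, i.e. the walk continues
    through v without immediately backtracking to u."""
    arcs = []
    for u, v in undirected_edges:
        arcs += [(u, v), (v, u)]            # each undirected edge -> two arcs
    B = {}
    for (u, v), (w, x) in product(arcs, arcs):
        if v == w and x != u:
            B[((u, v), (w, x))] = 1
    return arcs, B
```

On a triangle with a pendant edge, the pendant arc pointing away from the graph has no non-backtracking continuation, which is exactly why nodes with a single neighbor can drop out of the ranking.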
8.3 Additional experimental results
Note that we refer to the Girvan-Newman algorithm ([girvan2002, newman2004]) as GN.
Experiment | Output variables | Internal variables
-----------|------------------|-------------------
WSUBBUG    | wsub | wsub
RANDOMBUG  | omega | omega
GOFFGRATCH | aqsnow, freqs, cldhgh, precsl, ansnow, cldmed, cloud, cldlow, ccn3, cldtot | qsout2, freqs, clhgh, snowl, nsout2, clmed, cld, cllow, ccn, cltot
DYN3BUG    | vv, omega, z3, uu, omegat | v, omega, z3, u, t
RANDMT     | flds, taux, snowhlnd, flns, qrl | flwds, wsx, snowhland, flns, qrl
AVX2       | taux, trefht, snowhlnd, ps, u10, shflx | wsx, tref, snowhland, ps, u10, shf
8.3.1 GOFFGRATCH
This supplementary section includes both the first and second iterations of GOFFGRATCH (Figures 12 and 13, respectively). The second iteration of this experiment does not appear in the paper.
8.3.2 RANDOMBUG
We select the module for this bug by randomly choosing a module from the set of CAM modules known to be executed by our simulation in the first time step. We introduce an error in the array index of a variable used to assign the contents of the derived type containing physics state variables (t, u, v, etc.), in particular the state variable omega. As in the previous experiment, this change results in a UF-CAM-ECT failure. Omega is output to file with the value state%omega, so we use "omega" as the canonical name for generating the induced subgraph. This experiment is more challenging than WSUBBUG, as omega is computed in other CAM modules, yielding a subgraph of 628 nodes and 295 edges. Applying the GN algorithm to the remaining nodes identifies several small communities (fewer than 30 nodes each), one of whose most central nodes is the bug source. See Figure 14.
8.3.3 DYN3BUG
Another example of a bug consisting of a single-line change is located in a dynamics subroutine that computes hydrostatic pressure in the CAM core. The bug chiefly affects the five variables listed in Table 2. We apply our iterative refinement to the induced subgraph of 6,017 nodes and 11,512 edges (Figure 14(a)) and successfully separate the red dynamics community from the blue physics community. Instrumenting the light blue, most central nodes in Figure 14(c) would detect a difference in values between ensemble and experimental runs, as at least one instrumented node is reachable from the bugs. Inducing a subgraph on nodes contained in paths terminating on the central nodes connected to the bugs further reduces the size of the subgraph. In this way the subgraph will become small enough to instrument all nodes, or the exact bugs will be sampled.
8.3.4 AVX2
Figure 17 demonstrates that restricting our induced subgraph to variables present in CAM is not necessary. This subgraph is created with the same affected-variable list as the AVX2 subgraph in Section 6.4, but allows nodes outside of CAM (such as in the land model). Although the graph is larger (5,162 nodes and 9,873 edges), it manifests the community structure of the CAM core (orange cluster). Note that these communities are produced from two divisions of the GN algorithm rather than our default one, as the models' structure is less evident with a single division. This suggests that the same conclusions are reached with this subgraph after a greater number of iterations.
8.4 Centrality output examples
In this section, we provide in-centrality output to corroborate our assertion in paper Section 5.2, namely: "Although sampling the whole core's most connected nodes may detect floating point differences between ensemble and experimental runs, instrumenting highly connected nodes in each community instead can reduce the distance between instrumented variables and bug locations (reducing the number of iterations needed to refine the search space)." In the supplementary file avx2subgraphincentrality.txt, we include the output of the eigenvector in-centrality computed for the AVX2 subgraph. (Supplementary files are available from the authors by request.) The output lists the top 500 most central nodes in the subgraph as (node, centrality value) tuples in descending order of centrality. Note that the displayed node name appends the name of the subprogram containing the variable, which ensures unique node names in the graph. For example, the top node by centrality is dum__micro_mg_tend, which corresponds to dum in the micro_mg_tend subroutine (part of the MG1 microphysics kernel). A substantial number of nodes appear as min_line#__subprogram. These are Fortran intrinsic procedures like min or max, which we introduce into the graph by creating paths from their inputs to themselves, and then to their outputs. This ensures correct dependency paths for these intrinsic procedures. Nearly all of the top 500 most central variables are located in the MG1 kernel; nodes in MG1 mask the centralities of nodes in other modules, or clusters of modules.
Supplementary file avx2community1incentrality.txt lists the tuples in descending order of centrality for the first GN community of the AVX2 subgraph. Notice that these tuples are very similar to those seen in the centrality output for the entire AVX2 subgraph. However, the second GN community of the AVX2 subgraph (avx2community2incentrality.txt) contains nodes not in MG1 or closely associated modules. In fact, the top nodes in the second community are not present in the top 500 nodes of the AVX2 subgraph, or of the first community. This supports our choice of partitioning the subgraph into communities with GN, since we can search multiple connected clusters in parallel. Furthermore, if the sources of output discrepancy for AVX2 were in the second community, our sampled nodes would be closer to the sources. For the AVX2 experiment the sources include the most central node (and several others in the top 15), but for DYN3BUG (Section 8.3.3) the bugs are in the second community. This can reduce the number of iterations necessary to converge on the discrepancy sources.