Technology scaling has slowed, and continued increases in application performance must come from innovation beyond an improved switching device. Architectural innovation has therefore focused on deriving higher throughput from the energy efficiency of specialized hardware. This specialization has driven system diversity, with modern Systems on Chip (SoCs) populated by a heterogeneous mix of accelerators. Consequently, migrating a software base to such a system is labor intensive, requiring a programmer to map and optimize each stage of an application pipeline onto a specific hardware unit. In this paper we present a tool that detects and describes an application's parallel kernels, facilitating both the selection of and mapping to heterogeneous accelerators.
Applications are composed of a producer-consumer interconnection of kernels (section 3). Parallel kernels are semantic collections of basic blocks that are clustered temporally, recur many times, and can be executed simultaneously without distorting the result. They are built from a variety of programmatic structures, including loops, recursion, and library calls, and they dominate the execution time of an application: kernels account for over 99.9% of all code executed in our test applications (figures 8 and 9).
Current approaches for detecting kernels usually require that code be preformatted to make kernels explicit. High Level Synthesis (HLS) relies upon hand-annotated code with compiler directives indicating kernel structures. Domain specific languages (DSLs) like CUDA and Halide function as wrappers that put kernels into a format that labels them explicitly. Static analysis techniques such as polyhedral analysis detect some kernel types but often struggle to find recurring structures that are not written as loops with static boundaries; input code must therefore be written with the static analysis tools in mind. Detecting kernels in naive code is currently not viable for most programmatic structures. Dynamic traces are powerful and can enable better program optimization or profile an application to enable introspection of the computation (Dissegna and Pin); however, dynamic traces have not been used extensively thus far due to their relatively high cost in trace time, trace size, and analysis time. To fully analyze naive code, a new approach is needed.
We have developed TraceAtlas, a tool that enables the tracing of an application with a time dilation factor of only nine. TraceAtlas can also detect kernels from naive, unformatted code with only ten megabytes of memory. The detected kernels can then be analyzed to find the true producer-consumer relationship between kernels based on execution rather than expression. Figure 1 demonstrates the entire pipeline for TraceAtlas to take source code and label the producer-consumer relationship between kernels. First, LLVM IR is annotated to produce a dynamic application trace. This trace is then analyzed temporally to determine the affinity between basic blocks; the detected collections are then compared to the execution of the trace to produce kernels. Finally, the memory accesses of the kernels within the trace are analyzed to determine their producer-consumer relationships.
We present a method to extract parallel kernels from unannotated LLVM-compatible languages using dynamic trace analysis. Such a tool would support innovation across many use cases. The dynamic trace is created by selectively emitting compressed LLVM intrinsics at run time (section 4). We have developed a log-space algorithm to detect the kernels from the input application trace that permits reading the trace in minutes (section 5). The results of our trace improvements and kernel extraction are presented in section 6.
We evaluate our approach by analyzing four different design methodologies used to write kernels: for loops, recursion (section 7.1), kernel libraries (section 7.2), and interface libraries (section 7.3). In addition, we compared the results of our analysis of a radio Orthogonal Frequency-Division Multiplexing (OFDM) system to the kernels an expert predicted and achieved identical results (section 7.4).
Our tracing technique has a speedup of 230 times over a naive implementation and 49 times over the current state of the art. Our algorithm has successfully identified over 10,000 kernels in 392 applications from 16 libraries. We discuss the limitations of our approach and enumerate some of the potential uses for our tool's dynamic traces and extracted kernels in section 8.
The contributions of this work are:
TraceAtlas (available from CodeOcean at https://codeocean.com/capsule/2cb73b4e-11f9-4547-8fe3-4b4956d3d251/ and GitHub at https://github.com/ruhrie/TraceAtlas): an open source tool for fast-in-time and small-in-space dynamic tracing of applications
A logarithmic space, linear time algorithm for the extraction and legalization of kernels from a dynamic application trace that can approach terabytes of data
A formal definition for parallel kernels which encapsulates most other kernel definitions while describing some of the important kernel attributes
2 Background and Motivation
Kernels represent the vast majority of program computation and are among the most important code segments to optimize. A kernel is a semantic collection of basic blocks in a program that are clustered temporally and recur many times. Currently, kernels are only detectable if the programmer wrote their code in the specific way modern tools expect: kernel-based tools require kernels to be written explicitly, either through code annotations or in DSLs. The objective of this work is to demonstrate a method of detecting and extracting kernel code from unformatted code. Dynamic tracing is a promising technique for this, but current techniques are inefficient to the point of failure on larger programs, often cannot trace the entire program, and have consequently seen limited use.
The basic blocks that compose a kernel are primarily hot code. By optimizing the hot code, large performance improvements can be made. Greendroid utilized a hot code detector to identify the portions of code which represented large portions of the energy consumption. By identifying and transferring this computation to specialized hardware, they were able to save significant energy. This hot code forms the seed of a kernel in our approach and can be used to identify temporally recurring code segments.
To discover kernels, it is necessary to have a formal definition. Numerous projects have built independent custom DSLs, each with a kernel definition specific to its type of kernels. Gramps and Halide focus on image processing kernels. Regent emphasizes the types of kernels found in high performance computing. RAPID contains kernel structures that are specific to FPGA synthesis. PaRSEC specializes in scheduling streaming kernels on heterogeneous architectures. StreamIt calls every kernel a filter and creates output data by pulling it from the output, requesting each producer in the chain to create the required data. This is just a small subset of the DSLs that have been developed to represent kernels. All of these techniques ask the user to write the body of the kernel explicitly as well as some metadata that specifies how the kernels are connected. A holistic kernel definition must bridge these definitions.
The discovery of parallelism in an application is currently focused on the extraction of individual loop-parallelism through code annotation and formatting. Polyhedral analysis tools like LLVM-poly are able to detect loops in code through static code analysis; however, techniques such as these are limited to loop detection and will require that the loops be written in a way that the tool recognizes. HLS works around this by requiring that all loops of interest be manually annotated. This technique allows the programmer to specify exactly what should happen to their code, with the caveat that the user must know how to optimize the code themselves and what kernels are in their code.
Current static analysis tools fail to detect several important variants of kernels. Dynamic loop boundaries make it impossible to know how parallel a loop is without fully executing it. Recursion is another example: the number of recursive calls is not inherently obvious and may be difficult, if not impossible, to determine statically. An FFT, for example, will recurse down log₂(n) times, and this behavior cannot be determined without knowing n in advance, something that may or may not be known at compile time.
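The runtime dependence above can be illustrated with a small sketch. The helper below is hypothetical (not part of TraceAtlas or any benchmark): it counts the recursion levels of a radix-2 divide-and-conquer FFT, which cannot be bounded without the runtime value of n.

```python
def fft_depth(n: int) -> int:
    """Recursion levels of a radix-2 decimation: n halves on each call."""
    if n <= 1:
        return 0      # a single sample needs no further splitting
    return 1 + fft_depth(n // 2)

# For power-of-two n the depth is log2(n), but n is a runtime value:
# no static analysis of fft_depth alone can bound the recursion.
depth = fft_depth(1024)   # 10 levels
```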
A dynamic trace is a history of the computation that records which operations occurred in which order. Aladdin created dynamic traces from C applications to predict the performance of a custom-built ASIC based upon its C implementation. By analyzing everything the application executed, it could make accurate predictions about the expected cost of the computation. Wasabi is another dynamic tracing tool; it creates traces from WebAssembly code to enable further static analysis techniques. Aladdin is only able to detect single kernels from an application, requiring that the source code be prepared in advance, and it takes a long time to trace larger applications due to a lack of tracing optimizations. Wasabi improves upon this by tracing all input code with a time dilation factor of only 70, but it only supports high-level WASM, not lower-level operations.
Dynamic traces have seen some limited use as an extension to JIT compilation. If kernels are labeled in advance, a minimal dynamic trace that primarily stores address information can be used to detect dependencies. This currently relies upon having the kernel labels available in advance, and without this information, everything must be held in memory. This makes JIT analysis infeasible for larger applications and larger kernel optimizations.
Once kernels are known, a myriad of static program optimizations can be accomplished. The simplest is to reformat the kernels into an alternative that is more easily digested by polyhedral analysis. The extracted kernels can be identified as a target for optimization by genetic algorithms. Similarly, because the kernels represent a large proportion of the computation, they are an optimal target for approximate computing and can achieve the best improvements. Kernels can also be compiled at this point to run on a more efficient, compatible architecture that specializes in kernels, like a GPU or TPU. Kernel classification through ML inference can also direct compiler optimization or the direct exchange of naive code for expertly tuned library calls such as FFTW. Automatically detecting kernels will also enable identifying emergent classes of kernels that have not yet been expertly categorized.
3 An Overview of Kernel Properties
Kernels are a semantic collection of basic blocks that are clustered temporally and recur many times. Basic blocks represent unconditionally executed code sequences, thus membership in a kernel applies either to all or none of the instructions within. Sub-programs (i.e. functions and inner-loops) are composed of multiple basic blocks executed in sequence, and a kernel is represented by their graphical collection. Given the sequential execution of these sub-programs, basic blocks that are executed close-in-time are likely to be members of the same kernel.
Kernels are executed many times in the course of a program; therefore, the basic blocks that compose them must also recur many times. They are not composed exclusively of high-frequency basic blocks but also of low-frequency blocks that occur occasionally in the path of the computation: boundary conditions and special-case control flow rarely occur but are still part of the kernel. This is functionally different from hot code, which consists of basic blocks labeled "hot" simply for executing a sufficient number of times. Kernels, should they lack loop-carried dependencies between instances, can be scheduled entirely in parallel or as a hybrid of parallel and sequential operation; most kernels can be rewritten to remove these dependencies, making them intrinsically parallel. Finally, kernels have a semantic meaning, representing a block of computation with a specific function and a producer-consumer relationship with peer kernels.
To summarize, a kernel is an amalgamation of several different code properties. Specifically, kernels are composed of basic blocks that:
are grouped to form collections that are semantically related in purpose
have a high probability of being temporally adjacent in execution
are executed many times
Our definition captures all parallel kernels defined by Asanovic et al. For loops, the prototypical kernel, contain the temporal adjacency between the basic blocks and occur as many times as the loop iterates. Recursive function calls, an atypical kernel configuration, contain the same temporal adjacency and recur to the recursion depth. Another kernel configuration, a task scheduler (not included in this work), will execute individual kernels for smaller iteration counts, interleaving the execution of kernels. As long as the kernels are not consistently executed in the same order, the basic blocks in this design layout still maintain the temporal affinity and recurrence behavior of kernels; should they be consistently scheduled in the same order, they will appear as a single fused kernel. Combinatorial logic will only be detected as a kernel should it be deemed hot code, so it is either not a kernel by our definition or is contained within a loop of some description. Finite state machines are composed of two primary components: a state controller and the state actions. These will be detected as two different kernels rather than one if the FSM is composed of hot code.
Figure 2 contains an example decision-to-decision path (DD-path) which contains two kernels. The red kernel is composed of basic blocks B, C, and D which are repeated many times over. The blue kernel is later in the pipeline and consumes data from the red kernel. It is composed of basic blocks F, G, and H. These two cycles are prototypical kernels and arise from a loop (red kernel) and a recursion (blue kernel).
This definition of kernels allows for one kernel to be contained within another. Since a kernel is a collection of basic blocks that recurs, should there be another recursion within the kernel, such as within a nested for-loop, the inner kernel will be composed of a subset of the basic blocks within the parent. Additionally, this is a one-way relationship. A kernel contains child kernels, but a kernel can be called in multiple contexts and thus have many parents. An example of this would be an FFT. Traditionally an FFT8 contains two FFT4s and a twiddle kernel. FFT4 and the twiddle are sub-kernels nested within FFT8, but the FFT4 can be called from another kernel as well. This creates a hierarchy of kernels from the outside-in with the outermost kernel being responsible for scheduling and memory operations and the inner kernels recurring over many input data sets.
Figure 4 contains an example DD-Path for a computation containing two kernels, one nested within the other. The code originates from the SHOC benchmark and was acquired from Aladdin's GitHub page. Each basic block in the DD-Path is assigned a color which corresponds to the highlighting color on the right. Kernel A contains only the looping part of the inner for loop; single-execution parts of the for loop are excluded because they run only once. Kernel B contains all the basic blocks of Kernel A, thus the blocks of Kernel A are guaranteed to be a subset of the blocks in Kernel B.
4 Low Overhead Dynamic Traces
Dynamic tracing with current techniques is prohibitively expensive in both runtime dilation and disk utilization. A naive trace which only stores executed instructions produces a large amount of data, often exceeding 100 gigabytes for a program that runs for a single second. A few simple, high-level optimizations dramatically ameliorate this problem. This paper evaluates the following optimizations:
Compressing the trace data with Zlib
Clustering the trace writes to lower OS overhead
Encoding trace information before it is compressed with Zlib
These optimizations allow us to shrink the execution time to a runtime factor of nine and produce a trace that rarely exceeds five gigabytes. Table I demonstrates the relative improvements and cost of each optimization.
[Table I: per-optimization trace size and time dilation; surviving excerpt row: Zlib (SoA) | 1.704 | 438.1 | Compress output]
TraceAtlas generates dynamic traces by injecting logging statements into the LLVM IR. The naive solution is to simply export the IR as the application executes. This generates a trace in plain text and is relatively robust; however, applications today run billions of operations a second. By exporting all information with no processing, the disk always becomes the bottleneck in both write speed and storage size.
Trace data can be compressed by Zlib as it is generated. This is the current state-of-the-art solution used by Aladdin. Trace data is extremely low in entropy and can thus achieve exceptionally high compression ratios. This optimization moves the bottleneck from disk write speed to processor execution. Due to the high repetitiveness of the data, Zlib easily identifies the vast majority of the repetition in the traces, and different compression levels did not appreciably impact trace size or time. Overall, this optimization doubled the execution time but shrank the trace size by 20x on small applications. Larger applications dramatically increase the runtime, as Zlib operates better on smaller inputs. We do not have exact numbers for the largest applications because naive tracing could not produce a trace that fit on the hard drive, but intermediate applications had an overhead ranging from 500-2000x relative to the original execution.
Exporting information to the trace as soon as it becomes available is relatively expensive due to system call overhead. There are two primary overheads in dumping trace information: the call overhead to export the information itself and the cost of compressing the data with Zlib before writing it to disk. Since the instructions of a basic block always run in order, it is possible to dump each basic block's trace data once rather than per instruction; this removes many export calls that would otherwise scale with the size of the basic block. Similarly, Zlib compresses most efficiently when it has a relatively large chunk of data to process; waiting until 4-128 KB of data are available lowers the cost of compressing the trace. The combined effect of these two optimizations was a 25% trace time improvement.
Ultimately, what Zlib compresses is an encoding of the basic blocks. This information is known at compile time, and encoding it manually yields significant performance improvements. By assigning a key to every basic block and exporting that key instead of the actual IR, the compression effort is significantly reduced while critical information about the structure of the computation is maintained. The result was that the trace size halved again while the total time to trace fell to an overall time dilation of nine.
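The three optimizations can be sketched together in a few lines of Python. This is an illustrative model only, not TraceAtlas's actual implementation (which instruments LLVM IR natively); the class name, the `B:<id>` key format, and the 64 KB buffer size are assumptions for the sketch.

```python
import zlib

class TraceWriter:
    """Sketch of a buffered, encoded, Zlib-compressed trace exporter."""

    def __init__(self, path, buffer_bytes=64 * 1024):
        self.out = open(path, "wb")
        self.compressor = zlib.compressobj()
        self.buffer = bytearray()
        self.buffer_bytes = buffer_bytes   # flush once ~64 KB accumulates

    def log_block(self, block_id: int):
        # Export a short compile-time key instead of the raw IR text.
        self.buffer += b"B:%d\n" % block_id
        if len(self.buffer) >= self.buffer_bytes:
            self._flush()

    def _flush(self):
        # Hand Zlib one large chunk; many small writes compress poorly
        # and multiply the system-call overhead.
        self.out.write(self.compressor.compress(bytes(self.buffer)))
        self.buffer.clear()

    def close(self):
        self._flush()
        self.out.write(self.compressor.flush())
        self.out.close()
```

The design point is the ordering: encode first (cheap, compile-time keys), buffer second (amortizes call overhead), compress last (on large chunks, where Zlib is most effective).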
Depending on the demands of the application, additional information beyond the execution path may be required, adding further overhead. For our purposes, we were interested in the memory dependencies between kernels in order to extract their producer-consumer relationships (see section 8). To do this, we also exported the addresses of all loads and stores. This raised the trace time dramatically and produced a significantly larger trace, because varying addresses are harder for Zlib to compress. Aladdin's implementation exports the addresses of all load and store operations, but an in-house variant was written with support for additional features; as a result, addresses are not represented in the numbers reported in Table I. Other trace values can also be exported for additional overhead.
TraceAtlas has minimal overhead and can trace any application with LLVM IR. Table I contains an overview of our performance improvements, with the average trace size and average time dilation factor for all applications that could be traced naively with less than a terabyte of disk and traced with Zlib in under 48 hours. The cumulative effect of all our optimizations is a runtime speedup of 400x on small programs versus the state of the art and over 100,000x for larger programs. Trace sizes halved for smaller applications and shrank by 1500x for large programs. These levels of overhead bring the execution of a traced application to nearly real time. As a result, traces can be generated quickly with minimal data stored. This alleviates the costs of dynamic tracing and makes it a viable tool for additional optimizations, such as the identification of kernels.
5 Kernel Extraction
A kernel is a subsection of a program that executes many times throughout the lifespan of a program. This subsection gets transformed into basic blocks as part of the compilation process. As a result, a kernel can be represented as a collection of basic blocks that are connected in some type of looping structure. This may include a standard loop, a recursive function, some other cyclic structure in their DD-Path, or a combination of these three.
Detecting these kernels is a multi-step process. The first step is to cluster the basic blocks from the source application into collections of temporally adjacent basic blocks using a greedy, heuristic algorithm (section 5.1). Once segmented, these collections may contain duplicate kernels or be missing individual basic blocks. To fix this, a smoothing algorithm is applied to remove these errors and transform the collections into kernels (section 5.2).
5.1 Kernel Detection
The defining attribute of kernels is a high degree of temporal affinity between basic blocks within the kernel. If a single basic block from a kernel within a trace is examined, there is a high probability that the prior and subsequent basic blocks will also be part of the kernel. We refer to this as the basic block affinity, most easily calculated as the probability that one basic block occurs within a distance d of another. Distance refers to the number of basic blocks executed between the current basic block and the target basic block.
More generally, basic block affinity is the probability that any one basic block occurs within a range of another. Our approach uses a uniform weighting over an execution window of seven basic blocks, but other PDFs can be used to achieve similar results. A window refers to the range of basic blocks that are analyzed during the computation. Our results in section 6 informed our choice of window width.
Using this affinity metric, we can calculate a score between any two basic blocks by summing over the size of the window, as is done in equation 1. The final score is divided by the size of the window so that the scores mimic a probability. A kernel is then defined as a set of basic blocks such that the mutual sum of the scores of all basic blocks in the set is greater than a certain threshold probability.
This approach, summarized by equation 1, is advantageous because it only requires storing the previous window of basic block IDs in memory. Because traces can contain billions of basic blocks or more, a logarithmic-space algorithm is mandatory to analyze the trace. This approach scales linearly with the number of basic blocks in the source application and logarithmically with the size of the trace. Since the trace size is dramatically larger than the number of basic blocks in a typical application, equation 1 satisfies this constraint.
With the affinity scores calculated, one can then represent the collection of basic blocks within a program as a fully connected digraph with edge weights equal to the given score. Figure 3 contains an example set of values where each edge is the maximum of the two edges between the nodes. Within Figure 3, each kernel is composed of a set of basic blocks such that the sum of the weight from every node within a kernel to every other node within a kernel sums to at least the target threshold, 0.95 in this scenario.
The structure of this affinity allows us to store only the IDs of the most recent window of basic blocks. Using this small amount of memory, we increment a counter for each basic block within range of the current basic block, which generates the affinity numbers. This algorithm uses only logarithmic space and linear time, allowing for quick analysis of traces that can exceed terabytes of raw data.
Figure 5 presents a method for calculating the sum given in equation 1 with minimal memory. As the trace is streamed through, the prior window of blocks is kept in memory. For each block read, a counter in a matrix is incremented at that block's row for every block in the window, including itself. Finally, each value is normalized by the number of times that block occurs.
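A minimal sketch of this streaming computation follows, assuming the uniform window weighting described above. The function name and the exact normalization are illustrative, not TraceAtlas's precise formulation of equation 1.

```python
from collections import deque, defaultdict

def affinity_matrix(trace, window=7):
    """One pass over the trace; memory is bounded by the number of
    distinct blocks (squared) plus the window, not the trace length."""
    counts = defaultdict(int)                     # occurrences of each block
    pair = defaultdict(lambda: defaultdict(int))  # co-occurrences in window
    recent = deque(maxlen=window)                 # IDs of the last blocks seen
    for block in trace:
        counts[block] += 1
        for prior in recent:
            # count the co-occurrence in both directions
            pair[prior][block] += 1
            pair[block][prior] += 1
        recent.append(block)
    # Normalize each row by that block's count and the window size so the
    # scores mimic a probability.
    affinity = {b: {o: c / (counts[b] * window) for o, c in row.items()}
                for b, row in pair.items()}
    return affinity, counts
```

Because only `recent` and the per-block counters persist, traces far larger than memory can be streamed through this loop.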
With the affinity scores between basic blocks calculated, we segment the graph using the greedy algorithm given in Figure 6. The algorithm iterates through basic blocks, using them as seeds to create kernels. We first sort the basic blocks by frequency count in decreasing order. Every basic block that has not already been added to a kernel is then used as the seed of a new kernel, and we greedily add the basic block with the highest affinity score to the new kernel until we reach a desired threshold, currently 0.95. The resulting collection of basic blocks is such that there is at least a 95% chance of seeing another block from the kernel for every instance of a block in the kernel. Upon reaching the threshold, each of the blocks within the kernel is added to a set of explained blocks, which are no longer valid seeds for new kernels. Finally, once the seed's execution count falls below the hot code threshold (512 in this case), the algorithm terminates. Formal graph cuts were found to be unnecessary to properly segment the computation.
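The seed-and-grow procedure can be sketched as follows. This is a simplified model of Figure 6 in which the kernel's score is the accumulated affinity of the blocks added; the exact scoring in TraceAtlas may differ.

```python
def detect_kernels(affinity, counts, threshold=0.95, hot_min=512):
    """Greedy segmentation sketch: affinity[a][b] is a pairwise score,
    counts[b] is the execution count of block b."""
    explained, kernels = set(), []
    # Seed from the hottest blocks first so nested kernels are found first.
    for seed in sorted(counts, key=counts.get, reverse=True):
        if counts[seed] < hot_min:
            break                   # remaining blocks are not hot code
        if seed in explained:
            continue                # already belongs to a detected kernel
        kernel, score = {seed}, 0.0
        candidates = dict(affinity.get(seed, {}))
        while score < threshold and candidates:
            # Greedily take the highest-affinity block on the frontier.
            best = max(candidates, key=candidates.get)
            score += candidates.pop(best)
            kernel.add(best)
            # Widen the frontier with the new member's neighbours.
            for n, s in affinity.get(best, {}).items():
                if n not in kernel:
                    candidates[n] = max(candidates.get(n, 0.0), s)
        kernels.append(kernel)
        explained |= kernel
    return kernels
```

No graph cut is computed: accumulation stops as soon as the threshold is met, matching the observation above that formal cuts were unnecessary.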
For more consistent results, it is advantageous to sort the seed basic blocks by execution count in decreasing order. Nested kernels are guaranteed to have a higher execution count than their parent kernels. By sorting the seeds this way, we guarantee that we find kernels from the inside out and do not skip detecting a nested kernel because we identified the parent first.
It is necessary to exclude certain basic blocks from being seeds because the affinity scores can result in a distinct set of blocks that do not form a kernel, specifically in nested kernels. If a basic block from a parent kernel is selected as a seed and it is on the interface between the child and parent kernels, it is possible for the affinity to be slightly higher with the child than with the parent. This will result in a new kernel that is the child kernel with this basic block on the interface included. It is distinct from both the parent and child kernels and cannot be removed with the algorithm in section 5.2. Fortunately, not every basic block needs to be used as a seed, but rather only enough blocks to explain all the hot code. The basic block collections that are produced from this algorithm have the potential to have holes in them or even detect duplicate kernels. To resolve this, a second stage algorithm must be used to refine the kernels into a more useful form.
5.2 Kernel Legalization
The basic block collections detected in section 5.1 are only an abstract set of basic blocks that have a high probability of being temporally adjacent. This definition, although it takes advantage of some of the features of kernels, is mathematical and fails to properly represent the functional aspect of the source code. There are two primary flaws in the detected kernels: only one path of a conditional may appear, and larger kernels may be represented as multiple kernels. Both occur because the prior algorithm was only looking for temporal affinity. Kernels must be able to execute in a continuous path, which is not guaranteed based on temporal affinity alone.
If a kernel contains a conditional block, both paths will usually be detected; however, if one of the paths occurs rarely enough, there is a possibility that it will not be selected by the greedy algorithm. This happens relatively often in image processing kernels, where the edges of the image often require a conditional to either zero-pad the data or extend the size of the image. Figure (a) contains an example DD-Path for just such a case, where a conditional branch occurs 1% of the time and is thus not selected as part of the kernel.
Kernels that are significantly wider than the algorithm window will often be detected multiple times due to the nature of seed selection. The sets of basic blocks representing such a kernel may overlap minimally or not at all. Figure (b) is an exaggerated example of this occurring with an analysis radius of one for a five-block-wide kernel. The first kernel selects basic block B as its seed and hits the threshold using just basic blocks A, B, and C. Basic block D is unrepresented, so it is then chosen as a seed and hits the required threshold using blocks C, D, and E. Both kernels are valid detections, but they represent the same kernel and need to be fused.
To legalize a kernel, we need to find all the basic blocks that a kernel may potentially enter between iterations without entering a different kernel. The smoothed kernel can be collected by following the trace once more, now that the kernel sets are known. Our legalization algorithm streams through the trace and maintains, for each kernel, the set of basic blocks seen since we were last in that kernel's block set. If we re-enter the kernel before entering another kernel, we know these blocks are part of the kernel, so we add them to its set. If we enter a new kernel, we know we have exited the current kernel and clear its pending set.
This algorithm adds every basic block that occurs within the body of a kernel without adding basic blocks from other kernels. Nested kernels are still detected successfully with this algorithm because the set of new blocks are cleared when we encounter the beginning of a kernel again. Kernels that contain more than double the algorithm window will result in at least two identical sets of basic blocks which can be trivially detected as identical. Conditionals within the middle of a kernel are fused directly. Conditionals from the end of a kernel must ultimately reenter the kernel before a new iteration is started to branch back to the beginning of the kernel due to the way LLVM schedules conditionals.
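A compact sketch of this legalization pass, assuming kernels are given as sets of basic block IDs; the bookkeeping names (`pending`, `visited`) are illustrative, not TraceAtlas's internals.

```python
def legalize(trace, kernels):
    """Stream the trace once; absorb blocks seen between consecutive
    visits to a kernel into that kernel's block set."""
    smoothed = [set(k) for k in kernels]
    pending = [set() for _ in kernels]   # blocks seen since last in kernel i
    visited = [False] * len(kernels)     # have we entered kernel i yet?
    for block in trace:
        inside = [i for i, k in enumerate(smoothed) if block in k]
        for i in range(len(kernels)):
            if i in inside:
                if visited[i]:
                    # Re-entered kernel i before any other kernel: the
                    # interim blocks (e.g. a rare conditional path) belong
                    # to it.
                    smoothed[i] |= pending[i]
                visited[i] = True
                pending[i].clear()
            elif inside:
                # Entered a different kernel: we have left kernel i, so
                # discard its pending blocks.
                pending[i].clear()
            else:
                pending[i].add(block)
    return smoothed
```

Nested kernels remain correct because each kernel's pending set is cleared whenever its own blocks (or another kernel's) are encountered.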
6 Results
TraceAtlas is an open source tool that can successfully trace an application dynamically, with the primary limitation being the runtime of the program. We observe an average runtime dilation factor of nine and produce one megabyte of trace for every second of execution. The resultant traces can be efficiently analyzed with a greedy algorithm to identify the flavor of kernels of interest to the user. Over 7000 kernels were identified from 392 programs originating from 13 open source projects.
All of the applications traced for this paper were compiled with clang 8.0 and linked with lld 8.0. Zlib 1.2.11 was used for compression. Python 3.6 was used for the kernel detection algorithms. Table II enumerates the collection of libraries and benchmarks used for tracing, the applications using that library, the number of kernels detected, and the version of the library or benchmark. The libraries were selected to sample a variety of general kernel-based tools.
The applications were run on Xeon E5-2650 processors with the trace data being written to an Intel SSD DC S3500 with 1TB of storage. Each application was executed nine times and the median value was reported to filter out noise. Each sample was given 48 hours to execute and was allowed to consume an entire terabyte of storage. The application samples that exceeded these limits were canceled and are not present in Table I or Figures 8 and 9.
Table II (excerpt): Dhrystone and Whetstone — 3 applications, 183 kernels.
6.2 Motivating Use: Producer-Consumer Pipeline Extraction
With kernels fully discovered, the same trace as before can be used to analyze real memory dependencies between kernels. By tracking loads and stores, it is possible to identify which kernel instances wrote to and read from a particular address.
A producer-consumer pair occurs when a consumer reads data that a producer has stored. Specifically, the producer executes a store that writes a value to a specific address; that address is now most recently written by that kernel instance, and any kernel instance that later loads from that address is a consumer of that instance.
Detecting the memory dependencies between kernels can be done with a small memory overhead bounded by the memory used during the original computation. Streaming through the trace, kernel entries and exits delimit individual kernel instances. For every store, the algorithm writes the current instance ID to a dictionary keyed by address. Every load then checks whether its address is in the dictionary; if it is, the current kernel instance has read from the instance recorded there.
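A minimal Python sketch of this dictionary-based pass follows. The event-record format for the trace and all names are illustrative assumptions; the real tool works directly on its encoded trace:

```python
def find_producer_consumer_pairs(trace):
    """Sketch of producer-consumer detection over a parsed trace.

    trace: iterable of (event, value) records, where event is one of
    'enter'/'exit' (value = kernel ID) or 'store'/'load' (value = address).
    Returns a set of (producer_instance, consumer_instance) pairs, where an
    instance is (kernel_id, instance_index).
    """
    last_writer = {}       # address -> instance that most recently stored it
    instance_counts = {}   # kernel ID -> number of instances seen so far
    pairs = set()
    current = None         # the kernel instance we are currently inside
    for event, value in trace:
        if event == "enter":
            n = instance_counts.get(value, 0)
            instance_counts[value] = n + 1
            current = (value, n)
        elif event == "exit":
            current = None
        elif event == "store":
            # This instance is now the most recent writer of the address.
            last_writer[value] = current
        elif event == "load":
            producer = last_writer.get(value)
            if producer is not None and producer != current:
                pairs.add((producer, current))
    return pairs
```

The dictionary holds one entry per live address, so its footprint is bounded by the memory the original computation touched.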
A limited top-level producer-consumer pipeline extractor tool has been developed and is available with TraceAtlas. This was used to aid in the verification of the kernels detected by TraceAtlas and to generate many of the figures in this work.
There are many uses of kernels with a producer-consumer graph that are not explored in this work. With the producer-consumer graph, kernels can be fed as input into static analysis tools like Polly to improve performance. Furthermore, a compiler could transform the detected kernels into code compatible with other kernel libraries to perform fusing in TACO or scheduling in Halide. In short, the kernels that this tool detects can easily be fed into other tools for additional analysis or further performance gains with no expert programmer involvement.
6.3 Dynamic Tracing
Our techniques reduced trace size by a factor of fifty relative to naive techniques and trace time by a factor of two relative to Zlib compression. Each strategy targeted a different problem and achieved a speedup in its own domain, and often in others as well. The net effect of these optimizations is a tracing technique that runs within a reasonable time dilation.
Two variants of the TraceAtlas tool have been developed: one that traces the path of an execution and another that also traces the memory address of every load and store; the latter is the only technique in Table I that encodes address information. Each successive optimization also enables additional applications to be traced by reducing the performance overhead. Clustered basic block dumping, for example, allowed some shorter C++ programs in Eigen to be traced. The current version of TraceAtlas has no known limitations beyond supporting only single-threaded applications and the performance overhead of tracing.
Figure 8 shows that TraceAtlas, as a rule, produces smaller traces due to the information being compressed statically before being fed to Zlib. Occasionally, raw Zlib compression and IO clustering will produce a smaller trace, likely because the method Zlib uses to compress the trace is more efficient than our current encoding scheme. For additional details, see section 8. Table I shows that on average, TraceAtlas produces a trace that is half the size of what is produced by raw Zlib compression.
Figure 9 shows that TraceAtlas performs above average across the space; moreover, once tracing takes more than a second, TraceAtlas is significantly faster. For traces taking TraceAtlas more than a second, it had a time dilation factor of 10.18 while Zlib compression had a factor of 265.5, making TraceAtlas roughly twenty-six times faster than Zlib compression for larger applications. For additional analysis see section 8.
6.4 Kernel Extraction Algorithm
The greedy algorithm described in section 5.1 has three tuning parameters: the window radius, the probabilistic sum threshold, and the minimum iteration count for code to be deemed hot. In the following graphs, these values were set to a radius of 7, a threshold of 0.95, and a hot code threshold of 512 iterations unless otherwise specified. The “Ratio of Traces Explained” in Figures 10, 11, and 12 refers to the number of basic blocks within the traces that are contained within our kernels, divided by the total number of basic blocks across all traces.
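A hypothetical sketch of how this ratio could be computed, treating each trace as a list of basic-block IDs and each kernel as a set of block IDs (names and format assumed for illustration):

```python
def explained_ratio(traces, kernels):
    """Fraction of basic-block entries in the traces that fall inside
    any detected kernel (illustrative sketch, not the paper's code)."""
    # Union of all blocks claimed by any kernel.
    kernel_blocks = set().union(*kernels)
    total = explained = 0
    for trace in traces:
        for block in trace:
            total += 1
            if block in kernel_blocks:
                explained += 1
    return explained / total if total else 0.0
```

For example, two traces `[1, 2, 3, 4]` and `[1, 2]` with a single kernel `{1, 2}` yield a ratio of 4/6.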
Ideally, we would like to extract as many kernels as possible while simultaneously explaining as much of the program as possible; however, different values for these parameters will cause kernels to be fused, lowering the kernel count while increasing the explained trace ratio. Depending on the specific application and use case, different values may be required. The optimal values found for our corpus are marked with an asterisk in Figures 10, 11, and 12 for ease of identification.
The algorithmic threshold refers to the minimum likelihood for the greedy algorithm to terminate. Figure 10 shows that the number of raw and smoothed kernels detected decreases approximately linearly, while the trace ratio experiences a significant jump at a particular probability, which we deem the minimum recommended value for this parameter. Higher values will continue to improve the explained trace ratio, but kernels will continue to fuse, potentially obscuring kernels that may be of interest.
The window radius in Figure 11 refers to the maximum distance between two basic blocks for them to be deemed adjacent. Figure 11 demonstrates a rapid fusion of kernels as the window radius grows from one to four before leveling out. The ratio of basic blocks explained grows similarly quickly in this same region, but trails off once the radius grows to nine. Given this behavior, the optimal window radius lies between five and eight, with all of these values achieving similar performance.
The hot code threshold refers to the minimum number of times a piece of code must execute to be deemed a kernel. This parameter is more a matter of user choice, where the code's behavior informs the decision. Larger applications running across large data sets will achieve more succinct kernels at higher thresholds, but those thresholds will also potentially explain less of the overall application. Figure 12 shows that lower thresholds do explain more of the program with a higher number of kernels, but, depending on the nature of the code, those kernels may not be of interest to the user. Given these curves we find a threshold of 256-512 to be optimal.
7 Case Studies
To validate our approach, we identified four categories used to write or call kernels in modern code: loop based grammars, recursive function calls, kernel libraries such as FFTW, and pipeline libraries such as OpenCV. Loop based grammars are the most basic type of kernel and are most easily represented as a simple for-loop or while-loop. Recursive function calls are another common technique for custom kernels. When a user calls a common kernel such as an FFT, it is often wrapped in a library to simplify the interface. This can be done at a low level, as in FFTW, or at a higher level where a single function call invokes an entire pipeline of kernels, as in OpenCV.
Upon analysis of these four cases, TraceAtlas successfully identified all the kernels expected in our source applications. It also identified other code sections as kernels which were not predicted but emerged as an artifact of the implementation of the kernels in the source code. Upon further analysis, these artifacts are kernels but were not anticipated by the programmer due to abstractions. As a final verification, a real-world radio application was analyzed, and the detected kernels were compared with those predicted by the author.
7.1 For Loops and Recursion
The for loop is the prototypical kernel. It contains two primary components that we need to detect: the loop body and the loop iterator.
Figure 13 contains an example kernel application which averages two adjacent values; the source code is above and the resultant DD-Path below. It iterates over 511 points and executes the kernel on each. When transformed into LLVM IR, a for loop becomes four to five basic blocks: one for the initializer, one for the conditional, one for the body, one for the incrementor, and possibly one for the exit.
The portion of this graph that composes our kernel is the cycle present in Figure 13, formed from the incrementor, conditional, and body. Upon tracing the code, the three basic blocks that compose this cycle are identified as the kernel.
Recursive functions operate in a similar fashion. They utilize identical attributes: a conditional, a body, and an enumerator which functions as the exit condition. Figure 14 contains a kernel that is functionally identical to Figure 13.
Figure 14 shows the modified DD-path of the program. The kernel first enters the body and then exits if the exit condition is true; otherwise it calls the next kernel instance before exiting.
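The two figures can be paraphrased as follows (in Python for brevity; the traced examples are C, and all names here are illustrative): a loop and a recursion that average adjacent values over 511 points and produce identical results.

```python
def average_loop(data, out):
    """Loop form: conditional, body, and incrementor form the cycle."""
    for i in range(511):                       # conditional + incrementor
        out[i] = (data[i] + data[i + 1]) / 2   # kernel body


def average_recursive(data, out, i=0):
    """Recursive form: the same three attributes, with the call
    to the next instance replacing the back-branch."""
    if i >= 511:                               # exit condition
        return
    out[i] = (data[i] + data[i + 1]) / 2       # kernel body
    average_recursive(data, out, i + 1)        # next kernel instance
```

Traced dynamically, both exhibit the same repetitive cycle of basic blocks, which is why TraceAtlas identifies them as the same kind of kernel.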
7.2 Kernel Libraries
Code written within a library has the potential to look like that in Figures 13 and 14; however, a user typically reaches for a library to accomplish something complex or optimized for performance, and both significantly complicate the graph and obfuscate the kernels. Fortunately, dynamic traces still capture every basic block that executes, so the repetitive nature of kernels remains easily identifiable.
Our example is a 512-point 1D FFT scheduled and executed using FFTW, based upon code from their website. Upon kernel extraction we identified seven kernels, whose blocks are listed in Table III. By back-referencing the basic blocks to the LLVM IR, one can identify the primary function of each kernel. Kernel 0 reads in input data for the FFT. Kernels 1 and 2 malloc memory for the working set and the output respectively. Kernels 3 and 4 move data into the buffers for the appropriate preparation. Kernels 5 and 6 are responsible for the body of the FFT and contain the most blocks. This matches what one would expect for an FFT, with additional memory scheduling occurring for performance.
Table III (header): Kernel Index | Basic Blocks Composing the Kernel.
7.3 Complex Application
More complex libraries further abstract away the concept of a kernel down to a single function call which schedules and executes an entire pipeline of kernels. Brisk from OpenCV is an example of this.
Figure 15 contains a subset of the kernels detected in the basic Brisk implementation, showing the kernels that dominate the execution. Red kernels execute earlier in the computation, and green kernels execute later. The primary body of Brisk as described in its paper happens almost entirely within kernel 15, with other kernels handling scheduling, file IO, and buffer management. A single centralized kernel schedules the execution of the other kernels.
The final extraction contained 33 individual kernels spanning 158,000 basic blocks. This shows that although libraries such as OpenCV perform substantial backend computation to schedule a kernel, TraceAtlas is still able to detect the kernels that occur and their temporal ordering.
To verify that the graph in Figure 15 is correct, each of the top-level kernels in the resultant producer-consumer graph were manually analyzed. The parent functions were identified to map them back to the original source. Kernel A is a control kernel which schedules other kernels to execute. Kernels B, C, and D were kernel generation code. E, F, G, H, and I are all responsible for reading in a png file and converting it to accessible buffers. J, K, and L are all responsible for moving memory. Kernel M schedules the kernel to iterate across the color channels, and N performs the actual Brisk computation. Finally, O writes the resultant data to disk.
7.4 Expert Verified Application: OFDM
As a final verification, an OFDM radio system was acquired. The application was analyzed with TraceAtlas, and a top-level producer-consumer pipeline was extracted from the detected kernels; the result is shown in Figure 16.
The result was shown to the original programmer, who was unfamiliar with TraceAtlas, and they were able to successfully label all the top-level kernels that were detected. This shows that all kernels expected by an expert user are properly represented. Additional nested kernels were detected which were not expected by the user; these were found to originate from loops the author had deemed insignificant even though they constituted a large portion of the computation.
This shows that code from the wild can be easily analyzed by TraceAtlas. It successfully identifies kernels in both our input synthetic applications as well as wild code that was not written with TraceAtlas in mind.
TraceAtlas depends upon LLVM IR being available for the source application. This limits the tool to C and C++ projects, with the potential for Fortran through f18 or fc, and for other languages with a custom LLVM frontend. Supporting interpretive languages such as Python and Perl (interpretive here refers to languages that require an external binary to execute; compiled languages with library dependencies such as C# and Java do not apply) is possible if the control binary and supporting libraries are compiled with TraceAtlas injected. Closed source tools can still be used through an IR lifter such as LLVM-mctoll.
The traces produced by TraceAtlas are dynamic application traces. As a result, they will only represent the paths taken in the execution. Unexplored kernels and dead code will not be traced or identified as a kernel in the resultant trace. Care must be taken to ensure that the sample application executes the code of interest and that it executes a sufficient number of times given a user’s hot code threshold. Similarly, dynamic traces still require a significant amount of space and time to analyze; however, it is now viable to trace an application.
Asanovic et al. identified many archetypes of kernels. Of the thirteen archetypes, our kernel definition (section 3) successfully identifies twelve. For kernel analysis, a preferred attribute is that the kernel composes a significant portion of the computation. Combinational logic is either contained within a loop, and therefore detected, or executed sparingly, in which case it does not belong in the same category as kernels. We specifically cannot identify finite state machines. A finite state machine is composed of two different kernels: a state controller and a state handler. Both kernels are detected and properly represented by TraceAtlas, but the finite state machine itself is not detected as a kernel. Multi-threaded parallelism through task queues is believed to be compatible with this framework, but the tool chain lacks thread support, so this aspect is yet unexplored.
When analyzing TraceAtlas performance, some applications achieved better results with plain Zlib compression than with our tool. TraceAtlas writes the output trace as a UTF-8 encoded text file where each line is a key-value pair for the point of interest, traditionally the basic block ID. For some applications, it is plausible that Zlib finds repetition in the raw trace more easily than in TraceAtlas's encoded trace. On average, encoding is advantageous and becomes mandatory for larger traces, but this variability within Zlib results in some traces being smaller, or even produced more quickly, than those of TraceAtlas. This could potentially be fixed by changing the encoding of the trace to a more succinct binary format such that Zlib more easily detects repetition within the trace.
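The contrast can be sketched as follows. The key-value line format and the fixed-width binary alternative below are both illustrative assumptions about the encoding, not TraceAtlas's actual formats:

```python
import struct
import zlib


def encode_text(block_ids):
    """One UTF-8 key-value line per basic-block entry (assumed format)."""
    return "".join(f"BasicBlock:{b}\n" for b in block_ids).encode("utf-8")


def encode_binary(block_ids):
    """A more succinct alternative: fixed-width 32-bit little-endian
    records, giving Zlib's matcher byte-aligned repetition."""
    return b"".join(struct.pack("<I", b) for b in block_ids)


ids = [17, 42, 17, 42] * 1000
text_trace = zlib.compress(encode_text(ids))
binary_trace = zlib.compress(encode_binary(ids))
```

Even before compression, the binary form is smaller per record, and its fixed stride makes repeated block sequences easier for a dictionary compressor to detect.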
We have demonstrated that dynamic traces are easily generated from application sources. This was tested against 16 libraries and 10,507 individual kernel instances. Tracing code can be injected efficiently, and tracing introduces a time dilation factor of only nine. The resultant trace produces only one megabyte of data per second of execution. All of this makes dynamic trace tools an inexpensive and plausible solution to many problems.
We have also demonstrated a log-space algorithm for analyzing the trace that can detect kernels within the source application. This technique detects all kernels meeting our definition, and we have applied it successfully to hundreds of input applications. We have also demonstrated the resultant data generated from the most common kernel execution types.
These two techniques combined allow for easy analysis and extraction of kernels from the source code based upon its dynamic execution. The resultant kernels can then be extracted and optimized with external tools to achieve even better performance improvements with no user interaction.
This material is based on research sponsored by Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) under agreement number FA8650-18-2-7860. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) or the U.S. Government.
The author would like to thank Seth Abraham, Umit Ogras, and Carole Wu for helping edit this paper. The author would also like to thank Liangliang Chang, Mukul Gupta, Vamsi Lanka, Sriharsha Uppu, and Benjamin Willis for helping write input applications for the application corpus.
-  H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” IEEE Micro, vol. 32, no. 3, pp. 122–134, 2012.
-  R. Hameed, W. Qadeer, M. Wachs, O. Azizi, A. Solomatnikov, B. C. Lee, S. Richardson, C. Kozyrakis, and M. Horowitz, “Understanding sources of inefficiency in general-purpose chips,” ACM SIGARCH Computer Architecture News, vol. 38, no. 3, pp. 37–47, 2010.
-  S. Kim and J. C. Browne, “General approach to mapping of parallel computations upon multiprocessor architectures,” in Proceedings of the International Conference on Parallel Processing, vol. 3, pp. 1–8, 1988.
-  J. Sugerman, K. Fatahalian, S. Boulos, K. Akeley, and P. Hanrahan, “Gramps: A programming model for graphics pipelines,” ACM Trans. Graph., vol. 28, pp. 4:1–4:11, Feb. 2009.
-  K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, et al., “The landscape of parallel computing research: A view from berkeley,” tech. rep., Technical Report UCB/EECS-2006-183, EECS Department, University of …, 2006.
-  “Nvidia cuda compute unified device architecture.”
-  J. Ragan-Kelley, A. Adams, D. Sharlet, C. Barnes, S. Paris, M. Levoy, S. Amarasinghe, and F. Durand, “Halide: Decoupling algorithms from schedules for high-performance image processing,” Commun. ACM, vol. 61, pp. 106–115, Dec. 2017.
-  T. Grosser, H. Zheng, R. Aloor, A. Simbürger, A. Größlinger, and L.-N. Pouchet, “Polly-polyhedral optimization in llvm,” in Proceedings of the First International Workshop on Polyhedral Compilation Techniques (IMPACT), vol. 2011, p. 1, 2011.
-  D. Lehmann and M. Pradel, “Wasabi: A framework for dynamically analyzing webassembly,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’19, (New York, NY, USA), pp. 1045–1058, ACM, 2019.
-  M. D. Ernst, “Static and dynamic analysis: Synergy and duality,” in WODA 2003: ICSE Workshop on Dynamic Analysis, pp. 24–27, New Mexico State University Portland, OR, 2003.
-  S. Dissegna, F. Logozzo, and F. Ranzato, “Tracing compilation by abstract interpretation,” SIGPLAN Not., vol. 49, pp. 47–59, Jan. 2014.
-  C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, “Pin: Building customized program analysis tools with dynamic instrumentation,” in Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’05, (New York, NY, USA), pp. 190–200, ACM, 2005.
-  M. A. Holliday and C. S. Ellis, “Accuracy of memory reference traces of parallel computations in trace-drive simulation,” IEEE Transactions on Parallel and Distributed Systems, vol. 3, pp. 97–109, Jan 1992.
-  J. Zhai, T. Sheng, J. He, W. Chen, and W. Zheng, “Efficiently acquiring communication traces for large-scale parallel applications,” IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 11, pp. 1862–1870, 2011.
-  B. Girish, “Greendroid: Energy efficient mobile application processor using android system,” 2019.
-  E. Slaughter, Regent: A high-productivity programming language for implicit parallelism with logical regions. PhD thesis, Ph. D. dissertation, Stanford University, 2017.
-  K. Angstadt, J. Wadden, W. Weimer, and K. Skadron, “Portable programming with rapid,” IEEE Transactions on Parallel and Distributed Systems, vol. 30, pp. 939–952, April 2019.
-  R. Hoque, T. Herault, G. Bosilca, and J. Dongarra, “Dynamic task discovery in parsec: A data-flow task-based runtime,” in Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA ’17, (New York, NY, USA), pp. 6:1–6:8, ACM, 2017.
-  W. Thies, M. Karczmarek, and S. Amarasinghe, “Streamit: A language for streaming applications,” in International Conference on Compiler Construction, pp. 179–196, Springer, 2002.
-  Y. S. Shao, B. Reagen, G.-Y. Wei, and D. Brooks, “Aladdin: A pre-rtl, power-performance accelerator simulator enabling large design space exploration of customized architectures,” SIGARCH Comput. Archit. News, vol. 42, pp. 97–108, June 2014.
-  W. Lee, E. Slaughter, M. Bauer, S. Treichler, T. Warszawski, M. Garland, and A. Aiken, “Dynamic tracing: Memoization of task graphs for dynamic task-based runtimes,” in SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 441–453, Nov 2018.
-  E. Schkufza, R. Sharma, and A. Aiken, “Stochastic program optimization,” Commun. ACM, vol. 59, pp. 114–122, Jan. 2016.
-  E. Deniz and A. Sen, “Using machine learning techniques to detect parallel patterns of multi-threaded applications,” International Journal of Parallel Programming, vol. 44, pp. 867–900, Aug 2016.
-  Z. Wang and M. O’Boyle, “Machine learning in compiler optimization,” Proceedings of the IEEE, vol. 106, pp. 1879–1901, Nov 2018.
-  K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, et al., “A view of the parallel computing landscape,” Communications of the ACM, vol. 52, no. 10, pp. 56–67, 2009.
-  A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter, “The scalable heterogeneous computing (shoc) benchmark suite,” in Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU-3, (New York, NY, USA), pp. 63–74, ACM, 2010.
-  M. Galassi, J. Davies, J. Theiler, B. Gough, G. Jungman, P. Alken, M. Booth, F. Rossi, and R. Ulerich, GNU scientific library. Network Theory Limited, 2002.
-  M. Frigo and S. Johnson, “Fftw: fast fourier transform library,” URL http://www.fftw.org/, 2005.
-  G. Guennebaud, B. Jacob, et al., “Eigen v3.” http://eigen.tuxfamily.org/, 2010.
-  G. Bradski and A. Kaehler, Learning OpenCV: Computer vision with the OpenCV library. O’Reilly Media, Inc., 2008.
-  J. Gaeddert, “Liquid dsp-software-defined radio digital signal processing library.” https://liquidsdr.org/.
-  F. Bellard and M. Niedermayer, “Ffmpeg,” [Online], 2012.
-  K. Barker, T. Benson, D. Campbell, D. Ediger, R. Gioiosa, A. Hoisie, D. Kerbyson, J. Manzano, A. Marquez, L. Song, N. Tallent, and A. Tumeo, PERFECT (Power Efficiency Revolution For Embedded Computing Technologies) Benchmark Suite Manual. Pacific Northwest National Laboratory and Georgia Tech Research Institute, December 2013. http://hpc.pnnl.gov/projects/PERFECT/.
-  C. Sanderson and R. Curtin, “Armadillo: a template-based c++ library for linear algebra,” Journal of Open Source Software, vol. 1, no. 2, p. 26, 2016.
-  W. Thies and S. Amarasinghe, “An empirical characterization of stream programs and its implications for language and compiler design,” in 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 365–376, IEEE, 2010.
-  A. Buluç, T. Mattson, S. McMillan, J. Moreira, and C. Yang, “Design of the graphblas api for c,” in 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 643–652, IEEE, 2017.
-  A. Ltd, “mbed tls,” 2015.
-  P. Karn, “Fec library version 3.0.1,” 2007.
-  “DAW, Netlib Library.” http://www.netlib.org/benchmark/.
-  S. Thomas, C. Gohkale, E. Tanuwidjaja, T. Chong, D. Lau, S. Garcia, and M. B. Taylor, “Cortexsuite: A synthetic brain benchmark suite,” 2014 IEEE International Symposium on Workload Characterization (IISWC), pp. 76–79, 2014.
-  Audiofilter, “spuce.”
-  F. Kjolstad, S. Kamil, S. Chou, D. Lugato, and S. Amarasinghe, “The tensor algebra compiler,” Proceedings of the ACM on Programming Languages, vol. 1, no. OOPSLA, p. 77, 2017.
-  S. Leutenegger, M. Chli, and R. Siegwart, “Brisk: Binary robust invariant scalable keypoints,” in 2011 IEEE international conference on computer vision (ICCV), pp. 2548–2555, IEEE, 2011.
-  Flang-Compiler, “flang-compiler/f18,” Nov 2019.
-  Compiler-Tree-Technologies, “compiler-tree-technologies/fc,” Aug 2019.
-  Microsoft, “microsoft/llvm-mctoll,” Jun 2019.