Memory and Parallelism Analysis Using a Platform-Independent Approach

04/18/2019
by Stefano Corda, et al. (TU Eindhoven, Ericsson)

Emerging computing architectures such as near-memory computing (NMC) promise improved performance for applications by reducing the data movement between CPU and memory. However, detecting such applications is not a trivial task. In this ongoing work, we extend the state-of-the-art platform-independent software analysis tool with NMC related metrics such as memory entropy, spatial locality, data-level, and basic-block-level parallelism. These metrics help to identify the applications more suitable for NMC architectures.

1. Introduction

With the demise of Dennard scaling and the slowing of Moore's law, computing performance is hitting a plateau (Esmaeilzadeh et al., 2011). Furthermore, memory and processor technology have improved at different rates, a phenomenon infamously termed the memory wall (Wulf and McKee, 1995). Additionally, the current big-data era, where data is generated in massive amounts across multiple domains, has created a demand for novel memory-centric designs rather than conventional compute-centric designs (Singh et al., 2018).

Therefore, it has become even more crucial for computer designers to understand the characteristics of these emerging applications in order to optimize future systems for their target workloads. Among the different approaches that have been used in the past for application characterization, a micro-architecture-independent approach provides more relevant workload characteristics than, for example, hardware performance counters. In this scope, the platform-independent software analysis tool PISA (Anghel et al., 2015) was developed. PISA extracts results in a truly micro-architecture-agnostic manner by utilizing the Intermediate Representation (IR) of the LLVM compiler framework. We extend the capabilities of PISA to extract NMC-related characteristics.

The rest of the paper is organized as follows: Section 2 presents background information on the tool and related work. In Section 3 we describe the characterization metrics we embedded into PISA. In Section 4 we show and discuss the characterization results. Finally, Section 5 concludes the paper.

2. Background and Related Work

PISA is based on the LLVM compiler framework. It uses an intermediate representation (IR), generated from the application source using a clang front-end, to represent the application code in a generic way. This IR is independent of the target architecture and has a RISC-like instruction set. These features can be used to perform application analysis or optimization using the opt tool. LLVM's IR has a hierarchical structure: a basic-block consists of instructions and represents a single-entry, single-exit section of code; a function is a set of basic-blocks; and a module represents the application and contains functions and global variables.

Figure 1. Overview of the Platform-Independent Software Analysis Tool (Anghel et al., 2015).

PISA’s architecture is shown in Figure 1. Initially, the application source code, e.g. C/C++ code, is translated into the LLVM’s IR. PISA exploits the opt tool to perform LLVM’s IR optimizations and to perform the instrumentation process using an LLVM pass. This process is done by inserting calls to the external analysis library throughout the application’s IR. The last step consists of a linking process that generates a native executable. On running this executable, we can obtain analysis results for specified metrics in JSON format. PISA can extract metrics such as instruction mix, branch entropy, data reuse distance, etc.

The analysis reconstructs and analyzes the program's instruction flow. This is possible because the analysis library is aware of the single entry and exit point of each basic-block. All the instructions contained in a basic-block are analyzed using the external library methods. Moreover, PISA supports the MPI and OpenMP standards, allowing the analysis of multi-threaded and multi-process applications. The tool's overhead depends on the analysis performed: on average the execution time increases by two to three orders of magnitude compared to the non-instrumented code. However, since the analysis is target-independent, it has to be performed only once per application and dataset.

Considerable effort has already been spent on realizing platform-independent characterization tools. Cabezas (Cabezas, 2012) proposed a tool that can extract different features from workloads, but it has significant limitations: the compiler community no longer supports the LLVM interpreter it relies on, and the target applications must be single-threaded. Another tool was developed by Shao et al. (Sophia Shao and Brooks, 2013). It can extract interesting metrics such as memory entropy and branch entropy. However, it is based on the ILDJIT IR (just-in-time compilation), which has compatibility problems with OpenMP and MPI, thus limiting the tool to sequential applications. The state-of-the-art tool in workload characterization, PISA, was presented by Anghel et al. (Anghel et al., 2015). PISA can analyze multi-threaded applications, supporting the OpenMP and MPI standards, and can extract metrics such as instruction mix, branch entropy, and data reuse distance. We extend PISA with metrics directed towards NMC: memory entropy, spatial locality, data-level parallelism, and basic-block-level parallelism.

3. Characterization Metrics

In this section we present the metrics we integrated into PISA. We focus on the memory behavior, which is essential to decide whether an application should be accelerated with an NMC architecture, and on the parallelism behavior, which is crucial to decide whether a specific parallel architecture should be integrated into an NMC system.

3.1. Memory entropy

The first metric related to memory behavior that we added is the memory entropy, which measures the randomness of the accessed memory addresses. A high memory entropy implies a higher cache-miss ratio, so the application may benefit from 3D-stacked memory because of the volume of data moved from main memory to the caches. In information theory, Shannon's formula (Shannon, 1951) is used to capture entropy.

We embed in PISA the formula defined by Yen et al. (Yen et al., 2008), who applied Shannon's definition to memory addresses: $H(X) = -\sum_{i=1}^{N} p(x_i) \log_2 p(x_i)$, where $X$ is an $n$-bit random variable, $p(x_i)$ is the occurrence probability of the value $x_i$, and $N$ is the number of values that $X$ can take. $p(x_i)$ is defined by $p(x_i) = n_i / \sum_{j=1}^{N} n_j$, where $n_i$ is the number of occurrences of $x_i$.

In the last formula the addresses are represented as $x_1, \ldots, x_N$, where $N$ is the number of different addresses accessed during the execution. Each address is in the range $[0, 2^n - 1]$, where $n$ is the length of the address in bits. If every address has the same occurrence probability the entropy is $\log_2 N$; if only one address is accessed the entropy is $0$; otherwise the entropy lies between $0$ and $\log_2 N$. The memory entropy metric does not distinguish whether the accesses contain sequential patterns or random accesses. Therefore we need additional metrics, like spatial locality.
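As a concrete illustration, the following minimal Python sketch computes the entropy of an address trace as defined above. It is not PISA's implementation; the trace format and the drop_lsbs parameter (used to model coarser access granularities, as in the results of Section 4) are illustrative assumptions.

```python
from collections import Counter
from math import log2

def memory_entropy(addresses, drop_lsbs=0):
    """Shannon entropy of a memory-address trace (after Yen et al., 2008).

    drop_lsbs cuts the least-significant bits of each address to model a
    coarser access granularity (e.g. 2 bits ~ 4-byte words), mirroring the
    LSB-cutting used in the characterization results of Section 4.
    """
    counts = Counter(addr >> drop_lsbs for addr in addresses)
    total = sum(counts.values())
    # H = -sum_i p_i * log2(p_i): equals log2(#distinct addresses) when all
    # addresses are equally likely, and 0 when a single address is accessed.
    return -sum((c / total) * log2(c / total) for c in counts.values())

# toy trace: a strided sweep followed by repeated hits on one address
trace = list(range(0x1000, 0x1100, 4)) + [0x1000] * 64
print(memory_entropy(trace), memory_entropy(trace, drop_lsbs=2))
```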

3.2. Data reuse distance for multiple cache-line sizes and spatial locality

Data reuse distance or data temporal reuse (DTR) is a helpful metric to detect cache inefficiencies. The DTR of an address is the number of unique addresses accessed since the last reference to the requested data. This metric is present in the default framework; however, the tool could compute it only for a fixed cache-line size, which represents the address granularity. We extend the DTR computation to cache-line sizes ranging from the word size up to a value selected by the user. This broadens the available analyses; e.g., we use it to compute the spatial locality metric.
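To make the extended DTR computation concrete, here is a simple Python sketch that builds a logarithmically binned DTR histogram for one cache-line size. PISA's internal data structures and bin layout may differ; the O(n)-per-access LRU stack is used only for clarity.

```python
def reuse_distance_histogram(addresses, line_size, num_bins=32):
    """Histogram of data temporal reuse (DTR) at a given cache-line size.

    The DTR of an access is the number of unique cache lines touched since
    the previous access to the same line (LRU stack distance). Distances are
    binned logarithmically: bin 0 = {0}, bin i = [2^(i-1), 2^i) for i >= 1.
    Cold (first-time) accesses are not counted in this sketch.
    """
    stack = []                     # cache lines ordered by recency (most recent first)
    histogram = [0] * num_bins
    for addr in addresses:
        line = addr // line_size
        if line in stack:
            depth = stack.index(line)          # unique lines since the last reference
            stack.remove(line)
            histogram[min(depth.bit_length(), num_bins - 1)] += 1
        stack.insert(0, line)
    return histogram

# Running the same trace at increasing line sizes (from the word size up to a
# user-selected limit) yields the per-line-size histograms used below for the
# spatial locality score.
```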

Spatial locality, which measures the probability of accessing nearby memory locations, can be derived from DTR. We extend PISA with a spatial locality score inspired by Gu et al. (Gu et al., 2009). The key idea behind this score is to detect a reduction in DTR when the cache-line size is doubled. To estimate the spatial locality of a program, two elements are fundamental: 1) histograms of data reuse distance for different cache-line sizes, and 2) distribution maps that keep track of the change in DTR of each access when the cache-line size is doubled. The histograms are used to compute the DTR distribution probability for the different cache-line sizes.

In (Gu et al., 2009) the reuse signature is defined as a pair $\langle B, P \rangle$, where $B$ is a series of consecutive DTR ranges (bins), represented as $B = \{b_1, b_2, \ldots, b_m\}$. These bins follow a logarithmic progression, $b_i = [2^{i-1}, 2^i)$. $P = \{p_1, p_2, \ldots, p_m\}$ collects the distribution probabilities $p_i$ of the bins $b_i$. This reuse signature is used later to normalize the results.

The next step consists of building a distribution map, which keeps track of the change in DTR of every access. The distribution map has rows representing the bins for a cache-line size $L$ and columns representing the bins for the doubled cache-line size $2L$. Each cell $c_{ij}$ is the probability that an access in bin $b_i$ with cache-line size $L$ moves to bin $b_j$ with cache-line size $2L$. Differently from (Gu et al., 2009), we compute the sum of the cells in a row where $j < i$, because we want to capture all the changes in data reuse distance. The spatial locality score for the bin $b_i$ is therefore $SL_i = \sum_{j < i} c_{ij}$.

To compute the spatial locality score related to a pair of cache-line sizes $(L, 2L)$, we first compute the absolute value of the weighted sum that uses the probabilities included in the reuse signature, $S_{(L,2L)} = \left|\sum_i p_i \, SL_i\right|$, and then use the formula proposed by (Gu et al., 2009) to calculate the total score, which is the logarithmically weighted sum of these absolute values over all cache-line-size pairs.

The weighted score gives more importance to pairs of smaller cache-line sizes. This can be interpreted as a higher relevance of these smaller pairs, because larger cache-line sizes imply massive data transfers. Usually, applications with low spatial locality perform poorly on traditional systems with cache hierarchies, because only a small portion of the data loaded from main memory into the caches is actually utilized.
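The sketch below strings these pieces together under stated assumptions: it reuses the LRU-stack binning from the previous listing, sums the distribution-map cells with $j < i$ as described above, and, since the exact weighting is not reproduced here, weights the pair $(L, 2L)$ by $1/2^k$ as a stand-in for the logarithmic weighting (an illustrative choice, not necessarily the one used in PISA).

```python
from collections import defaultdict

def per_access_bins(addresses, line_size):
    """Logarithmic DTR bin of every access (None for cold accesses)."""
    stack, bins = [], []
    for addr in addresses:
        line = addr // line_size
        if line in stack:
            depth = stack.index(line)
            stack.remove(line)
            bins.append(depth.bit_length())   # bin 0 = {0}, bin i = [2^(i-1), 2^i)
        else:
            bins.append(None)
        stack.insert(0, line)
    return bins

def spatial_locality_pair(addresses, line_size):
    """Score for the pair (L, 2L): probability-weighted fraction of reuses
    whose DTR falls to a lower bin when the cache-line size is doubled."""
    bins_l = per_access_bins(addresses, line_size)
    bins_2l = per_access_bins(addresses, 2 * line_size)
    cells = defaultdict(lambda: defaultdict(int))        # distribution-map counts
    for i, j in zip(bins_l, bins_2l):
        if i is not None and j is not None:
            cells[i][j] += 1
    total = sum(sum(row.values()) for row in cells.values())
    score = 0.0
    for i, row in cells.items():
        row_total = sum(row.values())
        p_i = row_total / total                          # reuse-signature probability of bin i
        sl_i = sum(c for j, c in row.items() if j < i) / row_total
        score += p_i * sl_i
    return abs(score)

def spatial_locality(addresses, word_size=4, max_line=128):
    """Total score over (L, 2L) pairs, weighting smaller line sizes more
    (here with 1/2^k weights as a stand-in for the logarithmic weighting)."""
    total, line, k = 0.0, word_size, 1
    while 2 * line <= max_line:
        total += spatial_locality_pair(addresses, line) / 2 ** k
        line, k = 2 * line, k + 1
    return total
```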

3.3. Data-level parallelism

Data-level parallelism (DLP) measures the average length of the vector instructions that could be used to optimize a program. DLP is of interest for NMC when specific SIMD processing units are employed in the logic layer of the 3D-stacked memory.

PISA can extract the instruction-level parallelism (ILP) over all instructions (see Figure 2, CFG on the left) and additionally per instruction category, such as control, memory, etc. (see Figure 2, CFG in the center). As shown in the CFG on the right of Figure 2, we extract the ILP score per opcode and call it $ILP_{opcode}$, where opcode can be load, store, add, etc. This metric represents the number of instructions with the same opcode that could run in parallel. Next, we compute DLP as the weighted average of $ILP_{opcode}$ over all opcodes. The weights are the frequencies of the opcodes, calculated by dividing the number of instructions per opcode by the total number of instructions.

As the register allocation step is not performed at the level of the intermediate representation, it is not possible to take register consecutiveness into account in this score. However, we want to expose the optimization opportunities for compilers by distinguishing whether load/store instruction addresses are consecutive. We represent this with two scores: one computed without address consecutiveness and one that takes address consecutiveness into account. To compute them we use the previous formula, changing the $ILP_{opcode}$ value for loads and stores.
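A small Python sketch of the weighted-average computation follows. The dictionaries of per-opcode counts and per-opcode ILP scores are an illustrative interface, since PISA derives both internally from the instrumented LLVM IR.

```python
def data_level_parallelism(opcode_counts, opcode_ilp):
    """DLP as the weighted average of per-opcode ILP scores.

    opcode_counts: dynamic instruction count per opcode, e.g. {'load': 1200, ...}
    opcode_ilp:    ILP_opcode extracted from the per-opcode CFG (Figure 2, right).
    The weight of each opcode is its share of the total instruction count;
    recomputing ILP_opcode for loads/stores with or without address
    consecutiveness gives the two DLP variants described above.
    """
    total = sum(opcode_counts.values())
    return sum(
        (count / total) * opcode_ilp[op]   # DLP = sum_op (N_op / N_total) * ILP_op
        for op, count in opcode_counts.items()
    )

# hypothetical numbers for a small kernel
counts = {"load": 1200, "store": 400, "fadd": 800, "fmul": 800}
ilp_per_opcode = {"load": 6.0, "store": 2.0, "fadd": 4.0, "fmul": 4.0}
print(data_level_parallelism(counts, ilp_per_opcode))
```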

Figure 2. The control-flow graph (CFG) usually used to compute ILP (left); the CFG used to compute the ILP per instruction type (center); our per-opcode specialized CFG (right).

3.4. Basic-block level parallelism

A basic-block is the smallest component in the LLVM’s IR that can be considered as a potential parallelizable task. Basic-block level parallelism (BBLP) is a potential metric for NMC because it can estimate the task level parallelism in the application. The parallel tasks can be offloaded to multiple compute units located on the logic layer of a 3D-stacked memory.

To estimate BBLP in a workload, we develop a metric similar to ILP and DLP. It is based on the assumption that a basic-block, which is a set of instructions, can only be executed sequentially. Since loop-index updates could put an artificially tight constraint on the parallelism, we assume two different basic-block scheduling approaches (see Figure 3): 1) all the dependencies between basic-blocks are considered; 2) a smart scheduling is used, assuming a compiler that can optimize away loop-index-update dependencies. The difference between the two approaches gives an idea, as in the DLP case, of the optimization opportunities for compilers. We compute the two scores derived from the two scheduling options as $BBLP = N_{instr} / C_{last}$, where $C_{last}$ is the cycle of the last executed instruction under the given scheduling approach (red numbers in Figure 3(b,c)) and $N_{instr}$ is the total number of instructions (see Figure 3.a).

Figure 3. BBLP/PBBLP methodology: a) example of an LLVM dynamic trace; b) real scheduling for the BBLP computation, taking into account all dependencies; c) simplified scheduling for the BBLP computation, ignoring dependencies such as loop-index updates (in a), the dependency between instructions 15 and 17); d) PBBLP values for each basic block (the second and third blocks are instances of a repeated basic block; since there is only a loop-index dependency, the PBBLP is equal to 2).
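The following sketch shows how the two BBLP variants could be computed from a dynamic trace. The tuple format is an assumption made for illustration (PISA reconstructs dependencies from the instrumented LLVM IR), and instructions inside a basic-block instance are serialized to reflect the sequential-execution assumption above.

```python
def bblp(trace, ignore_index_deps=False):
    """BBLP = N_instr / C_last for a dynamic trace (a sketch, not PISA's code).

    trace: list of (instr_id, bb_instance, deps, is_index_update) tuples in
    program order, where deps are the instr_ids this instruction depends on.
    With ignore_index_deps=True, dependencies on loop-index updates are
    dropped, mimicking the "smart scheduling" variant of Figure 3(c).
    """
    finish = {}                                    # instr_id -> completion cycle
    last_in_block = {}                             # bb_instance -> last instruction seen
    index_updates = {i for i, _, _, is_idx in trace if is_idx}
    for instr, bb, deps, _ in trace:
        if ignore_index_deps:
            deps = [d for d in deps if d not in index_updates]
        if bb in last_in_block:                    # serialize within a basic-block instance
            deps = list(deps) + [last_in_block[bb]]
        finish[instr] = 1 + max((finish[d] for d in deps), default=0)
        last_in_block[bb] = instr
    return len(trace) / max(finish.values())
```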

We also aim to estimate the presence of data-parallel loops. Data-parallel loops consist of basic-blocks that are repeated without any dependencies among their instances. A fast and straightforward estimation can be done by assigning to each basic-block a value between $1$ and its number of instances. When a basic-block has only one instance, or all its instances have dependencies among them, the score is $1$. When none of its instances have dependencies among them, the value is maximal and equal to the number of instances. Otherwise, the score lies within the range described above. Two further assumptions are made: we skip loop-index-update dependencies and omit basic-blocks that are used only for index updates.

After assigning a score $PBBLP_{bb}$ to each basic-block, we compute the weighted average over all scores: $PBBLP = \sum_{bb} \frac{N_{bb}}{N_{total}} \, PBBLP_{bb}$, where the weight of a basic-block is the frequency of its instances, i.e. its number of instances $N_{bb}$ divided by the total number of instances $N_{total}$. Since this metric is an estimation, we call it potential basic-block-level parallelism (PBBLP).
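As a sketch of the PBBLP aggregation (with a hypothetical per-block interface, since the per-block scores come from PISA's dependency analysis of the dynamic trace):

```python
def potential_bblp(bb_instances, bb_scores):
    """PBBLP as the weighted average of per-basic-block scores.

    bb_instances: number of dynamic instances of each basic-block
    bb_scores:    PBBLP_bb per block, between 1 (single instance, or all
                  instances dependent) and the instance count (fully
                  independent instances, i.e. a data-parallel loop body)
    """
    total = sum(bb_instances.values())
    return sum(
        (n / total) * bb_scores[bb]        # PBBLP = sum_bb (N_bb / N_total) * PBBLP_bb
        for bb, n in bb_instances.items()
    )

# hypothetical values in the spirit of Figure 3(d): a loop body repeated twice
# with only a loop-index dependency scores 2, the entry block scores 1
print(potential_bblp({"entry": 1, "loop_body": 2}, {"entry": 1, "loop_body": 2}))
```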

4. Characterization Results

We present the characterization results of selected applications from the PolyBench (Pouchet, 2012) and Rodinia (Che et al., 2009) benchmark suites (see Figure 4), employing the proposed metrics. Memory entropy, in Figure 4.a, is strictly related to the size of the address space accessed by a workload: applications with a larger address space have higher entropy because they access many different addresses. We also plot how the memory entropy changes at different granularities, cutting the least-significant bits (LSBs) of the address to represent larger data-access granularities. Furthermore, for Rodinia's applications we highlight the cut of 2 LSBs because they access integers (4-byte locations). We notice that applications like bp and gramschmidt have higher entropy values and should benefit from NMC architectures. In contrast, the other applications have similar values, except for cholesky, bfs, and kmeans.

Regarding memory behavior, Figure 4.b shows the spatial locality of the workloads. As expected, we can distinguish different behaviors among the benchmarks. bp and gramschmidt show an interesting combination of high entropy and low spatial locality; for instance, in gramschmidt the matrix is accessed by column and diagonally, while it is allocated in row-major order. These applications should be good candidates for NMC because they use a large address space with low locality. The opposite trend is seen for cholesky, whose entropy is among the lowest values while its spatial locality is the highest.

A considerable number of applications show a spatial locality lower than 0.25 and should benefit from NMC systems. However, applications with high spatial locality like cholesky could also benefit from NMC, mostly when the data set grows, and consequently more data is moved off-chip, and when SIMD architectures are exploited.

Figure 4. Application characterization results: (a) Memory Entropy; (b) Spatial Locality; (c) Parallelism.

Figure 4.c shows the parallelism characterization of the workloads. As expected from the Berkeley dwarfs' data-level parallelism analysis (Asanovic et al., 2006), matrix-multiplication-based algorithms show the highest values. Moreover, the difference between the two proposed DLP scores is very limited; only small variations can be noticed, for instance in trmm and syrk. Here, the difference is due to loads/stores with non-sequential accesses and could be improved by a compiler exploiting data-mapping techniques. The BBLP scores, instead, show a significant difference for cholesky and limited differences for bfs and syrk. These results highlight possible parallelism optimizations that can be performed by compilers.

Finally, the PBBLP score highlights the presence of data-parallel loops and gives an estimation of how much parallelism can be achieved using vectorization or loop-unrolling strategies. Applications with a high level of parallelism could benefit from NMC systems that provide multicore or SIMD architectures in the logic layer on top of the 3D-stacked memory.

5. Conclusions

Emerging computing architectures in their first stages of development such as near-memory computing (NMC) lack proper tools for specialized workload profiling. In this scope, we have extended PISA, a state-of-the-art application characterization tool, with NMC related metrics. Particularly, we have concentrated on analyzing the memory accesses and parallelism behaviors: data-level parallelism, basic-block level parallelism, memory entropy, and spatial locality. In a separate work we will explain the correlation between the proposed metrics and the performance on an NMC system.

Acknowledgements.
This work was performed in the framework of the Horizon 2020 program and is funded by the European Commission under the Marie Skłodowska-Curie Innovative Training Networks European Industrial Doctorate (Project ID: 676240). We would like to thank Fetahi Wuhib and Wolfgang John from Ericsson Research for their feedback on the draft of the paper.

References

  • Anghel et al. (2015) Andreea Anghel, Laura Mihaela Vasilescu, Rik Jongerius, Gero Dittmann, and Giovanni Mariani. 2015. An Instrumentation Approach for Hardware-Agnostic Software Characterization. International Journal of Parallel Programming 44 (2015), 924–948.
  • Asanovic et al. (2006) Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, et al. 2006. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley.
  • Cabezas (2012) V. Cabezas. 2012. A tool for analysis and visualization of application properties. Technical Report RZ3834, IBM.
  • Che et al. (2009) S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. Lee, and K. Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In 2009 IEEE International Symposium on Workload Characterization (IISWC). 44–54. https://doi.org/10.1109/IISWC.2009.5306797
  • Esmaeilzadeh et al. (2011) Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark Silicon and the End of Multicore Scaling. SIGARCH Comput. Archit. News 39, 3 (June 2011), 365–376. https://doi.org/10.1145/2024723.2000108
  • Gu et al. (2009) Xiaoming Gu, Ian Christopher, Tongxin Bai, Chengliang Zhang, and Chen Ding. 2009. A Component Model of Spatial Locality. In Proceedings of the 2009 International Symposium on Memory Management (ISMM ’09). ACM, New York, NY, USA, 99–108. https://doi.org/10.1145/1542431.1542446
  • Pouchet (2012) Louis-Noël Pouchet. 2012. Polybench: The polyhedral benchmark suite. http://web.cse.ohio-state.edu/~pouchet.2/software/polybench/
  • Shannon (1951) C. E. Shannon. 1951. Prediction and entropy of printed English. The Bell System Technical Journal 30, 1 (Jan 1951), 50–64. https://doi.org/10.1002/j.1538-7305.1951.tb01366.x
  • Singh et al. (2018) Gagandeep Singh, Lorenzo Chelini, Stefano Corda, Ahsan Javed Awan, Sander Stuijk, Roel Jordans, Henk Corporaal, and Albert-Jan Boonstra. 2018. A Review of Near-Memory Computing Architectures: Opportunities and Challenges. In 21st Euromicro Conference on Digital System Design, DSD 2018. 608–617.
  • Sophia Shao and Brooks (2013) Yakun Sophia Shao and David Brooks. 2013. ISA-independent workload characterization and its implications for specialized architectures. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 245–255.
  • Wulf and McKee (1995) Wm. A. Wulf and Sally A. McKee. 1995. Hitting the Memory Wall: Implications of the Obvious. SIGARCH Comput. Archit. News 23, 1 (March 1995), 20–24. https://doi.org/10.1145/216585.216588
  • Yen et al. (2008) Luke Yen, Stark C. Draper, and Mark D. Hill. 2008. Notary: Hardware Techniques to Enhance Signatures. In Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 41). IEEE Computer Society, Washington, DC, USA, 234–245. https://doi.org/10.1109/MICRO.2008.4771794