With the emergence of exascale computing, processors with many cores per chip and complicated cache designs have become common. Such designs bring a number of challenges, such as making efficient use of the available compute power and modeling the performance of these complex caches. Parallel applications that run on these multi-core processors try to leverage this computing power. One of the key factors that determines the performance of a parallel application on a multi-core processor is the availability of data to the cores. One way to measure data availability is through the application's cache utilization, which ultimately determines performance.
Moreover, modern processors contain shared caches, which significantly affect application performance through data locality and inter-process communication. These factors are both complex to analyze and hardware dependent. Simulation based on software/hardware co-design helps to better understand the behavior of applications and to study the impact of these factors on performance in a multi-core configuration. Co-design also helps to tune the performance of an application. Most co-design efforts have focused on obtaining simulation data from cycle-accurate dynamic instrumentation tools [2, 3, 4, 5]. However, these simulations require a large number of runs across many hardware configurations. Such configurations include variations in cache hierarchies, core counts, and problem sizes, all of which increase the complexity of design space exploration. Therefore, using cycle-accurate dynamic simulators to evaluate performance can be extremely challenging.
Reuse distance analysis is one of the most commonly used techniques for analyzing cache performance. Reuse distance is the number of unique memory references between two consecutive accesses to the same address. For sequential programs, reuse analysis is architecture-independent, whereas for parallel programs that run on multi-core processors, reuse distance depends on how the memory references of the threads interact. Therefore, on multi-cores, Concurrent Reuse Distance (CRD) profiles use a global stack to quantify reuse across thread-interleaved memory references, and thus account for data sharing and interaction between threads accessing shared caches. However, CRD profiles do not scale: as the core count increases, thread interactions increase and the memory traces grow large, which significantly changes the CRD profiles.
In this paper, we introduce a scalable reuse distance-based shared memory model to estimate the shared cache hit rates of different applications. We use a translator based on the Rose compiler to get the threaded version of a parallel code written in OpenMP. We develop a compiler-driven technique that identifies the basic blocks of the threaded program and measures the exact probability of executing each basic block. We explore different interleaving strategies in order to mimic the behavior of multi-threaded programs on shared memory processors. These strategies are carried out at the level of LLVM basic blocks. We take the memory references of the labeled trace generated from a sequential run of the program and apply an analytical probabilistic method to measure the reuse profiles of applications. Using these profiles, we measure the cache hit-rates of the applications. We evaluate our approach on two benchmarks, Breadth-First Search (BFS) and Matrix Multiplication (MatMul), on two processors, an Intel Core i7 and an Intel Xeon. We compare our results with the actual hit-rates calculated from the memory trace generated using the Valgrind Lackey tool. The results show that the model-predicted cache hit-rates are close to the actual hit-rates.
II-A Execution of Parallel Application: Fork-Join Model
OpenMP uses the fork-join model for parallel execution of a program. The program begins as a sequential application with a master thread. When the first parallel region construct is encountered, the master thread splits itself (forks) into a team of identical parallel threads. The forked threads have access to all the variables of the master thread, so those are shared variables. They also have private variables of their own and can identify themselves by a unique thread number. When the team threads finish executing all the statements in the parallel region, they synchronize and terminate (join), leaving only the master thread. Nested parallelism is also possible, where one of the team threads splits itself in turn.
II-B Reuse Distance Analysis
The reuse distance (D) of a memory address, also known as the LRU stack distance, is the number of unique memory references between two consecutive references to the same address. When a memory address is referenced for the first time, its reuse distance D is ∞. The reuse profile is the histogram of reuse distances over all memory references of a program. Reuse distance analysis measures the locality [11, 12] of an application, which can be used to predict the cache performance of that application [13, 14, 15] and to make cache management policy decisions. For a fully associative cache with capacity C, a memory reference always triggers a cache miss if D ≥ C. Fig. 1 shows the reuse distance calculation for a sample trace. In the example, the references with D = ∞ cause compulsory cache misses. With a cache size of 3, 13% of all memory references cause a capacity cache miss.
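The definition above can be made concrete with a minimal sketch. The trace below is a hypothetical example (it is not the trace of Fig. 1), and the function names are illustrative. The sketch keeps an explicit LRU stack, reports an infinite distance for first-time references, and counts a miss in a fully associative cache whenever D ≥ C.

```python
import math
from collections import Counter


def reuse_distances(trace):
    """Return the LRU stack distance of every reference in `trace`."""
    stack = []       # LRU stack: most recently used address is last
    distances = []
    for addr in trace:
        if addr in stack:
            idx = stack.index(addr)
            # Unique addresses touched since the previous access to `addr`.
            distances.append(len(stack) - 1 - idx)
            del stack[idx]
        else:
            # First reference to this address: infinite reuse distance.
            distances.append(math.inf)
        stack.append(addr)
    return distances


def reuse_profile(distances):
    """Histogram of reuse distances (the reuse profile)."""
    return Counter(distances)


def miss_ratio(distances, capacity):
    """Miss ratio of a fully associative LRU cache holding `capacity` blocks."""
    return sum(1 for d in distances if d >= capacity) / len(distances)


# Hypothetical trace, chosen only for illustration.
trace = ["a", "b", "c", "a", "b", "d", "a"]
dists = reuse_distances(trace)
profile = reuse_profile(dists)
print(dists)
print(miss_ratio(dists, 3), miss_ratio(dists, 2))
```

Because the profile is computed once and the capacity is only applied afterwards, the same `dists` list serves every cache size, which is the architecture independence the text describes for sequential programs.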
Reuse distance analysis is powerful in the sense that it is architecture-independent for sequential applications. The same reuse profile can be used to measure the performance of different cache sizes. This saves a significant amount of time in cache hit-rate analysis, as we do not have to collect memory traces for different cache configurations. A number of attempts [17, 18, 19] demonstrated the use of memory traces for reuse profile calculations. These approaches use binary instrumentation tools to collect memory traces. The memory traces used in most of these attempts are large and time-consuming to process, and the approaches therefore do not scale. However, recent attempts in [20, 21, 22, 23] demonstrated analytical models that scale using a small input run of a program.
II-C Multicore Reuse Distances
Most multicore processors contain both shared and private caches. Although the locality of references of a parallel program on a multicore processor is somewhat architecture-specific, it largely depends on the characteristics of the application itself. Each thread accesses the private cache of the core it runs on, while the shared cache is accessed by all cores. Two separate reuse profiles, the concurrent reuse distance (CRD) profile and the private-stack reuse distance (PRD) profile, are used to model shared and private caches, respectively. To measure the concurrent reuse profile, we interleave the memory references from all cores on a single LRU stack. This interleaving causes different types of interaction: dilation, overlap, and interception. Figure 2 shows the memory references from two cores. For the access to a at time 4, the CRD is 2 whereas its D is 1. Here the CRD is larger than D, which shows dilation. On the other hand, data sharing reduces dilation. For the reference to a at time 8, the CRD is 2 although there are 3 memory references between the two consecutive references to a at times 4 and 8. This shows overlap, as d is accessed by both cores inside the reuse interval of a. Finally, for the reference to b at time 9, the reused data itself is shared, so its CRD is 2, which is less than its D.
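The effect of interleaving on a single LRU stack can be sketched as follows. The per-core traces here are hypothetical and are not those of Figure 2, and the round-robin interleaving is an assumption made only to keep the example deterministic. The first pair of traces touches disjoint data and shows dilation; the second pair shares an address and shows how overlap reduces it.

```python
import math
from itertools import zip_longest


def stack_distances(trace):
    """LRU stack distance of every reference in `trace`."""
    stack, out = [], []
    for addr in trace:
        if addr in stack:
            idx = stack.index(addr)
            out.append(len(stack) - 1 - idx)
            del stack[idx]
        else:
            out.append(math.inf)
        stack.append(addr)
    return out


def interleave(per_core):
    """Round-robin merge of per-core traces (one assumed interleaving)."""
    merged = []
    for step in zip_longest(*per_core):
        merged.extend(a for a in step if a is not None)
    return merged


# Hypothetical two-core traces.
disjoint = [["a", "b", "a"], ["c", "d", "e"]]   # no shared data
sharing = [["a", "b", "a"], ["b", "d", "b"]]    # both cores touch "b"

prd_core0 = stack_distances(disjoint[0])             # private-stack view
crd_disjoint = stack_distances(interleave(disjoint))  # a c b d a e
crd_sharing = stack_distances(interleave(sharing))    # a b b d a b

print(prd_core0[2], crd_disjoint[4], crd_sharing[4])
```

In this sketch the second access to `a` has a private distance of 1, a concurrent distance of 3 when the other core touches only its own data, and a concurrent distance of 2 when the other core also touches `b`.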
Several recent works have focused on CRD profiles and performance prediction of the shared cache [26, 27, 28, 29, 30]. More recently, researchers have used analytical models and sampling to speed up performance prediction [24, 31, 32, 33, 34, 35]. All these models require trace collection from a parallel execution of the application for each thread count. In contrast, our model collects the trace once, from a sequential run of the application. From that trace, we predict shared cache performance for different numbers of threads. This makes our model highly scalable with core counts.
III Scalable Analytical Shared Memory Model
The scalable analytical shared memory model is a parameterized model for performance prediction of parallel codes. We leverage reuse distance analysis to determine the multicore reuse profile of a parallel program that runs on multiple cores. The reuse profiles are later used to determine the hit-rates at different levels of the cache hierarchy.
III-A Source Code Translation
In the first step, we convert the OpenMP application to an intermediate threaded code using the OpenMP translator in the Rose compiler. In the translation process, the parallel sections of the original code are transformed into intermediate threaded code. The threaded version of the code contains XOMP wrapper functions (generated by the Rose compiler) that call GNU OpenMP (GOMP) library functions when compiled with GCC. The private variables of the parallel sections are translated into local variables in the corresponding threaded version of the code. Each executing thread runs the XOMP wrapper functions and allocates memory for its own local variables. For the shared variables, the functions in the threaded version of the code receive a structure of pointers as a parameter. At the beginning of these functions, all the members of those structures are assigned to locally declared pointers. We create separate labels for these shared parts of the code, where the assignments happen, so that the memory references of the shared variables are grouped under the corresponding basic block labels (described in the next section).
III-B Shared Memory Trace Generation
In the second step, we generate an LLVM basic block labeled memory trace of the translated threaded program. The LLVM IR of the source code consists of basic blocks, each of which has a single entry point and a single exit point. To produce the trace, we execute the translated code sequentially with a small input size. We use LLVM-based instrumentation facilities to generate the basic block labeled memory trace of the translated program during this sequential execution. In this memory trace, the ith basic block (BBi) of the labeled trace contains all the memory addresses that are accessed as a result of executing the corresponding straight-line code of BBi. For each shared variable, marked with a label, we gather the corresponding memory references of those shared sections. Using the memory trace that results from the sequential execution, we mimic the behavior of the parallel program and generate the private memory trace of each executing thread.
As OpenMP works in the fork-join model, the parallel section of the OpenMP code is executed at the same time on different cores. Each core has its own copy of the parallel section of the code. Note that the master thread executes the sequential part of the code along with its share of the parallel section. We mimic this behavior by making copies of the memory references of each basic block of the parallel sections, so that the result replicates the memory trace of an OpenMP program on multiple cores. For example, if the parallel program uses 4 cores, we take four copies of a basic block and add an offset to the memory addresses for each of the cores. The basic blocks that we select belong to the parallel region of the code. The offset is applied to all the memory references of all the basic blocks of a parallel region, except for the memory references of the shared variables. This mimicking strategy marks the memory references as belonging to different cores.
We choose the offset such that the mimicked memory references do not coincide with the original memory references produced in the sequential execution. The original OpenMP execution supports different scheduling strategies for executing the parallel sections of the code. Recording memory traces for each such scheduling strategy is cumbersome and inefficient in both time and memory. Therefore, our model generates a trace that resembles the OpenMP scheduled traces. Here, we use the recorded sequential trace with offsets in order to mimic the interleaving of threads. Our interleaving strategy distributes the corresponding memory references equally among the executing threads, which is similar to static scheduling in OpenMP. We can also distribute the iterations to the cores according to adaptive chunk sizes. Exploring various interleaving and scheduling strategies is beyond the scope of this paper and is reserved for future work. In this way, we obtain the private memory trace of each executing core.
For shared memory, we take the labeled shared memory references from the sequential trace. These memory references also appear in the private traces described above, but we do not add any offset to them. As a result, these references have the same memory address across all cores in the mimicked trace. These memory references are interleaved in the same way as the private variables. Similar traces can be generated with binary instrumentation tools such as Valgrind and Pin; however, we use an LLVM-based tool to take advantage of the basic blocks of a program. The Valgrind Lackey tool runs the multi-threaded program sequentially per thread, where the interleaving of the threads is left to the operating system, so the resulting memory trace is a multi-threaded trace. With Pin, on the other hand, one has to produce a sequential trace and propose interleaving strategies. Moreover, we cannot derive a basic block labeled trace from Pin, as opposed to LLVM instrumentation. Once we have the memory trace that mimics the multi-core execution, we estimate the reuse distances for each reference in the trace.
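One possible reading of the mimicking step is sketched below. It is an illustration under stated assumptions rather than the tool used in this work: the trace format (a list of `(label, addresses)` pairs), the function name `mimic_trace`, the label names, the offset value, and the round-robin interleaving are all hypothetical. Private references are shifted by a per-core offset so that they never collide with the sequential addresses, while references under a shared label keep their address on every core.

```python
def mimic_trace(seq_trace, shared_labels, n_cores, offset):
    """Replicate a sequential basic-block trace for `n_cores` cores.

    seq_trace     : list of (label, [addresses]) for the parallel region
    shared_labels : labels whose references are shared (no offset applied)
    offset        : address shift, assumed larger than the address range used
    Returns (per_core, merged): the private traces and one interleaved trace.
    """
    per_core = []
    for core in range(n_cores):
        # (core + 1) keeps every copy away from the original addresses.
        shift = (core + 1) * offset
        core_trace = []
        for label, addrs in seq_trace:
            if label in shared_labels:
                core_trace.extend(addrs)                 # same address on all cores
            else:
                core_trace.extend(a + shift for a in addrs)
        per_core.append(core_trace)

    # All copies have equal length, so a simple round-robin merge suffices.
    merged = []
    for step in zip(*per_core):
        merged.extend(step)
    return per_core, merged


# Hypothetical labeled trace with one shared block and one private block.
seq_trace = [("BB_shared", [0x1000, 0x1008]), ("BB3", [0x2000, 0x2004])]
per_core, merged = mimic_trace(seq_trace, {"BB_shared"}, n_cores=2, offset=0x100000)
print([hex(a) for a in merged])
```

The merged trace can then be fed to any stack-distance routine; changing `n_cores` only changes the replication, not the recorded sequential trace.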
III-C Probabilistic Reuse Profile Estimation
In our next step, we analytically estimate the probabilistic concurrent reuse profile of the program (Pr(D)) from our mimicked shared memory trace. The conventional methods of measuring the reuse profile are expensive due to large memory traces. We use a technique described in [20, 21], which produces reuse distances at small input sizes of a program and, from those reuse distances, estimates reuse profiles at larger input sizes. We estimate the reuse profile of a program using Eq. 1.
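$$Pr(D) = \sum_{i=1}^{n(BB)} Pr(BB_i) \times Pr(D \mid BB_i) \qquad (1)$$
where $n(BB)$ is the number of basic blocks, $Pr(BB_i)$ is the a priori probability of executing basic block $BB_i$, and $Pr(D \mid BB_i)$ is the conditional probability of reuse distance $D$ given that $BB_i$ is executed.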
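As a rough illustration of how such a profile can be assembled, the sketch below assumes that Eq. 1 weights each basic block's conditional reuse-distance distribution by that block's a priori execution probability and sums over blocks. The function name, the dictionary layout, and the numbers are hypothetical.

```python
def combine_profiles(block_prob, cond_profile):
    """Mix per-block reuse-distance distributions into one program profile.

    block_prob   : {block label: a priori probability of executing the block}
    cond_profile : {block label: {reuse distance: probability given the block}}
    """
    profile = {}
    for bb, p_bb in block_prob.items():
        for dist, p_dist in cond_profile[bb].items():
            profile[dist] = profile.get(dist, 0.0) + p_bb * p_dist
    return profile


# Hypothetical two-block program.
block_prob = {"BB1": 0.25, "BB2": 0.75}
cond_profile = {
    "BB1": {0: 0.5, 4: 0.5},   # half of BB1's references reuse at distance 4
    "BB2": {0: 1.0},           # BB2 always reuses immediately
}
profile = combine_profiles(block_prob, cond_profile)
print(profile)
```

Since the block probabilities sum to one and each conditional distribution sums to one, the combined profile is again a probability distribution over reuse distances.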
III-D Hit Rate Estimation
With the probabilistic reuse profiles (Pr(D)), we measure the shared cache hit-rates using an analytical memory model, the stack distance based cache model (SDCM). Eq. 2 shows how to measure the hit-rate at a given reuse distance ().
where D is the reuse distance, A is the associativity of the cache, and B is the cache size in number of blocks (the cache size divided by the cache line size).
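A stack distance based hit probability can be sketched as below. This is a minimal sketch that assumes the commonly used binomial form of such models: each of the D intervening unique blocks maps to the referenced block's set with probability A/B, and the reference hits if fewer than A of them do. Whether Eq. 2 has exactly this form is an assumption, and the function names are illustrative.

```python
from math import comb, inf


def sdcm_hit_probability(d, assoc, blocks):
    """Probability of a hit at reuse distance `d` in an `assoc`-way cache
    with `blocks` blocks, under the assumed binomial set-mapping model."""
    if d == inf:
        return 0.0                      # first reference: compulsory miss
    p = assoc / blocks                  # chance a block lands in the same set
    # Hit if fewer than `assoc` of the d intervening blocks share the set.
    return sum(
        comb(d, a) * p**a * (1 - p) ** (d - a)
        for a in range(min(assoc, d + 1))
    )


def hit_rate(profile, assoc, blocks):
    """Overall hit rate: reuse profile weighted by per-distance hit probability."""
    return sum(p_d * sdcm_hit_probability(d, assoc, blocks)
               for d, p_d in profile.items())


# Fully associative case (assoc == blocks) reduces to "hit iff d < capacity".
print(sdcm_hit_probability(2, 4, 4), sdcm_hit_probability(4, 4, 4))
# Direct-mapped cache with 2 blocks and one intervening block.
print(sdcm_hit_probability(1, 1, 2))
```

Because `assoc` and `blocks` are plain parameters, the same reuse profile can be evaluated against many cache configurations without regenerating the trace.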
IV Experimental Results
We evaluate the proposed model on two different CPU architectures with two benchmarks. The two architectures are Intel Core i7 and Intel Xeon processors, and the benchmarks are breadth-first search (BFS) and matrix multiplication (MatMul). The shared cache () sizes of both the architectures are and respectively. To validate the model, we compare the model-predicted shared cache hit-rates with the hit-rates obtained from Valgrind. We use the Valgrind Lackey tool to get the memory trace of the benchmarks and compute reuse profiles, from which we measure the hit-rates. Since the Lackey tool instruments the binary of the application at runtime, we generate the memory trace by setting the number of threads through an environment variable. We run the experiments for core counts of 1, 2, 4, 8, 12, 16, 20, and 32. In each of these experiments we run Valgrind to record the traces and the reuse profiles, while our model runs the program on one core and mimics the execution for the other core counts.
Fig. 4 and Fig. 4 show the comparison of hit-rates between our model and the Valgrind Lackey tool for two different caches on the BFS application, at different numbers of processors, for an input size of 100 nodes. Fig. 6 and Fig. 6 show a similar comparison for MatMul with matrix sizes of 62x15 and 15x7. On the Intel Core i7, the average hit-rates of BFS are 98.49% for Valgrind and 96.77% for our model, while for MatMul they are 97.83% and 95.31%, respectively. Similarly, on the Intel Xeon, the hit-rates are 98.51% and 96.77% for BFS and 97.90% and 95.31% for MatMul, respectively. On both benchmarks and both cache configurations, the average hit-rates predicted by our model are within about 1.7 to 2.6 percentage points of the Valgrind-based hit-rates.
Reuse distance analysis has been a valuable tool for application locality prediction, cache modeling, and performance prediction. This paper extends reuse distance analysis to the parallel application domain by accounting for inter-thread interactions in shared caches in a static way. It statically predicts the memory trace of a parallel application on a shared cache from the memory trace of a sequential execution of the code. This makes the method scale well with core counts and cache sizes. The results show that the hit-rates predicted by our model are close to the measured shared cache hit-rates. Furthermore, the model takes the cache configuration parameters as input, which makes it suitable for design space exploration and cache sensitivity analysis.
In the future, we will extend the model to predict the hit-rates on private caches such as and/or . Furthermore, we will explore various OpenMP scheduling strategies with different interleaving strategies.
-  J. Shalf, S. Dosanjh, and J. Morrison, “Exascale computing technology challenges,” in High Performance Computing for Computational Science – VECPAR 2010, J. M. L. M. Palma, M. Daydé, O. Marques, and J. C. Lopes, Eds. Springer, 2011, pp. 1–25.
-  G. Sun, C. Hughes, C. Kim, J. Zhao, C. Xu, Y. Xie, and Y. Chen, “Moguls: A model to explore the memory hierarchy for bandwidth improvements,” in 2011 38th Annual International Symposium on Computer Architecture (ISCA), June 2011, pp. 377–388.
-  J. D. Davis, J. Laudon, and K. Olukotun, “Maximizing cmp throughput with mediocre cores,” in 14th International Conference on Parallel Architectures and Compilation Techniques (PACT’05), Sep. 2005, pp. 51–62.
-  M. Ekman and P. Stenstrom, “Performance and power impact of issue-width in chip-multiprocessor cores,” in Proceedings of the 2003 International Conference on Parallel Processing, Oct 2003, pp. 359–368.
-  Jaehyuk Huh, D. Burger, and S. W. Keckler, “Exploring the design space of future cmps,” in Proceedings of International Conference on Parallel Architectures and Compilation Techniques, 2001, pp. 199–210.
-  R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger, “Evaluation techniques for storage hierarchies,” IBM Systems Journal, vol. 9, no. 2, pp. 78–117, 1970.
-  C. Ding and T. Chilimbi, “A composable model for analyzing locality of multi-threaded programs,” Tech. Rep. MSR-TR-2009-107, August 2009.
-  C. Liao, D. J. Quinlan, T. Panas, and B. R. de Supinski, “A rose-based openmp 3.0 research compiler supporting multiple runtime libraries,” in Proceedings of the 6th International Conference on Beyond Loop Level Parallelism in OpenMP: Accelerators, Tasking and More, ser. IWOMP’10. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 15–28.
-  C. Lattner and V. Adve, “Llvm: A compilation framework for lifelong program analysis & transformation,” in Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, ser. CGO ’04. Washington, DC, USA: IEEE Computer Society, 2004, pp. 75–86.
-  N. Nethercote and J. Seward, “Valgrind: A framework for heavyweight dynamic binary instrumentation,” SIGPLAN Not., vol. 42, no. 6, pp. 89–100, 2007.
-  C. Ding and Y. Zhong, “Predicting whole-program locality through reuse distance analysis,” SIGPLAN Not., vol. 38, no. 5, pp. 245–257, 2003.
-  Y. Zhong, X. Shen, and C. Ding, “Program locality analysis using reuse distance,” ACM Trans. Program. Lang. Syst., vol. 31, no. 6, pp. 20:1–20:39, 2009.
-  K. Beyls and E. H. D’Hollander, “Reuse distance as a metric for cache behavior,” in Proceedings of the IASTED Conference on Parallel and Distributed Computing and Systems, 2001, pp. 617–662.
-  R. Sen and D. A. Wood, “Reuse-based online models for caches,” in Proceedings of the ACM SIGMETRICS/International Conference on Measurement and Modeling of Computer Systems, ser. SIGMETRICS ’13. New York, NY, USA: ACM, 2013, pp. 279–292.
-  C. Cascaval and D. A. Padua, “Estimating cache misses and locality using stack distances,” in Proceedings of the 17th Annual International Conference on Supercomputing, ser. ICS ’03. New York, NY, USA: ACM, 2003, pp. 150–159.
-  N. Duong, D. Zhao, T. Kim, R. Cammarota, M. Valero, and A. V. Veidenbaum, “Improving cache management policies using dynamic reuse distances,” in Proceedings of IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-45. IEEE, 2012, pp. 389–400.
-  C. Ding and Y. Zhong, “Reuse distance analysis,” Rochester, NY, USA, Tech. Rep., 2001.
-  E. Berg and E. Hagersten, “StatCache: a probabilistic approach to efficient and accurate data locality analysis,” in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2004, pp. 20–27.
-  S. V. den Steen, S. Eyerman, S. D. Pestel, M. Mechri, T. E. Carlson, D. Black-Schaffer, E. Hagersten, and L. Eeckhout, “Analytical processor performance and power modeling using micro–architecture independent characteristics,” IEEE Transactions on Computers, vol. 65, no. 12, pp. 3537–3551, 2016.
-  G. Chennupati, N. Santhi, S. Eidenbenz, and S. Thulasidasan, “An analytical memory hierarchy model for performance prediction,” in 2017 Winter Simulation Conference (WSC). IEEE, 2017, pp. 908–919.
-  G. Chennupati, N. Santhi, R. Bird, S. Thulasidasan, A. A. Badawy, S. Misra, and S. Eidenbenz, “A scalable analytical memory model for CPU performance prediction,” in Proceedings of the 8th International Workshop on High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, PMBS, S. Jarvis et al., Eds., Denver, CO, USA, 2017, pp. 114–135.
-  G. Chennupati, N. Santhi, and S. Eidenbenz, “Scalable performance prediction of codes with memory hierarchy and pipelines,” in Proceedings of the ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, SIGSIM-PADS, 2019, pp. 13–24.
-  X. Shen, J. Shaw, B. Meeker, and C. Ding, “Locality approximation using time,” in Proceedings of the 34th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, ser. POPL ’07. New York, NY, USA: ACM, 2007, pp. 55–61.
-  Y. Jiang, E. Z. Zhang, K. Tian, and X. Shen, “Is reuse distance applicable to data locality analysis on chip multiprocessors?” in Proceedings of the 19th Joint European Conference on Theory and Practice of Software, International Conference on Compiler Construction, ser. CC’10/ETAPS’10. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 264–282.
-  M.-J. Wu and D. Yeung, “Efficient reuse distance analysis of multicore scaling for loop-based parallel programs,” ACM Trans. Comput. Syst., vol. 31, no. 1, pp. 1:1–1:37, 2013.
-  C. Ding, X. Xiang, B. Bao, H. Luo, Y.-W. Luo, and X.-L. Wang, “Performance metrics and models for shared cache,” Journal of Computer Science and Technology, vol. 29, no. 4, pp. 692–712, Jul 2014.
-  S. Van den Steen and L. Eeckhout, “Modeling superscalar processor memory-level parallelism,” IEEE Computer Architecture Letters, vol. 17, no. 1, pp. 9–12, Jan 2018.
-  X. Shi, F. Su, J.-K. Peir, Y. Xia, and Z. Yang, “Modeling and stack simulation of cmp cache capacity and accessibility,” IEEE Trans. Parallel Distrib. Syst., vol. 20, no. 12, pp. 1752–1763, Dec. 2009. [Online]. Available: http://dx.doi.org/10.1109/TPDS.2009.31
-  Y. Zhong, S. G. Dropsho, X. Shen, A. Studer, and C. Ding, “Miss rate prediction across program inputs and cache configurations,” IEEE Transactions on Computers, vol. 56, no. 3, pp. 328–343, March 2007.
-  G. Ceballos, E. Hagersten, and D. Black-Schaffer, “Formalizing data locality in task parallel applications,” in Algorithms and Architectures for Parallel Processing. Cham: Springer International Publishing, 2016, pp. 43–61.
-  J. M. Sabarimuthu and T. G. Venkatesh, “Analytical derivation of concurrent reuse distance profile for multi-threaded application running on chip multi-processor,” IEEE Transactions on Parallel and Distributed Systems, vol. 30, no. 8, pp. 1704–1721, Aug 2019.
-  D. L. Schuff, M. Kulkarni, and V. S. Pai, “Accelerating multicore reuse distance analysis with sampling and parallelization,” in Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, ser. PACT ’10. New York, NY, USA: ACM, 2010, pp. 53–64.
-  R. K. V. Maeda, Q. Cai, J. Xu, Z. Wang, and Z. Tian, “Fast and accurate exploration of multi-level caches using hierarchical reuse distance,” in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb 2017, pp. 145–156.
-  E. Berg, H. Zeffer, and E. Hagersten, “A statistical multiprocessor cache model,” in 2006 IEEE International Symposium on Performance Analysis of Systems and Software, March 2006, pp. 89–99.
-  D. L. Schuff, B. S. Parsons, and V. S. Pai, “Multicore-aware reuse distance analysis,” in 2010 IEEE International Symposium on Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), April 2010, pp. 1–8.
-  V. J. Reddi, A. Settle, D. A. Connors, and R. S. Cohn, “Pin: A binary instrumentation tool for computer architecture research and education,” in Proceedings of the 2004 Workshop on Computer Architecture Education: Held in Conjunction with the 31st International Symposium on Computer Architecture, ser. WCAE ’04. ACM, 2004.
-  M. Brehob and R. Enbody, “An analytical model of locality and caching,” Tech. Rep. MSU-CSE-99-31, 1999.