When designing a multicore processor, there are many architectural decisions to make at various levels of the design. These decisions include number of cores; last level cache (LLC) capacity, line size, associativity, replacement policy, and distribution; cache hierarchy, coherency, and interconnects; lower level cache capacity, line size, associativity, and replacement policy; TLB size and associativity; branch predictors and training; multiply and divide latency and throughput; FPU latency and throughput; and many more decisions depending on the specific optimizations performed in the processor pipeline such as superscalar, out-of-order execution, and register renaming.
With all these decisions, it is desirable to have a quick way to evaluate different architectures. Unfortunately cycle-accurate simulation of large multicore processors is very time-consuming. Where a processor may be able to finish a benchmark in a couple of seconds, a cycle accurate simulator may take 3 to 5 orders of magnitude longer, ranging from hours to days depending on the complexity of the microarchitecture.
It is common for designers to sacrifice cycle-accuracy to gain simulation speedup, and to do this, they use approximate simulation methods including trade-offs like simplified processor and memory models, truncated simulation, and sampling. These approximate simulation methods allow for the exploration of many architectures, but they introduce quantitative errors in the reported results. If these errors are larger than the reported gains from an architectural improvement, then it is possible that the improvement shown in simulation does not translate to the real design. A deeper problem is that these errors are difficult to characterize and therefore difficult to bound quantitatively.
It would be desirable to have an upper bound on these quantitative errors to prove the effectiveness of an architectural improvement, but it would require a cycle-accurate model running alongside the approximate simulator to give these errors with enough certainty.
Instead of finding bounds on quantitative errors, we could just focus on the accuracy of qualitative results obtained by these simulators, such as which LLC replacement policy is better. If a simulator is shown to make comparisons between architectures accurately, then it could be used for accurate exploration of a large design space, and a cycle-accurate simulator could be used to simulate the final selection of parameters to get the accurate quantitative effects of the selected processor additions.
One simplified processor model that introduces a large quantitative error is the 1-IPC core approximation. This core approximation takes only one cycle to execute an instruction if there is no cache miss. While this core is clearly not useful for simulating differences in pipeline optimizations such as out-of-order execution, superscalar issue, and register renaming (since these all reduce to the same 1-IPC model), it could be useful for simulating architectural features outside of the pipeline, such as those mentioned in the first paragraph.
This paper offers a careful study of the qualitative usefulness of a simplified 1-IPC core model by inserting it into a full system cycle-accurate simulator and comparing the results in various architectural studies to the results generated by the 10-stage in-order accurate core model (ACC). Figure 1 shows a diagram of a typical experiment we would use to compare the two core models. In this example, we are still able to accurately rank the policies by the inexact metric results from the approximate model. Instead of comparing the simplified core model to a single reference machine, we change the machine with the simplified core model in tandem with the reference machine and show that the qualitative trends in each machine are very similar.
Our simplified 1-IPC core model is an approximation of a full system cycle-accurate PowerPC simulator: Arete. We chose Arete for its cycle-accuracy and for the ease of modifying it to create new machines to test new policies. Since the 1-IPC processor is derived from a fully cycle-accurate simulator, the 1-IPC processor contains a cycle-accurate cache hierarchy and it runs the same OS as the cycle-accurate processor. Since the core model is the only difference between the two systems, the effects of the 1-IPC core model approximation are isolated in our studies.
This paper makes several key contributions to the study of approximate core models in architectural simulation:
This paper proposes a new methodology of validating simplified simulation models, which focuses on the trends of metric values across benchmarks and architectures, instead of errors of absolute metric values.
This paper presents an in depth, side-by-side comparison of a 1-IPC model with a cycle-accurate memory system against a fully cycle-accurate processor.
This paper shows that the two models agree in most cases, and that mismatches mostly occur when the magnitude of the difference between the two choices is on the same order as the run-to-run variation found in the cycle-accurate model.
This paper employs a simple method previously proposed in the field of statistics to estimate the variation of the cycle-accurate model.
Throughout this paper we investigate the difference between the two models in various settings and across various metrics. Section II covers related work. Section III presents our methodology to validate simplified models. Section IV includes an overview of the accurate and simplified core models and their implementation in the Arete simulator. Section V uses both models to evaluate three LLC replacement policies. Section VI presents a study on the scalability of multithreaded benchmarks on processors varying from a single core up to 16 cores with both types of core models. Section VII compares three different branch predictors across the two core models for branch predictor accuracy. Section VIII concludes the paper.
II Related Work
Computer architects rely heavily on simulation to evaluate new techniques, and the community has developed numerous simulators with various degrees of accuracy. For example, GEMS, M5 and MARSS can achieve cycle-accuracy but run at relatively low speed, typically around hundreds of KIPS (thousand instructions per second). Many other simulators trade off accuracy for higher speed, such as COTSon, which adopts a functional-directed simulation methodology.
In recent years, a class of simulators, such as CMP$im, Graphite, Sniper, and ZSim, have been built on top of Pin, a dynamic binary instrumentation framework developed by Intel. Pin-based simulators are able to run at much higher simulation speed than sequential cycle-accurate simulators by leveraging native direct execution on the host machine, while still achieving good accuracy by using complex data structures and algorithms to track timing events. For example, Sniper is developed by combining the framework of Graphite with interval simulation, a technique that takes into account not only the delays caused by various miss events (such as cache misses and branch mispredictions) but also the possible overlap of such miss events, especially in out-of-order processors. Though fairly accurate, these simulators are typically not cycle-accurate.
For these non-cycle-accurate simulators, their designers have all carefully justified the simplifications that have been made, and may have also validated the simulator against the real machine. For example, ZSim has been validated against a 6-core Westmere machine. However, the users of these simulators may simply take the simulator as accurate and change the parameters and target architecture according to their needs. It is possible that such changes will invalidate the designer’s original justification for the simplifications in the simulator, because the user may be unaware of the designer’s logic and argument. Nowatzki et al. enumerate eight common pitfalls of some modern simulators, which may induce large simulation error if the user is unaware of them. Therefore, whether a simulator with simplifications shows trends similar to those of the real machine remains a concern.
There are several early studies focusing on simulator validation. Gibson et al.  compared applications’ execution time derived from several FLASH simulators against the actual execution time on a real FLASH machine. Desikan et al.  validated the sim-alpha simulator against a real Compaq DS-10L workstation, and they mainly focused on the IPC error. Cain et al.  built a precise and accurate PowerPC processor model, and used it as the reference model in evaluating other simulators with simplifications. They conducted comparison using various metrics.
Our study differs from the previous work mainly in two aspects. The first is that we focus on whether the simplified model shows trends similar to those of the reference model, while previous work concentrated on absolute errors in performance metrics. Although it is always useful to get accurate metric values from simulation, we argue that the primary goal of using a simulator is to explore new architectural changes and see whether they lead to improvement. Therefore, a simulator can be considered “accurate” if it correctly predicts, in qualitative terms, the improvement caused by the change. The second difference is that we apply architectural changes to our reference model (i.e. the cycle-accurate model), since we want to see whether the simplified model can predict the effects of changes correctly. In previous studies no change was made to the reference model. The two studies by Gibson and Desikan were unable to do this since they used real machines, which could not be modified, as the reference model. In particular, Desikan et al. did make changes to the target architecture, but they could only evaluate the improvement on the simulators instead of the reference model (i.e. the real machine). If we want to evaluate the impact of an architectural change on the real machine, we effectively need at least two machines, one with the change and the other without. Our cycle-accurate simulator has the flexibility that enables us to modify the microarchitecture. Although Cain et al. also had such flexibility, they did not choose to use it.
Previous studies focusing on the inaccuracy of the 1-IPC core model have paid special attention to the influence of not modelling wrong path memory references in out-of-order cores. Mutlu et al.  studied the effects of wrong path memory references on the performance of an out-of-order superscalar uniprocessor. They argued that wrong path memory references can pollute the L2 cache and may act as prefetches. Ignoring them will cause a large error in the measured IPC. Sendag et al.  studied the impact of wrong path memory references on a 16-core (out-of-order) shared-memory multiprocessor. They showed a substantial portion of cache access, coherence traffic, replacement, etc. are introduced by wrong path memory references. Since our target architecture is an in-order core and all data memory accesses are non-speculative, we do not have these effects in our experiments.
Besides using simplified models, benchmark sampling is another technique to speed up architectural simulation. Yi et al. evaluated two sampling techniques, SimPoint and SMARTS, as well as other simulation techniques such as truncated simulation. They used a cycle-accurate simulator as the baseline and applied two architectural enhancements to it separately. They then investigated whether the improvement observed with each simulation technique matched the improvement observed when none of the techniques was used. We conduct our experiments in a similar way. In order to justify the usefulness of the simplified model, we apply changes to the baseline architecture and see whether the simplified model shows trends similar to those of the cycle-accurate model.
III Methodology on Validating Simplified Models
The absolute metric values measured on a simplified model generally cannot match those on the accurate model. Instead of studying the error of absolute metric values, our validation methodology focuses on whether the simplified model matches the accurate model in terms of the relative trends of results across different benchmarks and architectures.
Trends across benchmarks: When studying the trend of metric values across benchmarks, we fix the architecture under evaluation. For each simulation model (i.e. simplified and accurate), we construct a vector that contains the measured metric values of all benchmarks. After normalizing the vectors, we can compare the distributions across workloads between the simplified model and the corresponding accurate model. This shows how much the simplified model distorts the characteristics of the results across benchmarks.
Trends across architectures: The primary goal of simulation is to compare two architectures and find out the better one. For each simulation model and benchmark, we can compute the ratio of the metric values of two architectures. We refer to such ratios as improvement ratios, because they represent the improvement of one architecture over the other. We can now study the following questions about the fidelity of the simplified model:
Qualitatively, does the simplified model always agree with the accurate model in terms of which architecture is better for each benchmark? In case of disagreement, we study the variation of the results of the accurate model. If the variation is so large that one may reach a wrong qualitative conclusion when running the experiment only once with the accurate model, then the architecture itself should be considered brittle.
Quantitatively, how large is the error of the improvement ratios from the simplified model compared to the run-to-run variation on the accurate model? If the error is comparable to or even smaller than the variation, such error should not be viewed as problematic when we cannot run accurate simulation multiple times to reduce the variation.
We will follow the above methodology to validate the simplified model that uses the 1-IPC core in the case study of LLC replacement policy in Section V.
Our 1-IPC simulator and our accurate simulator are based on the FPGA-based Arete simulator (we obtained and modified the source code of Arete with the permission of the authors). Arete is a full system simulator that implements the Power ISA Embedded Environment and boots Linux.
The processor cores in Arete are 10-stage in-order pipelines modeled cycle-accurately using the Latency Insensitive Bounded Network (LI-BDN) technique. LI-BDN allows for refinements of the simulator implementation to reduce the FPGA resource budget and increase the clock speed while preserving cycle-accuracy.
Figure 2 (copied from  with the permission of the authors) shows the structure of the 10-stage in-order pipelines. The front end of the pipeline is five stages long and includes instruction fetching, branch prediction, instruction decoding, and cracking complex instructions into multiple simple instructions. The back end of the pipeline is also five stages long and includes reading the register file, resolving branches, accessing memory, executing instructions, handling exceptions, and writing results back to the register file. This pipeline does not include a floating point unit, so all benchmarks are compiled with software floating point operations.
The original Arete simulator connects the core models through a shared L2 cache that lacks a detailed timing model. For these experiments, we wanted cycle-accuracy at the processor-level, not just the core-level. To get this, we expanded the LI-BDNs of the cores to include the L2 cache and main memory. This resulted in detailed timing from the cores up to the memory hierarchy.
Table I shows the base simulator settings of a 4-core system and the timing models used for these experiments. Each simulator used in each of the experiments was implemented on a VC707 FPGA board.
|Core||4 × 10-stage in-order pipeline, 2GHz frequency; 256-entry branch target buffer (BTB); tournament branch predictor from Alpha 21264; 64-entry return address stack (RAS)|
|L1 I cache||4 × 32KB, 4-way set associative, 64-byte block; 1-cycle pipelined hit latency; blocking access (only 1 request in flight); true LRU replacement|
|L1 D cache||4 × 32KB, 4-way set associative, 64-byte block; 1-cycle pipelined hit latency; blocking access (only 1 request in flight); true LRU replacement|
|L2 cache||1 × 2MB, 8-way set associative, 64-byte block; shared by all L1 caches, MSI coherence protocol; 1-cycle tag access, 8-cycle pipelined data access; at most 8 requests in flight; true LRU replacement|
|Memory||12GB, 120-cycle access latency; at most 12 requests in flight; 12.8GB/s peak bandwidth|
The 1-IPC core model used throughout this paper only stalls during L1 I/D cache misses. Otherwise it will issue, execute and commit each instruction in a single cycle. This simplified core model is derived from the cycle-accurate core model through two steps. First, all the FIFOs that connect adjacent pipeline stages are replaced with bypass FIFOs so that an instruction can flow through all stages in one cycle if there is no cache miss. Second, appropriate stall logic is added that feeds into Fetch-1 and Crack stages in order to ensure that these two stages do not issue new instructions when there are still outstanding instructions in later pipeline stages. With this stall logic, the whole pipeline will only have at most one in-flight instruction when the Crack stage is not active. When the Crack stage is active, the stall logic ensures there is at most one outstanding instruction in the back end of the pipeline and one (i.e. the complex instruction being cracked) in the front end. By applying these 2 changes, the behavior of the pipeline will match our definition of the 1-IPC core model.
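As a rough illustration of the timing behavior described above, the following Python sketch counts cycles for an idealized 1-IPC core. It is only an abstract model: the per-instruction stall counts are hypothetical inputs, whereas in the simulator they are produced by the cycle-accurate memory hierarchy, and the function names are ours.

```python
def one_ipc_cycles(miss_stalls):
    """Total cycles for an idealized 1-IPC core.

    miss_stalls[i] is the (hypothetical) number of cycles instruction i
    stalls on L1 I/D cache misses; it is 0 when the instruction hits.
    Every instruction otherwise issues, executes and commits in one cycle.
    """
    return sum(1 + stall for stall in miss_stalls)


def one_ipc_ipc(miss_stalls):
    """Instructions per cycle achieved under the same assumptions."""
    return len(miss_stalls) / one_ipc_cycles(miss_stalls)
```

With no misses the sketch yields an IPC of exactly 1, and every stall cycle lowers it, which is the only way the memory system influences this core model.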
For the rest of the paper, we use ACC for the cycle-accurate 10-stage in-order core model for convenience, and we also use ACC model and 1-IPC model to denote the full system simulators that include ACC core models and 1-IPC core models respectively. Note that the only difference between ACC model and 1-IPC model is the core model; the memory hierarchy and rest of the system is always simulated with cycle-accuracy.
V Study 1: Last Level Cache Replacement Policy
In this section, we evaluate three LLC replacement policies on the ACC and 1-IPC models. For this experiment, we used the 4-core configuration shown in Table I. We created three versions of this simulator, one for each policy.
V-A Candidate Policies
This LLC replacement experiment compared the following three replacement policies.
True LRU.
TADIP (Thread Aware Dynamic Insertion Policy).
This policy always evicts the LRU cache line, but inserts a new line into either the LRU or the MRU position. It uses set duelling to dynamically select between two insertion policies: always inserting into the MRU position, and the Bimodal Insertion Policy (BIP). BIP inserts the new line into the LRU position with probability $1-\epsilon$ and into the MRU position with probability $\epsilon$. We set $\epsilon$ to , and we assign an MRU bit to each line to approximate the LRU replacement list.
TADRRIP (Thread Aware Dynamic Re-Reference Interval Prediction).
This policy builds the replacement list using the 2-bit re-reference interval prediction value (RRPV) of each cache line. It also uses set duelling to dynamically select between two policies: a static one and a bimodal one. The major difference between them is that the static one always initializes the RRPV of the new cache line to 2, while the bimodal one initializes it to 3 with probability $1-\epsilon$ and to 2 with probability $\epsilon$. Here we also set $\epsilon$ to .
We implemented the set duelling mechanism with feedback described in  for both TADIP and TADRRIP. The important parameters are listed here:
four 10-bit policy select counters, one for each core
8 set duelling monitors (SDMs), two for each core
32 dedicated cache sets for each SDM
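To make the set duelling idea more concrete, here is a deliberately simplified, single-thread Python sketch that chooses between MRU insertion and BIP with one saturating 10-bit counter. It is not the thread-aware mechanism with feedback used in our simulators: the class name, the choice of dedicated sets, and the value of `EPSILON` are assumptions made only for illustration.

```python
import random

PSEL_BITS = 10                     # width of the policy select counter
PSEL_MAX = (1 << PSEL_BITS) - 1
PSEL_MID = PSEL_MAX // 2
EPSILON = 1 / 32                   # ASSUMPTION: illustrative value only


class SetDuelingSelector:
    """Single-thread set duelling between MRU insertion and BIP."""

    def __init__(self, mru_sdm_sets, bip_sdm_sets, rng=None):
        self.mru_sdm = set(mru_sdm_sets)   # sets dedicated to MRU insertion
        self.bip_sdm = set(bip_sdm_sets)   # sets dedicated to BIP
        self.psel = PSEL_MID
        self.rng = rng or random.Random(0)

    def record_miss(self, set_index):
        # A miss in a dedicated set counts against that set's policy.
        if set_index in self.mru_sdm:
            self.psel = min(PSEL_MAX, self.psel + 1)
        elif set_index in self.bip_sdm:
            self.psel = max(0, self.psel - 1)

    def policy_for(self, set_index):
        if set_index in self.mru_sdm:
            return "MRU"
        if set_index in self.bip_sdm:
            return "BIP"
        # Follower sets use the policy whose dedicated sets missed less.
        return "BIP" if self.psel > PSEL_MID else "MRU"

    def insert_position(self, set_index):
        if self.policy_for(set_index) == "MRU":
            return "MRU"
        # BIP: MRU with probability EPSILON, LRU otherwise.
        return "MRU" if self.rng.random() < EPSILON else "LRU"
```

In the thread-aware variants there is one such counter and one pair of dedicated set groups per core, which matches the four counters and eight SDMs listed above.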
For convenience, we use LRU, DIP and DRRIP to stand for true LRU, TADIP and TADRRIP in the rest of the paper.
We choose 6 single-thread applications from 3 benchmark suites as shown in Table II. Four applications are taken from the SPEC CINT2006 benchmark suite, pointer is taken from the DIS Stressmark benchmark suite, and stream is taken from the STREAM benchmark suite. pointer performs random memory accesses inside a 4 MB buffer. stream iterates through three arrays, each 781 KB long.
We only choose integer benchmarks here because floating point operations are done in software due to the lack of a floating point unit in Arete, so the proportion of memory access instructions would become very low if we ran floating point benchmarks. We do not use the other benchmarks in SPEC CINT2006 for various reasons: some fail to stress the cache, some cannot be cross-compiled to PowerPC, and others take too long to finish.
|Benchmark suite||Application||Input size and parameter|
|SPEC CINT2006||bzip2||train input, byoudoin.jpg|
|SPEC CINT2006||gobmk||test input, connect.tst|
|DIS Stressmark||pointer||p12.in (in the new input set)|
|STREAM||stream||array element type unsigned int, array size 200000, kernel is repeated for 1000 times|
With these 6 applications, we generate all 15 possible multiprogrammed mixes of four applications for our 4-core system, ordered alphabetically by the names of the single-thread applications. Table III shows part of the 15 multiprogrammed workloads. We will evaluate the replacement policies using these 15 workloads.
|Workload ID||Process 0||Process 1||Process 2||Process 3|
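The enumeration of workload mixes can be sketched in a few lines of Python. Choosing 4 of 6 applications gives 15 combinations. The names `spec_app_a` and `spec_app_b` below are placeholders for the two SPEC CINT2006 applications not named in this text, and numbering the workloads from 1 is an assumption.

```python
from itertools import combinations


def workload_mixes(app_names, cores=4):
    """Map workload IDs to application mixes in alphabetical order."""
    ordered = sorted(app_names)
    # combinations() emits tuples in lexicographic order of the sorted input.
    return dict(enumerate(combinations(ordered, cores), start=1))


# Placeholder names stand in for applications not listed in the text.
APPS = ["bzip2", "gobmk", "pointer", "stream", "spec_app_a", "spec_app_b"]
MIXES = workload_mixes(APPS)
```

Each tuple lists the applications assigned to processes 0 through 3 of one workload.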
V-C Measurement Methodology
To test the LLC replacement policies, the multiprogrammed workloads listed in Table III are run on each simulator where each process is pinned to a specific core. For each single-thread application, two special instructions progBeginTrap and progEnd are inserted into the beginning and the end of the program respectively. When the process reaches the progBeginTrap instruction, the core will spin on this instruction until all the processes have reached this instruction. Simulation will terminate when all cores have executed the progEnd instruction at least once; programs that reach the progEnd instruction early restart. The number of cycles simulated for each workload ranges from 30 billion to 70 billion.
We collect statistics from the moment when all cores leave the progBeginTrap instructions to the time that simulation terminates. The following three metrics are measured for evaluation:
L2 cache misses per 1000 instructions (MPKI);
total throughput $\sum_i \mathit{IPC}_i$, where $\mathit{IPC}_i$ is the instructions per cycle metric (IPC) for core $i$; and
weighted speedup $\sum_i \mathit{IPC}_i / \mathit{IPC}_i^{\text{single}}$, where $\mathit{IPC}_i^{\text{single}}$ is the IPC metric for the program on core $i$ when the program runs in isolation on a single core with 512KB L2 cache using LRU policy.
In the rest of the paper, we will use MPKI, TTP, and WSU to refer to these metrics in cases without ambiguity.
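The three metrics can be computed as in the following Python sketch. It assumes the usual definitions: MPKI from aggregate L2 misses and aggregate instructions, TTP as the sum of per-core IPCs, and WSU as the sum of per-core IPCs each divided by the corresponding single-core IPC. The function and argument names are ours.

```python
def mpki(l2_misses, instructions):
    """L2 cache misses per 1000 committed instructions."""
    return 1000.0 * l2_misses / instructions


def total_throughput(ipcs):
    """Sum of the per-core IPC values (TTP)."""
    return sum(ipcs)


def weighted_speedup(ipcs, single_core_ipcs):
    """Sum of per-core IPCs normalized by each program's isolated IPC (WSU)."""
    return sum(ipc / alone for ipc, alone in zip(ipcs, single_core_ipcs))
```

Because WSU divides by the single-core IPC measured on the same core model, part of the systematic IPC gap between the two models cancels out in this metric.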
In order to derive the WSU metric, we also ran all the single-thread programs seven times on both the ACC and 1-IPC models for the single-core processor with 512KB L2 cache using LRU policy. Note that the WSU metric for the 1-IPC model should be calculated using the single-core IPC of the 1-IPC model. We found that the variation among the single-core IPC results is very small: the standard deviation is less than 0.24% of the mean value for each program. Therefore we only use the mean values of the measured single-core IPC in the calculation of WSU, and ignore the variation of the single-core IPC.
As for the multiprogrammed workloads, we ran each of them seven times on both the ACC and 1-IPC models. In particular, the multiple runs on the ACC model capture the variability of the metrics, which is mostly due to operating system effects and non-determinism in the replacement policies. Figure 3 shows the mean values (the cross markers) of the three metrics over all runs for each workload and each (model, policy) pair. The standard deviations of the measured ACC metric values are represented by the distances from the horizontal bars to the cross markers, and are fairly small.
In Figure 3, we see that the results for the 1-IPC and ACC models are close for WSU, while they fail to match in magnitude for MPKI and TTP. Despite the difference in magnitudes, we will follow the methodology in Section III to show that the two models exhibit the same trends across workloads and replacement policies.
Comparing trends across workloads: We first focus on the difference between the ACC and 1-IPC models for a fixed metric and policy. To quantify the trend of the metric across workloads under the fixed policy, we form the vector $\mathbf{v}_m$ containing the average metric values for each workload over all runs using model $m$. To isolate the trends across workloads, we first normalize each vector so that its norm becomes 1, giving $\hat{\mathbf{v}}_m = \mathbf{v}_m / \lVert \mathbf{v}_m \rVert$. We then calculate the Euclidean distance between the normalized vectors of the two models (i.e. $\hat{\mathbf{v}}_{\text{ACC}}$ and $\hat{\mathbf{v}}_{\text{1-IPC}}$) to get an estimate of the similarity of their trends.
Table IV shows the distances between the normalized metric vectors of two models for all combinations of policies and metrics. These distances are much smaller than 1, so using the 1-IPC model does not significantly distort the characteristics of the results for each workload.
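The distance computation can be sketched as follows in Python, assuming the Euclidean norm is used for normalization. The function names are ours.

```python
import math


def normalize(values):
    """Scale a metric vector so that its Euclidean norm is 1."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values]


def trend_distance(acc_values, one_ipc_values):
    """Euclidean distance between the two normalized metric vectors."""
    return math.dist(normalize(acc_values), normalize(one_ipc_values))
```

Two vectors that differ only by a constant scale factor have distance 0, while two non-negative vectors can be at most $\sqrt{2}$ apart, so a distance much smaller than 1 indicates that the two models rank and weight the workloads similarly.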
Comparing improvement ratios: Next we explore the difference between using ACC and 1-IPC models when comparing two policies. The improvement ratio is the ratio of the metric value of the new policy over that of the baseline policy. Besides an average ratio calculated using the mean values of the metrics shown in Figure 3, we can estimate the variation of the improvement ratios on the ACC model, which will be later on compared with the error induced by the inaccuracy of the 1-IPC model.
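Assume $X$ and $Y$ are random variables that represent the metric values of the new policy and the baseline policy respectively for the same workload on the ACC model. Then the improvement ratio is $R = X/Y$. We further assume that $X$ and $Y$ follow the normal distributions $N(\mu_X, \sigma_X^2)$ and $N(\mu_Y, \sigma_Y^2)$ respectively. The exact probability distribution of $R$ is very complex, but we can simplify it based on the observation that the variations of metric values are much smaller than the absolute metric values, as shown in Figure 3. Therefore we can perform a first-order Taylor expansion on $R$ as follows ($\Delta X$ and $\Delta Y$ represent $X - \mu_X$ and $Y - \mu_Y$ respectively):
$$R = \frac{\mu_X + \Delta X}{\mu_Y + \Delta Y} \approx \frac{\mu_X}{\mu_Y}\left(1 + \frac{\Delta X}{\mu_X} - \frac{\Delta Y}{\mu_Y}\right)$$
Then $R$ approximately follows a normal distribution $N(\mu_R, \sigma_R^2)$, in which, treating $X$ and $Y$ as independent,
$$\mu_R = \frac{\mu_X}{\mu_Y}, \qquad \sigma_R^2 = \mu_R^2\left(\frac{\sigma_X^2}{\mu_X^2} + \frac{\sigma_Y^2}{\mu_Y^2}\right)$$
We can estimate $\mu_R$ and $\sigma_R$ by substituting $\mu_X$, $\sigma_X$, $\mu_Y$ and $\sigma_Y$ with the mean values and standard deviations of the measured ACC results shown in Figure 3.
We use the range $[\mu_R - 1.28\sigma_R,\ \mu_R + 1.28\sigma_R]$ as an estimate of the run-to-run variation of the ACC improvement ratio. Since the probability of $R$ falling into this range is 80% for the normal distribution of $R$, if one conducts the experiment only once, there is a 10% probability that the result is larger than the whole range, and another 10% probability that the result is smaller than the whole range.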
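A minimal Python sketch of this estimate is given below. It applies the first-order (delta method) approximation for the ratio of two independent quantities and then forms the central 80% interval of the resulting normal distribution. Using the sample standard deviation is an assumption, and the function names are ours.

```python
import math
from statistics import mean, stdev

Z_80 = 1.2816  # 90th percentile of the standard normal distribution


def ratio_stats(new_runs, base_runs):
    """Approximate mean and standard deviation of the improvement ratio."""
    mu_x, mu_y = mean(new_runs), mean(base_runs)
    sd_x, sd_y = stdev(new_runs), stdev(base_runs)
    mu_r = mu_x / mu_y
    # First-order approximation: relative variances of X and Y add up.
    sd_r = mu_r * math.sqrt((sd_x / mu_x) ** 2 + (sd_y / mu_y) ** 2)
    return mu_r, sd_r


def variation_range(mu_r, sd_r, z=Z_80):
    """Central 80% interval of the approximating normal distribution."""
    return mu_r - z * sd_r, mu_r + z * sd_r
```

For example, `variation_range(*ratio_stats(dip_mpki_runs, lru_mpki_runs))` would give the interval for one workload, where the two argument lists are hypothetical names for the seven measured values per policy.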
Figures 4 to 6 show the improvement ratios (the cross markers) calculated using mean metric values in Figure 3 for each workload, when comparing each pair of replacement policies using different metrics on both ACC and 1-IPC models. The 80% variation range of each ratio on the ACC model is illustrated by the interval between horizontal bars in Figures 4 to 6.
Table V shows the IDs of the workloads for which the decisions from the 1-IPC model fail to match the decisions from the ACC model for each pair of policies, that is, the improvement ratios of the ACC and 1-IPC models are on opposite sides of 1. Note that mismatches on the decisions only happen in the comparison of DIP against LRU, so we do not list the other two comparisons in the table.
|Policy pair||MPKI as metric||TTP as metric||WSU as metric|
|DIP vs. LRU||2,4,5,7,8,11||2,4,5,7,8,11||2,4,7|
We notice that the 1-IPC model matches the ACC model exactly in the comparisons of DRRIP versus LRU and DRRIP versus DIP, but there are several mismatches when comparing DIP against LRU. If we look at Figure 4 in more detail, we find that the improvement of DIP over LRU is not obvious. Furthermore, the variation ranges imply that the wrong decisions derived from the 1-IPC model may also be drawn if one only conducts the experiment once even using the ACC model.
For example, in Figure 4(a), among the six workloads that exhibit mismatches when using MPKI as the metric, the variation ranges of the ACC model for four of them (ID: 2, 4, 8, 11) cross the ratio line corresponding to 1. Namely, for each of these four workloads, a single experiment using the ACC model has at least a 10% probability of leading to the same qualitative decision as the one implied by the 1-IPC model. Similar situations happen for workloads 2, 8 and 11 in Figure 4(b) when using TTP as the metric, and for workloads 2 and 4 in Figure 4(c) when using WSU as the metric.
Workloads that exhibit clear mismatches (i.e. excluding workloads whose ACC variation ranges cross the ratio line corresponding to 1) only exist in the comparison of DIP against LRU, and are shown in Table VI. Considering that we only have 6 clear mismatches among 135 comparisons (3 metrics × 3 pairs for comparison × 15 workloads), the 1-IPC model is qualitatively quite accurate. Furthermore, since the 1-IPC model agrees exactly with the ACC model on the clear improvement of DRRIP over the other policies, while the advantage of DIP over LRU is insignificant, we may conclude that the 1-IPC model is qualitatively accurate in showing a clear improvement, while it may fail to match the ACC model when comparing policies that yield similar performance.
|Policy pair||MPKI as metric||TTP as metric||WSU as metric|
|DIP vs. LRU||5,7||4,5,7||7|
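The distinction between matches, mismatches that fall within the ACC variation, and clear mismatches can be expressed as a small Python function. This is only a sketch of the classification rule described above. The function name and return labels are ours, and treating a ratio of exactly 1 as a mismatch is an arbitrary choice.

```python
def classify(acc_ratio, acc_low, acc_high, one_ipc_ratio):
    """Compare the qualitative decisions of the two models for one workload.

    acc_low and acc_high bound the variation range of the ACC ratio.
    """
    same_side = (acc_ratio - 1.0) * (one_ipc_ratio - 1.0) > 0
    if same_side:
        return "match"
    if acc_low < 1.0 < acc_high:
        # A single ACC run could plausibly land on the 1-IPC side of 1.
        return "mismatch within ACC variation"
    return "clear mismatch"
```

Applying this rule to every (metric, policy pair, workload) combination yields the entries of Tables V and VI.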
Quantifying the impact of using 1-IPC model:
In order to quantify the impact of using the 1-IPC model instead of the ACC model, we compare the differences between the two models against the variation of the ACC model in terms of the geometric means of the improvement ratios. We choose to study the geometric mean because we can simply compare it to 1 and determine the better policy.
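We can also estimate the variation of the geometric mean on the ACC model in a way similar to the one we used for the improvement ratio of each single workload. Assume that $R_i$ ($i = 1, \ldots, n$, where $n = 15$ is the number of workloads) is the random variable of the improvement ratio of a certain metric for workload $i$ when comparing two policies on the ACC model, and that $G$ is the geometric mean over all $R_i$. We have shown earlier that $R_i$ approximately follows a normal distribution $N(\mu_{R_i}, \sigma_{R_i}^2)$, and we have calculated $\mu_{R_i}$ and $\sigma_{R_i}$. Since $\sigma_{R_i}$ is also much smaller than $\mu_{R_i}$, we can again perform a first-order Taylor expansion on the geometric mean ($\Delta R_i$ represents $R_i - \mu_{R_i}$):
$$G = \left(\prod_{i=1}^{n} R_i\right)^{1/n} = \left(\prod_{i=1}^{n} (\mu_{R_i} + \Delta R_i)\right)^{1/n} \approx \left(\prod_{i=1}^{n} \mu_{R_i}\right)^{1/n}\left(1 + \frac{1}{n}\sum_{i=1}^{n} \frac{\Delta R_i}{\mu_{R_i}}\right)$$
Therefore the geometric mean of improvement ratios also approximately follows a normal distribution $N(\mu_G, \sigma_G^2)$, where, treating the $R_i$ as independent,
$$\mu_G = \left(\prod_{i=1}^{n} \mu_{R_i}\right)^{1/n}, \qquad \sigma_G^2 = \frac{\mu_G^2}{n^2}\sum_{i=1}^{n} \frac{\sigma_{R_i}^2}{\mu_{R_i}^2}$$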
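The same first-order reasoning for the geometric mean can be sketched in Python. The sketch takes the per-workload means and standard deviations of the improvement ratios as inputs and assumes the ratios of different workloads are independent. The function name is ours.

```python
import math


def geomean_stats(ratio_means, ratio_sds):
    """Approximate mean and standard deviation of the geometric mean."""
    n = len(ratio_means)
    # Geometric mean of the per-workload mean ratios, computed via logs.
    mu_g = math.exp(sum(math.log(m) for m in ratio_means) / n)
    # Relative variances add up and are then scaled by 1/n^2.
    rel_var = sum((s / m) ** 2 for m, s in zip(ratio_means, ratio_sds))
    sd_g = mu_g * math.sqrt(rel_var) / n
    return mu_g, sd_g
```

Averaging over many workloads shrinks the standard deviation roughly by a factor of $\sqrt{n}$ relative to a single workload with comparable relative variation, which is why the geometric mean is a more stable basis for comparing policies.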
Figure 7 shows the geometric means of improvement ratios (the cross markers) calculated using mean metric values in Figure 3 for each metric when comparing each pair of replacement policies on both ACC and 1-IPC models. The standard deviation of the normal distribution of each geometric mean on the ACC model is illustrated by the distance from the horizontal bar to the cross marker in Figure 7. As we can see from the figure, the differences between the 1-IPC geometric means and the corresponding ACC values are of the same order of magnitude as the standard deviations of the ACC model, and sometimes the 1-IPC results even fall into the variation ranges of the ACC model. These observations all imply that the error induced by using the 1-IPC model is comparable to the variation of the ACC model.
Summary: We first showed that using the 1-IPC model instead of the ACC model did not significantly change the characteristics of the workloads. We then illustrated that the 1-IPC model could give qualitatively accurate results in the evaluation of different policies when the improvement is unambiguous. Furthermore, we demonstrated that the quantitative impact of using the 1-IPC model to compare LLC policies was of the same order as the impact of run-to-run variation when the cycle-accurate experiment cannot be run multiple times.
VI Study 2: Scalability of Multithread Benchmarks
In this section, we evaluate the scalability of several multithread benchmarks using ACC and 1-IPC models with up to 16 cores.
VI-A Simulator Improvement and Settings
We implemented 1-, 2-, 4-, 8-, and 16-core systems with 512KB, 1MB, 2MB, 4MB and 8MB L2 caches respectively. All other parameters for the core model and main memory are the same as in Table I.
Each of these simulators was implemented on a single VC707 board. Implementing systems larger than four cores within the Arete framework is a challenge, because Arete translates each hardware module in the processor into an LI-BDN node, which takes multiple cycles to simulate one cycle of the original module. Therefore, when Arete models a multicore system, it replicates the LI-BDN nodes of the core model once for each core. Due to the resource constraints of the VC707 board, it is impossible to fit more than four core models on a single FPGA. The solution Arete employs is to map the design to a multi-FPGA board. Unfortunately, we do not have a multi-FPGA board. Moreover, handling the hardware for inter-FPGA communication requires considerable engineering work, and the inter-board communication latency ends up being high.
Our solution is to apply fine-grained time-division multiplexing to Arete’s LI-BDNs. Namely, one LI-BDN node models the same circuit module in multiple cores by simulating the functionality of each core’s module one by one. The design avoids deadlock because there are no combinational paths between cores in the target architecture. In this way, we save FPGA resources because the logic that models functionality inside the core is reused. What we do have to add for time-division multiplexing is a register per core for each piece of state in the target architecture, such as the PC and the registers in the pipeline FIFOs: a single state register is expanded to a vector of registers, each of which holds that state for one core.
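The idea of time-division multiplexing can be illustrated with a small software sketch. This is not Arete's implementation: `CoreState`, `fetch_step`, and `MultiplexedNode` are hypothetical names, and the "module" is reduced to a trivial PC increment. The sketch only shows one shared piece of logic stepping a vector of per-core state, at the cost of one host step per core for every target cycle.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CoreState:
    """Per-core architectural state (hypothetical, reduced to two fields)."""
    pc: int = 0
    retired: int = 0


def fetch_step(state: CoreState) -> CoreState:
    """Shared module logic (hypothetical): advance one core by one target cycle."""
    return CoreState(pc=state.pc + 4, retired=state.retired + 1)


class MultiplexedNode:
    """One node that models the same module for every core, one core at a time."""

    def __init__(self, num_cores: int) -> None:
        # A single state register becomes a vector with one entry per core.
        self.states: List[CoreState] = [CoreState() for _ in range(num_cores)]
        self.host_steps = 0

    def simulate_target_cycle(self) -> None:
        # The same logic is reused for each core in turn, so logic is shared
        # while the number of host steps grows with the number of cores.
        for core_id in range(len(self.states)):
            self.states[core_id] = fetch_step(self.states[core_id])
            self.host_steps += 1


node = MultiplexedNode(num_cores=4)
for _ in range(3):
    node.simulate_target_cycle()
print([s.pc for s in node.states], node.host_steps)
```

The trade-off this sketch makes visible is the one described above: the logic is instantiated once, but state storage and simulation time both scale with the number of cores.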
VI-B Multithread Benchmarks
| Benchmark suite | Benchmark name |
| --- | --- |
| PARSEC-3.0 | blackscholes, canneal, fluidanimate, streamcluster |
VI-C Results and Analysis
We ran each of the 6 benchmarks to completion with 1, 2, 4, 8 and 16 cores, and we measured the execution time to calculate the speedup normalized to single-core performance for each model. Figure 8 shows the scalability of each benchmark measured from the ACC core model and the 1-IPC core model.
We observe that almost all of the scalability curves for the 1-IPC model match the corresponding curves from the ACC model perfectly, except for the fluidanimate benchmark running on 16 cores. We have not yet found the reason for this single mismatch and are still investigating it.
One interesting observation is that, in our experiment, the 1-IPC model accurately captures the scalability of the water_nsquare benchmark, while Carlson et al. report that the 1-IPC model fails to capture the scalability of this benchmark on multicore systems with out-of-order cores. We believe this is because our ACC core model is an in-order pipeline, which differs significantly from an out-of-order core.
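The speedup normalization described above amounts to dividing the single-core execution time by the execution time at each core count. A minimal sketch follows; the function name `speedups` and the execution times are made up for illustration and are not measurements from this study.

```python
from typing import Dict


def speedups(exec_times: Dict[int, float]) -> Dict[int, float]:
    """Map core count -> speedup relative to the single-core execution time."""
    baseline = exec_times[1]
    return {cores: baseline / t for cores, t in sorted(exec_times.items())}


# Hypothetical execution times in seconds, keyed by number of cores.
example_times = {1: 80.0, 2: 40.0, 4: 20.0, 8: 16.0, 16: 10.0}
curve = speedups(example_times)
print(curve)
```

Computing one such curve per benchmark and per core model gives the pairs of curves that a scalability comparison would overlay.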
VII Study 3: Branch Predictor
In this section, we evaluate the following three branch predictors on a single core using the 1-IPC and ACC models:
the tournament branch predictor from the Alpha 21264;
the path-based neural branch predictor with ahead pipelining; and
The storage used by each of the three branch predictors is around 4KB.
VII-A Benchmarks and Measurement
We evaluate these three branch predictors on a single core with a 512KB L2 cache using 5 benchmarks from the SPEC CINT2006 benchmark suite that can incur high misprediction rates, as shown in Table VIII. All benchmarks are run to completion with the test input, and we measure the misprediction rate only for conditional branches (excluding indirect jumps).
| Benchmark suite | Benchmark name |
| --- | --- |
| SPEC CINT2006 | gobmk, hmmer, mcf, omnetpp, sjeng |
VII-B Results and Analysis
Figure 9 shows the misprediction rates for all three branch predictors measured on the ACC and 1-IPC models. The results from the 1-IPC model match those from the ACC model extremely well. We believe the similarity arises because there are only three pipeline stages between branch prediction and resolution, so the probability of multiple outstanding unresolved branches is quite low. Therefore, the 1-IPC model, which resolves a branch in the same cycle in which it makes the prediction, is not significantly different from the ACC model in terms of training the branch predictor.
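The effect of training delay can be illustrated with a toy experiment. The sketch below is not one of the three predictors evaluated here: it uses a simple table of 2-bit saturating counters, a synthetic loop-branch trace, and a hypothetical parameter `update_delay` that counts how many later branches are predicted before an earlier outcome trains the predictor. A delay of 0 loosely mimics immediate resolution, and a small positive delay loosely mimics a short pipeline between prediction and resolution.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

Branch = Tuple[int, bool]  # (branch PC, taken?)


def misprediction_rate(trace: List[Branch], update_delay: int) -> float:
    """Misprediction rate of a 2-bit-counter predictor with delayed training."""
    counters: Dict[int, int] = {}   # PC -> counter in 0..3, default weakly taken
    in_flight: Deque[Branch] = deque()
    mispredictions = 0
    for pc, taken in trace:
        # Train with every outcome except the `update_delay` most recent ones,
        # which are treated as still unresolved.
        while len(in_flight) > update_delay:
            old_pc, old_taken = in_flight.popleft()
            c = counters.get(old_pc, 2)
            counters[old_pc] = min(3, c + 1) if old_taken else max(0, c - 1)
        predicted_taken = counters.get(pc, 2) >= 2
        if predicted_taken != taken:
            mispredictions += 1
        in_flight.append((pc, taken))
    return mispredictions / len(trace)


# Synthetic trace: one loop branch, taken nine times and then not taken, repeated.
loop_trace = [(0x40, i % 10 != 9) for i in range(1000)]
immediate = misprediction_rate(loop_trace, update_delay=0)
delayed = misprediction_rate(loop_trace, update_delay=3)
print(immediate, delayed)
```

On this strongly biased trace the two settings mispredict the same branches; traces whose outcomes depend on very recent history could behave differently, so the sketch illustrates the argument rather than proving it.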
Simplifications are often used in computer architecture simulation to reduce simulation time. By running a cycle-accurate full-system simulator side-by-side with a version of the same simulator that uses a 1-IPC core model, we are able to isolate and measure the effects of the 1-IPC core model simplification. We find that, although the 1-IPC model does not report accurate absolute metric values, the relative behavior of the 1-IPC model matches that of the ACC model.
First, by normalizing metric results across 15 workloads for three LLC replacement policies, we showed that the 1-IPC core model does not distort the characteristics of the results across the workloads for each replacement policy. Second, by further exploring the LLC experiment results, we showed that using the 1-IPC core model to compare replacement policies gave the correct comparison most of the time, and that incorrect comparisons were often due to brittle policies. Third, by running multicore benchmarks on various numbers of cores, we showed that the 1-IPC model matches the scaling trends displayed by the ACC model. Finally, by comparing three branch predictors across the two models, we showed that the 1-IPC model matched the branch prediction accuracy of the ACC model.
We find that the simplified 1-IPC core model is useful for producing qualitative comparisons between architectural configurations, but this is not a suggestion to ignore cycle-accurate models in favor of simplified 1-IPC models. It is merely an invitation to design simplified core models in parallel with cycle-accurate models, and to prove their usefulness before switching to the simplified core model for experiments.
We thank Asif Khan for his pioneering work in studying the problem of simulation accuracy, and his great help in using the Arete simulator.
-  A. Khan, M. Vijayaraghavan, S. Boyd-Wickizer, et al., “Fast and cycle-accurate modeling of a multicore processor,” in Performance Analysis of Systems and Software (ISPASS), 2012 IEEE International Symposium on, pp. 178–187, IEEE, 2012.
-  M. M. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood, “Multifacet’s general execution-driven multiprocessor simulator (gems) toolset,” ACM SIGARCH Computer Architecture News, vol. 33, no. 4, pp. 92–99, 2005.
-  N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt, “The m5 simulator: Modeling networked systems,” IEEE Micro, vol. 26, no. 4, pp. 52–60, 2006.
-  A. Patel, F. Afram, S. Chen, and K. Ghose, “Marss: a full system simulator for multicore x86 cpus,” in Proceedings of the 48th Design Automation Conference, pp. 1050–1055, ACM, 2011.
-  E. Argollo, A. Falcón, P. Faraboschi, M. Monchiero, and D. Ortega, “Cotson: infrastructure for full system simulation,” ACM SIGOPS Operating Systems Review, vol. 43, no. 1, pp. 52–61, 2009.
-  A. Jaleel, R. S. Cohn, C.-K. Luk, and B. Jacob, “Cmp$im: A pin-based on-the-fly multi-core cache simulator,” in Proceedings of the Fourth Annual Workshop on Modeling, Benchmarking and Simulation (MoBS), co-located with ISCA, pp. 28–36, 2008.
-  J. E. Miller, H. Kasture, G. Kurian, C. Gruenwald, N. Beckmann, C. Celio, J. Eastep, and A. Agarwal, “Graphite: A distributed parallel simulator for multicores,” in High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on, pp. 1–12, IEEE, 2010.
-  T. E. Carlson, W. Heirman, and L. Eeckhout, “Sniper: exploring the level of abstraction for scalable and accurate parallel multi-core simulation,” in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, p. 52, ACM, 2011.
-  D. Sanchez and C. Kozyrakis, “Zsim: fast and accurate microarchitectural simulation of thousand-core systems,” in Proceedings of the 40th Annual International Symposium on Computer Architecture, pp. 475–486, ACM, 2013.
-  C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, “Pin: building customized program analysis tools with dynamic instrumentation,” ACM Sigplan Notices, vol. 40, no. 6, pp. 190–200, 2005.
-  D. Genbrugge, S. Eyerman, and L. Eeckhout, “Interval simulation: Raising the level of abstraction in architectural simulation,” in High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on, pp. 1–12, IEEE, 2010.
-  T. Nowatzki, J. Menon, C.-H. Ho, and K. Sankaralingam, “gem5, gpgpusim, mcpat, gpuwattch, ‘your favorite simulator here’ considered harmful,” in 11th Annual Workshop on Duplicating, Deconstructing and Debunking, 2014.
-  J. Gibson, R. Kunz, D. Ofelt, M. Horowitz, J. Hennessy, and M. Heinrich, “Flash vs.(simulated) flash: Closing the simulation loop,” in ACM SIGARCH Computer Architecture News, vol. 28, pp. 49–58, ACM, 2000.
-  R. Desikan, D. Burger, and S. W. Keckler, “Measuring experimental error in microprocessor simulation,” in Proceedings of the 28th annual international symposium on Computer architecture, pp. 266–277, ACM, 2001.
-  H. W. Cain, K. M. Lepak, B. A. Schwartz, and M. H. Lipasti, “Precise and accurate processor simulation,” in Workshop on Computer Architecture Evaluation using Commercial Workloads, HPCA, vol. 8, 2002.
-  O. Mutlu, H. Kim, D. N. Armstrong, and Y. N. Patt, “Understanding the effects of wrong-path memory references on processor performance,” in Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture, pp. 56–64, ACM, 2004.
-  R. Sendag, A. Yilmazer, J. J. Yi, and A. K. Uht, “Quantifying and reducing the effects of wrong-path memory references in cache-coherent multiprocessor systems,” in Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International, pp. 10–pp, IEEE, 2006.
-  J. J. Yi, S. V. Kodakara, R. Sendag, D. J. Lilja, and D. M. Hawkins, “Characterizing and comparing prevailing simulation techniques,” in High-Performance Computer Architecture, 2005. HPCA-11. 11th International Symposium on, pp. 266–277, IEEE, 2005.
-  T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, “Automatically characterizing large scale program behavior,” ACM SIGARCH Computer Architecture News, vol. 30, no. 5, pp. 45–57, 2002.
-  R. E. Wunderlich, T. F. Wenisch, B. Falsafi, and J. C. Hoe, “Smarts: Accelerating microarchitecture simulation via rigorous statistical sampling,” in Computer Architecture, 2003. Proceedings. 30th Annual International Symposium on, pp. 84–95, IEEE, 2003.
-  M. Vijayaraghavan et al., “Bounded dataflow networks and latency-insensitive circuits,” in Formal Methods and Models for Co-Design, 2009. MEMOCODE’09. 7th IEEE/ACM International Conference on, pp. 171–180, IEEE, 2009.
-  R. E. Kessler, E. J. McLellan, and D. A. Webb, “The alpha 21264 microprocessor architecture,” in Computer Design: VLSI in Computers and Processors, 1998. ICCD’98. Proceedings. International Conference on, pp. 90–95, IEEE, 1998.
-  A. Jaleel, W. Hasenplaugh, M. Qureshi, J. Sebot, S. Steely Jr, and J. Emer, “Adaptive insertion policies for managing shared caches,” in Proceedings of the 17th international conference on Parallel architectures and compilation techniques, pp. 208–219, ACM, 2008.
-  M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer, “Adaptive insertion policies for high performance caching,” in ACM SIGARCH Computer Architecture News, vol. 35, pp. 381–391, ACM, 2007.
-  A. Jaleel, K. B. Theobald, S. C. Steely Jr, and J. Emer, “High performance cache replacement using re-reference interval prediction (rrip),” in ACM SIGARCH Computer Architecture News, vol. 38, pp. 60–71, ACM, 2010.
-  “Data-intensive systems stressmark suite.” http://www.ics.uci.edu/~amrm/hdu/DIS_Stressmark/DIS_stressmark.html.
-  J. D. McCalpin, “A survey of memory bandwidth and machine balance in current high performance computers,” IEEE TCCA Newsletter, pp. 19–25, 1995.
-  V. H. Franz, “Ratios: A short guide to confidence limits and proper use,” arXiv preprint arXiv:0710.2024, 2007.
-  M. Pellauer, M. Adler, M. Kinsy, A. Parashar, and J. Emer, “Hasim: Fpga-based high-detail multicore simulation using time-division multiplexing,” in High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, pp. 406–417, IEEE, 2011.
-  C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The parsec benchmark suite: Characterization and architectural implications,” in Proceedings of the 17th international conference on Parallel architectures and compilation techniques, pp. 72–81, ACM, 2008.
-  S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, “The splash-2 programs: Characterization and methodological considerations,” in ACM SIGARCH Computer Architecture News, vol. 23, pp. 24–36, ACM, 1995.
-  “Splash-2x benchmark suite.” http://parsec.cs.princeton.edu/parsec3-doc.htm#splash2x.
-  D. A. Jiménez, “Fast path-based neural branch prediction,” in Microarchitecture, 2003. MICRO-36. Proceedings. 36th Annual IEEE/ACM International Symposium on, pp. 243–252, IEEE, 2003.
-  A. Seznec and P. Michaud, “A case for (partially)-tagged geometric history length predictors,” Journal of Instruction-Level Parallelism (JILP), vol. 8, 2006.
-  A. Seznec, “Tage-sc-l branch predictors.” http://www.jilp.org/cbp2014/code/AndreSeznec.tar.gz, 2014.