Memory Controller Design Under Cloud Workloads

11/30/2016
by Mostafa Mahmoud, et al.
UNIVERSITY OF TORONTO

This work studies the behavior of state-of-the-art memory controller designs when executing scale-out workloads. It considers memory scheduling techniques, memory page management policies, the number of memory channels, and the address mapping scheme used. Experimental measurements demonstrate that: 1) several recently proposed memory scheduling policies are not a good match for these scale-out workloads; 2) the relatively simple First-Ready First-Come-First-Served (FR-FCFS) policy performs consistently better; 3) for most of the studied workloads, the even simpler First-Come-First-Served scheduling policy is within 1% of FR-FCFS; 4) increasing the number of memory channels offers negligible performance benefits, e.g., performance improves by only 1.7% on average for 4 channels vs. 1 channel; and 5) 77%-90% of DRAM row activations are accessed only once before closure. These observations can guide future development and optimization of memory controllers for scale-out workloads.


1 Introduction

Today there is an increasing demand for cloud services such as media streaming, social networks and search engines. Cloud service providers have been expanding their computing infrastructure, adding more data centers comprising many computing systems. The performance and power consumption of these systems dictate how capable these data centers are. Accordingly, improving the performance and power of these systems can greatly improve overall data center efficiency. The first step in improving these systems is understanding their behavior in order to identify inefficiencies and opportunities for improvement.

The majority of modern server processors are built using the same microarchitecture as consumer designs, with minor tweaks. There is already evidence suggesting that these architectures are suboptimal and that simpler processors tend to perform better for scale-out workloads [1, 2, 3]. Ferdman et al. performed an extensive study and instrumentation of CloudSuite, a representative set of cloud applications [4], and discovered significant over-provisioning, at the level of microarchitectural structures and the cache and memory hierarchy, in widely used server processors. Their study focused on microarchitectural characteristics and identified several inefficiencies including long instruction-fetch stalls, under-utilized instruction-level parallelism (ILP) and memory-level parallelism (MLP) structures, last-level cache (LLC) ineffectiveness, an over-sized L2 cache, reorder buffer and load-store queue under-utilization, and off-chip bandwidth over-provisioning. Our work builds upon this past study and focuses on the memory controller, a component that has not yet been covered in detail.

Specifically, this work studies the implications of scale-out workload characteristics on memory controller design aspects including memory scheduling techniques, DRAM page management policies, the number of memory channels, and the indexing scheme used to access memory channels. Over the last decade, researchers and processor vendors have been developing increasingly sophisticated on-chip memory controllers that consider a wide range of system status attributes in scheduling decisions [5, 6, 7, 8, 9, 10]. Furthermore, there has been a trend to increase the number of memory channels, with four on-chip channels being available on high-end designs [11, 12]. Increasing the number of memory channels offers more parallelism and allows more memory to be connected to the same processor chip.

This work characterizes the performance of several state-of-the-art memory controller designs when running several modern server workloads. The emphasis is on the CloudSuite benchmarks, which are the best workloads available to us for representing modern cloud applications. This study also considers traditional server workloads from SPECweb99, online transaction processing (OLTP) and decision support applications for completeness and comparison purposes. The following memory scheduling algorithms are studied: ATLAS [6], PAR-BS [7] and Reinforcement Learning [10]. Section 2.1 reviews these policies.

The following key conclusions are drawn:

  • The First-Ready First-Come-First-Served (FR-FCFS) algorithm [5] outperforms the other state-of-the-art memory scheduling algorithms for all the studied server workload categories. These other techniques were designed for desktop and scientific applications with abundant memory level parallelism (MLP) or for multi-programmed heterogeneous memory-intensity workloads. The server workloads studied do not exhibit these characteristics, and thus memory controller design ought to be revisited to better match these workloads.

  • For five out of the six studied scale-out workloads, the performance of the simpler First-Come-First-Served (FCFS) memory scheduling algorithm is within 1% of the baseline FR-FCFS. Accordingly, using this simpler policy may be best in some cases.

  • Out of all DRAM row activations, 77%-90% receive only a single access before being forced to close. Leaving a row open until a conflict happens increases memory access latency for subsequent requests, whereas proactively closing a row that will not receive any further hits reduces it. Accordingly, smarter page management policies may improve overall memory access latency.

  • State-of-the-art page management policies degrade performance by 4% for scale-out workloads but improve performance by 3% for decision support workloads.

  • A multi-channel memory controller does not improve the performance of scale-out workloads but does benefit decision support workloads. Specifically, decision support workload performance improves on average by 19% on a 4-channel system compared to a single-channel one.

Overall, the results of this study emphasize the unique off-chip memory access demands of scale-out workloads and suggest that simpler memory controllers specialized for cloud workloads may be beneficial. A limitation of this study is that it does not directly consider energy and power consumption, focusing primarily on performance. Future work may address this limitation. However, the results of this work remain valuable as the techniques that perform best are also the simplest to implement and hence would likely also reduce overall energy and power consumption. The rest of this paper is organized as follows. Section 2 reviews state-of-the-art memory controller designs including memory scheduling techniques and page management policies. Section 3 presents the experimental methodology. Section 4 reports our findings. Section 5 discusses some of the limitations of this study. Section 6 summarizes related work and Section 7 concludes.

2 Background

This section reviews the two core memory controller components that this study characterizes: the scheduling algorithm (Section 2.1) and the memory page management policy (Section 2.2). The former decides which memory request to service next, while the latter decides, after servicing a request, whether to keep the row open or precharge it, indirectly affecting the service time of subsequent requests.

2.1 Memory Scheduling Algorithms

A memory scheduling algorithm (MSA) specifies how the memory controller decides the next request to service given a pool of outstanding requests. The MSA’s inputs are at the very least the state of the DRAM banks and buses, and a pool of waiting requests. Additional statistics or system attributes may also be used to guide the decision making. MSAs can target different objectives such as memory throughput or fairness among running threads. The MSA affects the waiting time each request experiences before being serviced. Since main memory is a shared resource, requests from all cores contend and the MSA affects overall system and individual thread performance.

The FCFS scheduling algorithm services requests in the order they arrive at the memory controller. FCFS does not exploit row-buffer locality as it does not try to reorder requests to increase the row-buffer hit rate. If cores generate memory requests at different rates, cores with a low memory-intensity access stream can suffer from starvation and long memory access latencies. FCFS is the simplest memory scheduling technique and its hardware overhead and power consumption are the lowest among those we study. We evaluate FCFS_banks, a variant of FCFS that maintains separate per-bank request queues and thus can exploit bank-level parallelism.

FR-FCFS [5] separates requests into two groups depending on whether they will hit in the currently open row, and further, within each group it orders requests according to age. It prioritizes hits over misses, and older over younger requests. FR-FCFS’s goal is to increase memory throughput. FR-FCFS can lead to starvation and low overall throughput when there is imbalance in the row-buffer locality of the access streams across cores [13, 9, 7].
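
To make the selection rule concrete, the following is a minimal Python sketch of an FR-FCFS pick under simplifying assumptions (a single shared queue, one open row tracked per bank, and no DRAM timing constraints); the Request record and helper names are illustrative and not taken from any real controller implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Request:
    arrival: int          # arrival time at the controller, used for age ordering
    bank: int
    row: int

def frfcfs_pick(queue: List[Request],
                open_rows: Dict[int, Optional[int]]) -> Optional[Request]:
    """FR-FCFS: among outstanding requests, prefer the oldest one that hits the
    currently open row of its bank; otherwise fall back to the oldest request."""
    hits = [r for r in queue if open_rows.get(r.bank) == r.row]
    pool = hits if hits else queue
    return min(pool, key=lambda r: r.arrival) if pool else None

# Example: bank 0 has row 7 open, so the younger row-7 request is served first.
queue = [Request(arrival=10, bank=0, row=3), Request(arrival=12, bank=0, row=7)]
print(frfcfs_pick(queue, open_rows={0: 7}))   # Request(arrival=12, bank=0, row=7)
```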

Parallelism-Aware Batch Scheduling (PAR-BS) [7] targets maintaining fairness among cores and reducing the average stall time through Batching and Ranking. Batching groups the oldest requests from each core into a batch that is prioritized over the remaining requests. After batch formation, the cores within the batch are ranked using a shortest-job first criterion where the core with the minimum number of requests to any bank is considered the core with the shortest job. Ranking minimizes the average waiting time across all cores.
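
A rough sketch of the batching and ranking steps follows, assuming a fixed per-core marking cap (the Batching-Cap of 5 used later in Table III) and ignoring DRAM state; the record and function names are illustrative only.

```python
from collections import Counter, namedtuple

Req = namedtuple("Req", "arrival core bank row")   # simplified request record

def form_batch(queue, cap=5):
    """Mark up to `cap` of the oldest outstanding requests from each core;
    the resulting batch is prioritized over all unmarked requests."""
    batch, per_core = [], Counter()
    for r in sorted(queue, key=lambda r: r.arrival):
        if per_core[r.core] < cap:
            batch.append(r)
            per_core[r.core] += 1
    return batch

def rank_cores(batch):
    """Shortest-job-first ranking: a core's job length is its maximum number of
    marked requests to any single bank; the core with the fewest ranks highest."""
    per_core_bank = Counter((r.core, r.bank) for r in batch)
    job = Counter()
    for (core, bank), n in per_core_bank.items():
        job[core] = max(job[core], n)
    return sorted(job, key=lambda c: job[c])

reqs = [Req(0, 0, 0, 1), Req(1, 0, 0, 2), Req(2, 1, 1, 3)]
print(rank_cores(form_batch(reqs)))   # [1, 0]: core 1 has the shorter job
```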

Adaptive per-Thread Least-Attained-Service memory scheduling (ATLAS) [6] is based on the observation that cores that require less memory service time are more vulnerable to interference from cores that require more. ATLAS divides time into 10M-cycle quanta, tracking the total memory service time used by each core during each quantum as follows: every cycle, the attained service time (ATS) of a core is increased by the number of banks servicing its requests. At the start of the subsequent quantum, the memory controller ranks the cores according to their ATS, ranking those with less ATS higher. This policy exploits bank-level parallelism and reduces average latency across cores. ATLAS prevents starvation by prioritizing requests that have been waiting for more than a threshold number of cycles.
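
The quantum-based bookkeeping can be sketched as follows; this is an illustrative simplification that keeps only the attained-service counters and the per-quantum ranking, with the quantum length and bias factor taken from Table III and the exponential averaging only approximating the ATLAS formulation.

```python
from collections import defaultdict

class AtlasRanker:
    """Track per-core attained service time (ATS) and re-rank cores at each
    quantum boundary, least-attained-service first (illustrative sketch)."""

    def __init__(self, quantum=10_000_000, bias=0.875):
        self.quantum = quantum
        self.bias = bias                  # exponential-averaging coefficient (Table III)
        self.total_as = defaultdict(float)
        self.quantum_as = defaultdict(int)
        self.cycle = 0
        self.rank = []                    # cores ordered best (least ATS) first

    def tick(self, banks_serving):
        """banks_serving maps core id -> number of banks servicing it this cycle."""
        for core, n in banks_serving.items():
            self.quantum_as[core] += n
        self.cycle += 1
        if self.cycle % self.quantum == 0:
            for core in set(self.total_as) | set(self.quantum_as):
                self.total_as[core] = (self.bias * self.total_as[core]
                                       + (1 - self.bias) * self.quantum_as[core])
            self.quantum_as.clear()
            self.rank = sorted(self.total_as, key=lambda c: self.total_as[c])
```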

The Reinforcement Learning-based memory scheduling algorithm (RL) [10] uses reinforcement learning [14] to build a self-optimizing memory scheduler. RL uses a number of attributes to represent the current system state s, such as the number of reads in the request queue, the number of writes and the number of load requests. The action a that it can perform in a given cycle is precharge, activate, write, read for a load miss, read for a store miss, or no-action. For every possible state-action pair (s, a), RL maintains a Q-value that represents the expected future reward of executing action a in state s. Given the current state, RL's goal is to maximize future rewards by choosing the action with the highest Q-value. RL uses an area- and energy-efficient implementation of this reinforcement learning algorithm. With a preset probability, the algorithm executes a random action instead of the one with the highest Q-value in order to explore new areas of the search space.
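
The epsilon-greedy action selection and a textbook one-step value update are sketched below using the parameters from Table III; note that this is a toy dictionary-based stand-in for the hardware Q-value tables of [10], not the area-efficient implementation described there.

```python
import random

ACTIONS = ["precharge", "activate", "write", "read_load_miss",
           "read_store_miss", "no_action"]

def select_action(q_table, state, epsilon=0.05):
    """With probability epsilon take a random exploratory action, otherwise
    the action with the highest Q-value for the current state."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table.get((state, a), 0.0))

def update_q(q_table, state, action, reward, next_state, lr=0.1, gamma=0.95):
    """One-step temporal-difference update with the learning and discount
    rates listed in Table III."""
    best_next = max(q_table.get((next_state, a), 0.0) for a in ACTIONS)
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + lr * (reward + gamma * best_next - old)
```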

The Fair Queuing Memory Scheduler (FQM) [8] is a memory scheduling algorithm based on a computer network fair queuing algorithm. FQM’s goal is to provide equal memory bandwidth to all cores. In FQM, each bank in the DRAM keeps a virtual time counter per core that is increased every time a request from the corresponding core is serviced by that bank. Each bank prioritizes the core with the earliest virtual time since this is the core that got the least service from that bank. Later proposals outperformed FQM [7, 9] as the latter does not consider the long term intensity of the memory access stream per core and does not attempt to maximize row hits. For this reason, we do not consider it further.

2.2 Page Management Policies

The page management policy decides for how long a memory row, or page, should remain open. One such policy is the open-page management policy (O), under which a row is closed only when another request forces it to close. The O policy tends to be a good match for single-core systems whose memory access stream often exhibits good spatial locality. This way, subsequent requests hit in the row-buffer and thus experience lower latency.

The memory controller of many-core systems observes the interleaving of several access streams and as such may not exhibit sufficient spatial locality. In these systems, row hits are less likely and waiting to close the page only when necessary ends up adding to the latency of the next request. In such a case, it is best to close the page as soon as possible [15]. Accordingly, the close-page management policy (C) immediately closes a row after accessing it.

The C policy is not free of trade-offs. It suffers excess delays and wasted power when there is some locality in the interleaved access streams. Adaptive page policies such as open-adaptive (OA) and close-adaptive (CA) have been proposed to improve upon O and C. The OA policy closes the row only when: 1) there are no more pending requests in the controller's queue that would hit in the open row, and 2) there are pending requests to another row. The CA policy closes a row as soon as there are no pending requests that would hit in the same row. Thus, OA speculates that near-future requests are likely to be hits.
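
The four policies reduce to a small precharge decision, sketched below for a single bank; `pending_rows` stands in for the row ids of requests still queued for that bank and is an assumed abstraction, not the controller's actual interface.

```python
def should_close_row(policy: str, open_row: int, pending_rows: list) -> bool:
    """Decide whether to precharge the bank after servicing a request."""
    has_hit = open_row in pending_rows
    has_other = any(r != open_row for r in pending_rows)
    if policy == "O":    # open-page: close only when a conflicting request forces it
        return False
    if policy == "C":    # close-page: always precharge immediately
        return True
    if policy == "OA":   # open-adaptive: close only if no hits remain AND another row waits
        return (not has_hit) and has_other
    if policy == "CA":   # close-adaptive: close as soon as no hits remain
        return not has_hit
    raise ValueError(f"unknown policy {policy}")
```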

Neither policy is always best. Accordingly, later work proposes switching between the two on-the-fly. Awasthi et al. introduced the Access-Based Page Policy (ABPP) for multi-core systems [16]. ABPP assumes that a row will receive the same number of hits as the last time it was activated. The implementation uses per-bank tables recording the most recently accessed rows and the number of hits they received last time. The tables are used to predict how long a row should stay open and are dynamically updated as the program executes. In the absence of a table entry, a row stays open until a conflict forces it to close.
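
A toy version of the per-bank history table that this kind of prediction relies on is sketched below; the table size and the LRU-style replacement are assumptions made for the sketch, not details from [16].

```python
from collections import OrderedDict
from typing import Optional

class AbppBankTable:
    """Per-bank table mapping recently activated rows to the number of hits
    they received during their last activation (illustrative sketch)."""

    def __init__(self, entries: int = 16):
        self.entries = entries
        self.table = OrderedDict()   # row id -> hits seen on its last activation

    def predicted_hits(self, row: int) -> Optional[int]:
        """Expected hits before closing; None means no history, so the row
        stays open until a conflict forces closure."""
        return self.table.get(row)

    def record(self, row: int, hits: int) -> None:
        """Update the history when the row is precharged."""
        self.table[row] = hits
        self.table.move_to_end(row)
        if len(self.table) > self.entries:
            self.table.popitem(last=False)   # evict the least-recently-touched row
```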

Shen et al. proposed the Row-Based Page Policy (RBPP) which has lower hardware overhead than ABPP [15]. RBPP exploits the observation that in a short period of time most of the memory requests are for a small number of rows even in a many-core system with different applications running on each core. RBPP uses a few most-accessed-row registers (MARR) per bank recording the number of hits received by recently accessed rows that have received at least one hit.

Timer-based page closure precedes RBPP and ABPP [17, 18] and predicts how long a row should stay open. The various timer-based policies differ in the granularity of the timer, e.g., some maintain a timer per bank while others maintain a global timer. Other approaches used a branch-prediction-like two-level predictor to decide whether to close an open row-buffer [19, 20]. We opted not to include these proposals in our study since they were proposed for single-core systems and would have to be adapted for many-core systems; moreover, RBPP and ABPP outperformed them [15].

3 Methodology

3.1 Workloads

Our study focuses on the scale-out server workloads represented by the CloudSuite benchmark suite [4]. CloudSuite includes Data Serving, MapReduce, SAT Solver, Web Frontend, Web Search and Media Streaming applications. To evaluate and compare these emerging workloads against traditional server workloads, we extended our experiments to include two traditional workload categories: 1) transactional workloads, including SPECweb99 and TPC-C (we ran TPC-C on two commercial database management systems), and 2) decision support workloads, represented by three TPC-H queries: Q2, Q6 and Q17. The three queries cover a wide range of select-intensive, join-intensive and select-join queries.

For all the workloads, we use the same benchmark configurations used by Ferdman et al. [4]. Table I reports the workloads, categories and acronyms we use in the rest of this paper.

Category (Acronym)       Workload              Workload Acronym
Scale-out (SCO)          Data Serving          DS
                         MapReduce             MR
                         SAT Solver            SS
                         Web Frontend          WF
                         Web Search            WS
                         Media Streaming       MS
Transactional (TRS)      SPECweb99             WSPEC99
                         TPC-C1 (vendor A)     TPC-C1
                         TPC-C2 (vendor B)     TPC-C2
Decision Support (DSP)   TPC-H Q2              TPCH-Q2
                         TPC-H Q6              TPCH-Q6
                         TPC-H Q17             TPCH-Q17
TABLE I: Categorized Workloads and Abbreviations

3.2 Simulation

This study used the Virtutech Simics functional simulator to model a 16-core chip multiprocessor (CMP) system unless otherwise noted. The GEMS simulator [21] was used to extend Simics with an on-chip network timing model and a Ruby-based full memory hierarchy including the memory controller and off-chip DRAM timing models. The cores run the SPARC v9 ISA.

We follow the SimFlex multiprocessor sampling technique for our simulation experiments [22]. The samples are taken over a 10-second simulated interval of each application. Each sample ran for six billion user-level instructions, where the first one billion user-level instructions were used to warm up the system, e.g., the caches, memory queues, network buffers, and interconnects. Statistics were collected for the subsequent five billion user-level instructions. TPC-H queries Q2 and Q17 were run to completion for a total run length of roughly 2.5 billion user-level instructions, one billion of which were used for warm up. We simulate both user-level and operating-system-level instructions but use the user-level instruction count divided by the total simulated cycle count as an indicator of overall system performance, a metric found by Wenisch et al. to accurately represent system throughput [22]. Measured performance metrics include committed user-level instructions per cycle (user IPC), average memory access latency, DRAM row-buffer hit rate, LLC misses per kilo instructions (L2 MPKI), and memory bandwidth utilization. The Web Frontend benchmark uses only 8 cores in the configuration that was available to us.
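
For clarity, the throughput metric reduces to a simple ratio; the figures in the usage line below are made-up numbers for illustration only.

```python
def user_ipc(user_instructions: int, total_cycles: int) -> float:
    """Committed user-level instructions divided by all simulated cycles
    (user plus OS), the system-throughput proxy from Wenisch et al. [22]."""
    return user_instructions / total_cycles

# Hypothetical example: 5 billion measured user instructions over
# 12.5 billion total cycles give a user IPC of 0.4.
print(user_ipc(5_000_000_000, 12_500_000_000))   # 0.4
```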

3.3 Baseline System Configuration

The baseline system is a 16-core in-order CMP with only two levels of on-chip caches based on the scale-out processor design recommendations proposed by Lotfi-Kamran et al. [3]. In that study, pods of 16-32 in-order cores were found to achieve the highest performance density for scale-out workloads. The chip features a modestly sized 4MB LLC to capture the instruction working set and shared OS data. Table II details the architectural configurations, and Section 4.3 explains the address mapping abbreviation listed.

CMP Organization        16-core Scale-Out Processor pod
Core                    In-order @ 2GHz
L1-I/D caches           32KB each, 64B blocks, 2-way
Shared L2 cache         4MB, unified, 16-way, 64B blocks, 4 banks
Interconnect            16x4 crossbar
Memory Controller       FR-FCFS scheduling, open-adaptive page policy, 1 channel, 11.9GB/s bandwidth, RoRaBaCoCh address mapping
Off-chip DRAM           32-64GB, DDR3-1600 (800MHz), 2 ranks, 8 banks per rank, 8KB row-buffer
DRAM timings (cycles)   11-11-11-28, 39-12-6-6, 5-24
TABLE II: Baseline System Configuration

4 Results

Section 4.1 compares the memory scheduling techniques in terms of overall system performance, average memory access latency and row-buffer hit rate. The observations in Section 4.2.1 motivate the page management study in Section 4.2, which evaluates how effectively DRAM page management policies predict when to close a row-buffer and when to keep it open. Section 4.3 investigates the effectiveness of using multi-channel memory controllers.

Algorithm   Parameter                       Value
PAR-BS      Batching-Cap                    5
ATLAS       Quantum                         10M cycles
            α (bias to current quantum)     0.875
            Starvation threshold            50K cycles
RL          # of Q-value tables             32
            Q-value table size              256 Q-values
            α (learning rate)               0.1
            γ (discount rate)               0.95
            ε (random action probability)   0.05
            Starvation threshold            10K cycles
TABLE III: Scheduling Algorithm Configurations

4.1 Memory Scheduling Study

We evaluate the baseline FR-FCFS scheduling algorithm [5, 23, 24] as well as the state-of-the-art memory scheduling techniques PAR-BS [7], ATLAS [6] and RL [10]. We also evaluate the simplest algorithm FCFS_banks. Results are normalized to the baseline FR-FCFS unless otherwise stated. Table III shows the configurations used for PAR-BS, ATLAS and RL.

4.1.1 Performance

Fig. 1: User IPC normalized to FR-FCFS.

Figure 1 shows the user IPC normalized to FR-FCFS. The results show that, under server workloads and especially SCO applications, FR-FCFS outperforms PAR-BS, ATLAS and RL. The only exceptions are TPC-C2 and Media Streaming where the aforementioned policies perform as well as FR-FCFS.

Performance with ATLAS is lower, more so for SCO, which suffers a 20% average drop in performance compared to FR-FCFS. ATLAS, with its long quantum of 10 million cycles, causes some cores to be unfairly given lower priority for long periods. For example, in MapReduce and Web Frontend some cores had 50% lower IPC than others, resulting in overall performance degradation of 52% and 21% respectively. By comparison, the lowest per-core IPC with FR-FCFS is within 85% of the highest per-core IPC for these two applications. Large IPC disparity across cores is also responsible for the 33% performance loss of SPECweb99. Overall, ATLAS achieves average performance that is 20%, 12% and 10% lower for SCO, TRS and DSP respectively.

RL also performs worse than FR-FCFS, more so for DSP where the loss is 10%. DSP's access patterns tend to be more random than those of conventional desktop and scientific applications and thus challenge the RL exploration process, which introduces overheads when triggered frequently.

Fig. 2: Row-buffer hit rate.

The simpler FCFS_Banks performs closer to FR-FCFS than the other policies. FCFS_Banks' average performance is within 6%, 3% and 4% of FR-FCFS for SCO, TRS and DSP respectively. For five out of the six SCO workloads, FCFS_Banks matches the performance of FR-FCFS. This can be explained by Figure 2, which shows that the row-buffer hit rate changes by only -4%, +1% and -2% for SCO, TRS and DSP respectively with FCFS_Banks. FR-FCFS' benefit over FCFS_Banks is that it favors row hits. For these workloads, the cores are either not concurrently competing for the same bank or they access the same row in the same bank at the same time. Accordingly, there is little benefit from reordering the requests heading to the same bank. Web Frontend is the exception as its performance is 37% lower with FCFS_Banks. This workload exhibits a row-buffer hit rate of 55% with FR-FCFS which drops to 45% with FCFS_Banks, leading to a 17% increase in average memory access latency as Figure 2 and Figure 3 show. Compared to FR-FCFS, FCFS_Banks does not need to scan all the request queues every cycle searching for a row hit to promote. As a result, it is simpler to implement and would require less energy. However, overall system energy may increase unless performance stays the same.

Fig. 3: Average memory access latency normalized to FR-FCFS.

4.1.2 Memory Access Latency Sensitivity

The average memory access latency in Figure 3 correlates with the changes in user IPC. However, the sensitivity of performance to memory access latency differs across workloads. For example, DSP is less sensitive to average memory latency than the Web Frontend workload; while DSP suffers a 15% increase in memory access latency with FCFS_Banks, this translates to only a 4% reduction in IPC. DSP exhibits some memory-level parallelism, which hides most of the increase in memory access latency under FCFS_Banks.

Fig. 4: L2 Misses per kilo user instructions (MPKI).

ATLAS suffers a significant increase in memory access latency which is more pronounced for SCO at 2.94x on average and up to 7.78x for MapReduce. This increase in memory latency translates into a 20% performance loss on average. DSP is also negatively impacted by ATLAS’ long-term ranking scheme and RL’s exploratory approach leading to 28% and 37% longer memory access latency respectively.

The MPKI measurements in Figure 4 indicate that SCO and TRS have relatively low memory intensity exhibiting an average L2 MPKI of 5 and 8 respectively. DSP exhibits a higher average MPKI of around 18. This corroborates the different off-chip memory bandwidth demands of scale-out workloads reported in earlier work [4].

4.1.3 Requests Queue Length

Fig. 5: Average read queue length.

The memory controller's average read and write queue occupancies (lengths) are shown in Figure 5 and Figure 6 respectively. None of the memory scheduling techniques ever needed more than a 10-entry read queue and a 50-entry write queue. On average, DSP is more demanding than SCO, with MapReduce under ATLAS being the exception. MapReduce's behavior is due to the 7.78x increase in memory access latency discussed in Section 4.1.2. The observed queue lengths are far lower than those used in previous work [25, 26, 27]. This may be due to the relatively simple in-order cores used here.

Fig. 6: Average write queue length.

RL exhibits noticeably lower write queue lengths compared to the rest. The other techniques switch from a read phase to a write phase only when the write queue occupancy exceeds a threshold, in order to reduce how often the data-bus direction-switching penalty is incurred. RL considers both reads and writes when it selects the memory request to serve next and bases its decision on the best strategy it has learned so far. This gives the memory controller the freedom to serve write requests whenever it can steal a few cycles between critical memory reads.
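
The threshold-based read/write phase switching used by the non-RL schedulers can be sketched as a high/low watermark controller; the watermark values below are assumptions, not taken from the evaluated configurations.

```python
class WriteDrainControl:
    """Switch the data bus between read and write phases based on write-queue
    occupancy so the bus-turnaround penalty is paid infrequently (sketch)."""

    def __init__(self, high_watermark: int = 40, low_watermark: int = 16):
        self.high = high_watermark   # assumed: start draining writes above this
        self.low = low_watermark     # assumed: resume reads once below this
        self.draining = False

    def serve_writes_this_cycle(self, write_queue_len: int) -> bool:
        if not self.draining and write_queue_len >= self.high:
            self.draining = True
        elif self.draining and write_queue_len <= self.low:
            self.draining = False
        return self.draining
```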

Fig. 7: Memory bandwidth utilization.

4.1.4 Off-Chip Bandwidth Utilization

Figure 7 shows the off-chip bandwidth utilization per workload. Utilization under SCO ranges from 14% to 50% of the available peak bandwidth with an average of 34%. TRS exhibits a similar average while DSP has higher off-chip access demands with an average utilization of 54%. The measured average memory bandwidth utilization motivates the study in Section 4.3 that considers the impact of introducing additional memory channels.

4.1.5 Summary of Memory Scheduling Study Results

A technique as simple as FR-FCFS has proven best for the workloads studied under the pod-based in-order processor design proposed by Lotfi-Kamran et al. [3]. For the server workloads, and more so for scale-out workloads, FR-FCFS outperformed the other state-of-the-art memory scheduling techniques. These techniques were designed to outperform FR-FCFS for desktop and parallel applications such as SPEC CPU2006 [28] and SPLASH-2 [29] in environments where fairness among threads or cores is a concern, which is not the case here. Scale-out workloads exhibit relatively low MLP and different access patterns compared to scientific applications. Moreover, the cores used here are relatively simple, as advised by Lotfi-Kamran et al., who found that these workloads do not exhibit ILP as high as SPEC and PARSEC workloads do. Combined, these factors explain the differences in relative performance across the various memory scheduling policies. Even a simple FCFS scheduler that exploits bank-level parallelism is within 6% of FR-FCFS in all cases and matches its performance for most.

The analysis showed that relatively short read and write queues are sufficient for these workloads primarily due to the relatively low MLP. Finally, off-chip memory bandwidth utilization was shown to be relatively low suggesting that adding additional memory channels may not be needed if performance is the only consideration.

4.2 Page Management Policies Study

4.2.1 Activated Row-Buffer Reuse

Fig. 8: Percentage of single-access row-buffer activations under OA.

The baseline OA policy is often preferred for workloads that exhibit significant row-buffer locality. As Figure 2 shows, with this policy, the average row-buffer hit rates are relatively low at 37%, 33% and 27.5% for SCO, TRS and DSP respectively. To reason about the access patterns that might result in such row-buffer hit rates, we studied histograms of how frequently an activated row-buffer is reused before being precharged. As Figure 8 shows, all workloads access 77%-90% of their activated rows only once. Keeping a row open as long as possible will thus often delay subsequent accesses, which must first wait for the row to be closed. This observation suggests that using the CA policy might be better. CA precharges the row-buffer directly after the column access if no more hits are waiting in the queue. Having a high percentage of single-access row activations is not necessarily at odds with a high row-buffer hit rate. For example, in Media Streaming 76% of the row activations observe only a single access while the remaining 24% of the row activations experience a high number of hits.

Fig. 9: Row-buffer hit rate for different page management policies normalized to OA.

Figure 9 shows that CA exhibits much lower row-buffer hit rates, which are below 6% for all workloads. This indicates that a significant fraction of the hits were due to the optimistic OA. One would expect this reduction in row-buffer hits to hurt performance and memory access latency. In practice, the performance gained from closing the single-access rows early under CA compensates for the performance lost due to fewer row-buffer hits. Figure 10 shows that, under CA, the average memory access latency did not change for SCO, while it decreased by 4% and 13% for TRS and DSP. Performance was reduced by 2.5% for SCO but improved by 4% for DSP as Figure 11 shows. Web Frontend and Media Streaming, which exhibited the highest row-buffer hit rates under OA, suffered a 15% increase in access latency, and Web Frontend lost 20% of its performance. The relatively higher MPKI and MLP of Media Streaming limited its performance loss to 1%. Data Serving, MapReduce and SAT Solver saw their access latencies improve by 8%, 12% and 9% and their performance improve by 1% to 2%.

Fig. 10: Average memory access latency for different page management policies normalized to OA.
Fig. 11: User IPC for different page management policies normalized to OA.

4.2.2 Predictive Page Management Policies

The results of the previous section motivated us to study the behavior of state-of-the-art predictive page management policies. This section compares RBPP [15] and ABPP [16] to CA and OA policies in terms of memory access latency, row-buffer hits preservation and user IPC.

As Figure 9 shows, RBPP preserves 70%, 75% and 86% of the row-buffer hits for SCO, TRS and DSP respectively. ABPP generally preserves fewer row-buffer hits, more so for DSP. Figure 10 shows that DSP's access latency is reduced by 6% with RBPP, leading to a 3% increase in user IPC in Figure 11. IPC for SCO and TRS degrades by 4% and 2% under RBPP. This correlates with the losses in row-buffer hits in Figure 9.

The RBPP and ABPP policies favor capturing row-buffer hits over timely closure of single-access row activations. As a result, they do not avoid most of the penalty due to late row closure, and they fail to capture some of the row hits when they prematurely close a row. Overall performance is equal to or slightly lower than with OA. These policies were designed and tested for desktop applications such as SPEC CPU2006.

4.2.3 Summary of Page Management Policies Results

This section found that scale-out workloads exhibit a high percentage of single-access row activations. Thus, smarter page management policies are needed to timely close pages in order to achieve the following two goals: 1) Capturing as many row hits as possible, and 2) closing a page as early as necessary to avoid penalizing subsequent accesses to other rows.

4.3 Multi-Channel Memory Systems Study

Fig. 12: Normalized user IPC as the number of memory channels increases.

Modern server processors incorporate several memory channels [11, 12]. This section considers the performance impact of using multi-channel memory controllers, a study motivated by the relatively low memory bandwidth utilization of the studied workloads. Specifically, this section studies the impact of integrating 2 and 4 memory channels in terms of performance, memory access latency and row-buffer hit rate. The address mapping scheme, that is, how physical addresses are mapped across the off-chip memory system, can impact overall performance. For this reason, we studied a number of address mapping schemes that differ in which address bits they use to select the DRAM channel (Ch), column (Co), bank (Ba), rank (Ra) and row (Ro). The schemes studied are: RoRaBaCoCh (baseline mapping), RoRaBaChCo, RoRaChBaCo and RoChRaBaCo. In the interest of space, we report the results of the best performing scheme per workload and the averages. The best performing scheme for each workload is reported in Table IV.

Workload 2-channel 4-channel
Data Serving RoRaBaChCo RoRaChBaCo
MapReduce RoRaChBaCo RoChRaBaCo
SAT Solver RoChRaBaCo RoRaChBaCo
Web Frontend RoChRaBaCo RoRaBaCoCh
Web Search RoRaChBaCo RoRaBaChCo
Media Streaming RoChRaBaCo RoChRaBaCo
WSPEC99 RoRaBaCoCh RoRaChBaCo
TPC-C1 RoRaBaChCo RoChRaBaCo
TPC-C2 RoRaChBaCo RoChRaBaCo
TPC-H Q2 RoRaBaChCo RoChRaBaCo
TPC-H Q6 RoRaChBaCo RoChRaBaCo
TPC-H Q17 RoRaBaChCo RoChRaBaCo
TABLE IV: The best performing multi-channel mapping scheme for each workload

The baseline scheme, RoRaBaCoCh, generally had the worst performance because accesses that would be row hits in the 1-channel system may now map to a different channel; the scheme alternates successive cache blocks between the memory channels, which means that sequential accesses do not map to the same DRAM row. As shown in Figure 13, RoRaBaChCo, RoRaChBaCo and RoChRaBaCo exhibit better row-buffer hit rates than the baseline system. Figure 12 shows that integrating more on-chip memory channels does not enhance SCO's performance significantly and in some cases hurts it. The performance of Web Frontend dropped by 10% and 9% on the multi-channel systems. The highest gains, 6% and 8.5% for the 2- and 4-channel configurations, were observed for Data Serving. The average improvement for SCO was below 1% and 1.7% for the 2- and 4-channel systems respectively. DSP behaves differently and exhibits 11.5% and 19% average performance improvements for the 2- and 4-channel systems. Finally, TRS also exhibits average performance improvements of 2.3% and 6% on the 2- and 4-channel systems.
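
To make the mapping notation concrete, the sketch below decodes a physical address according to a named bit-field order. The field widths assume the 4-channel variant of the baseline DRAM organization (64B blocks, 128 blocks per 8KB row, 8 banks, 2 ranks); the decoding helper itself is illustrative and not how the simulated controller is implemented.

```python
# Scheme names list fields from most-significant to least-significant bits,
# e.g. RoRaBaCoCh places the channel bits just above the 64B block offset.
FIELD_BITS = {"Ch": 2, "Co": 7, "Ba": 3, "Ra": 1}   # assumed: 4 channels, 128 cols, 8 banks, 2 ranks
BLOCK_OFFSET_BITS = 6                                # 64B cache blocks

def decode(addr: int, scheme: str) -> dict:
    """Split a physical address into DRAM coordinates under `scheme`,
    a concatenation of Ro/Ra/Ba/Co/Ch with the MSB field first."""
    fields = [scheme[i:i + 2] for i in range(0, len(scheme), 2)]
    addr >>= BLOCK_OFFSET_BITS
    coords = {}
    for f in reversed(fields):           # consume from the least-significant end
        if f == "Ro":
            coords["Ro"] = addr          # the row takes all remaining upper bits
            addr = 0
        else:
            width = FIELD_BITS[f]
            coords[f] = addr & ((1 << width) - 1)
            addr >>= width
    return coords

# Two consecutive 64B blocks fall on different channels under RoRaBaCoCh,
# but share a channel (and row) under RoChRaBaCo.
a, b = 0x1234040, 0x1234080
print(decode(a, "RoRaBaCoCh")["Ch"], decode(b, "RoRaBaCoCh")["Ch"])   # differ
print(decode(a, "RoChRaBaCo")["Ch"], decode(b, "RoChRaBaCo")["Ch"])   # equal
```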

Fig. 13: Normalized row-buffer hit rates as the number of memory channels increases.
Fig. 14: Normalized memory access latency as the number of memory channels increases.

Figure 13 and Figure 14 explain why performance did not improve for SCO and TRS in contrast to DSP. SCO and TRS improved the least in terms of row-buffer hit rates and memory access latencies. For both categories, the average row-buffer hit rate increased by 1.3x and 1.6x for the 2- and 4-channel systems. The average memory access latency of SCO decreased to 81% and 70% of the baseline. TRS experiences similar reductions in average memory access latency. Meanwhile, DSP's average hit rates increased by 1.7x and 2.3x on the 2- and 4-channel systems respectively. Moreover, DSP's memory access latency decreased to 64% and 47% of the baseline access latency. Along with the relatively higher MPKI of DSP, shown in Figure 4, this explains the difference in the IPC gains between DSP and SCO when multiple channels are used.

Despite the improvements in row-buffer hits and memory access latency, Web Frontend's performance dropped by around 10%. This workload exhibited an 11% and 25% increase in the total number of memory accesses for the 2- and 4-channel systems. The extra memory references are mostly DMA/IO and atomic memory requests, which hit in the row buffers and reduce the average latency. However, they increase contention and latency for critical accesses, thus hurting performance. The reported user IPC, which measures user-level progress per cycle, is expected to go down due to the additional congestion from these extra accesses.

Conclusion: Under the pod-based in-order processor design proposed by Lotfi-Kamran et al. [3], integrating multi-channel memory controllers on-chip with higher die area and power consumption does not improve the performance for scale-out workloads. One channel can satisfy the low off-chip memory bandwidth demand of scale-out workloads under the studied pod configuration. However, increasing core count would lead to a higher demand for off-chip bandwidth which could benefit from multiple channels.

5 Limitations of this study

The focus of this study is limited to the pod-based in-order scale-out processor design proposed by Lotfi-Kamran et al. [3]. Although their study proposed an additional out-of-order scale-out processor design, we studied the in-order design as it was demonstrated to have higher performance density, i.e., throughput per unit area. Aggressive out-of-order designs might lead to different conclusions about how simple the memory scheduling technique should be and the needed off-chip memory bandwidth due to a potential increase in the MLP generated under such architectures.

Lotfi-Kamran et al. assumed a 270 mm² die, which was estimated to fit three 32-core pods and six memory controllers. This work studies the memory system requirements for one such pod, representative of a lower-end server processor. Future work should consider higher pod counts.

The study did not include the TCM [30] memory controller policy which also targets fairness; experiments with ATLAS and PAR-BS showed that fairness is not an issue for scale-out workloads.

The study includes a subset of the possible address mapping schemes and did not consider additional permutation-based interleaving schemes. However, the results presented here have identified performance deficiencies which could guide such a future study.

This study is limited to reevaluating previous proposals for memory scheduling algorithms, page management policies and multi-channel memory controllers. While no novel solution is presented, this study provides directions for simplifying or improving these designs to better match the needs of emerging server workloads. Finally, this study focused solely on performance. Energy and power are equally important considerations. The performance results presented here can be useful in such a future study.

6 Related Work

Rixner et al. introduced the FR-FCFS scheduling technique as well as other techniques that give preference to column access commands, maximizing row-buffer hits and memory throughput [5, 23]. Rixner targeted web server workloads from SPECweb99, which exhibit different memory access behavior than today's widespread scale-out applications. We targeted different workloads and emphasized directions that could simplify memory controller design.

Natarajan et al. studied the impact of several memory controller design aspects on server processor performance [31]. The study investigated open-page vs. close-page policies, in-order vs. out-of-order memory requests scheduling and memory ranks interleaving. However, the study was limited to shared-bus multi-processor architectures and was based on synthetic random address traffic. Our study covers wider aspects of memory controller design, includes state-of-the-art proposals for each and studies full applications.

Barroso et al. introduced Piranha [32], an early architecture that balanced complexity, performance and energy to better meet the demands of server applications. PicoServer [1] is a simplified-core CMP design for server workloads that also favors relatively small on-chip caches. PicoServer relies on 3D die-stacking technology to provide low-latency DRAM access.

Abts et al. studied intelligent memory controller placement in many-core server processors [33]. The study proposed diamond placement within the many-core tiles along with an enhanced routing algorithm for requests and replies, namely the class-based deterministic routing, to avoid the hot spots caused by the memory controllers. Our study is orthogonal to Abts’s placement research as we investigate the microarchitecture of the memory controllers that better suits the needs of the emerging cloud workloads.

Hardavellas et al. proposed using CMPs for scale-out server workloads through exploiting heterogeneous architectures and dark silicon to power up only the application-suitable cores [34]. The work was extended by Ferdman et al. [4] where scale-out workloads were shown to behave differently than traditional server workloads. The comparisons showed that commodity server processors are over-provisioned in several ways. Building upon that study, Lotfi-Kamran et al. introduced a pod-based CMP design that increases performance density [3].

7 Conclusions

Previous work has characterized the behavior of scale-out workloads and their impact on the core architecture and the on-chip memory hierarchy.

This work builds upon these past studies by also studying the off-chip memory access characteristics of scale-out workloads and their interaction with state-of-the-art memory controller policies. The results showed that relatively simple memory scheduling algorithms work best for these workloads, outperforming more advanced algorithms tuned to desktop workloads. The design of these advanced algorithms needs to be revisited for scale-out workloads. We found that scale-out workloads exhibit poor row-buffer locality and, thus, better memory page management policies are needed to take advantage of the locality that exists while avoiding delays for the majority of requests, which are row conflicts.

Finally, additional memory channels did not significantly improve the performance for scale-out workloads. However, other scheduling policies, a different memory mapping, a different core architecture, or different data sets could have resulted in a different conclusion.

References

  • [1] T. Kgil, S. D’Souza, A. Saidi, N. Binkert, R. Dreslinski, T. Mudge, S. Reinhardt, and K. Flautner, “Picoserver: Using 3d stacking technology to enable a compact energy efficient chip multiprocessor,” in Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XII, 2006, pp. 117–128.
  • [2] V. Janapa Reddi, B. C. Lee, T. Chilimbi, and K. Vaid, “Web search using mobile cores: Quantifying and mitigating the price of efficiency,” in Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA ’10, pp. 314–325.
  • [3] P. Lotfi-Kamran, B. Grot, M. Ferdman, S. Volos, O. Kocberber, J. Picorel, A. Adileh, D. Jevdjic, S. Idgunji, E. Ozer, and B. Falsafi, “Scale-out processors,” in Proceedings of the 39th Annual International Symposium on Computer Architecture, ISCA ’12, pp. 500–511.
  • [4] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi, “Clearing the clouds: A study of emerging scale-out workloads on modern hardware,” in Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVII, 2012, pp. 37–48.
  • [5] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens, “Memory access scheduling,” in Proceedings of the 27th Annual International Symposium on Computer Architecture, ISCA ’00, pp. 128–138.
  • [6] Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter, “Atlas: A scalable and high-performance scheduling algorithm for multiple memory controllers,” in Proceedings of the 16th Annual International Symposium on High Performance Computer Architecture, HPCA ’10, Jan, pp. 1–12.
  • [7] O. Mutlu and T. Moscibroda, “Parallelism-aware batch scheduling: Enhancing both performance and fairness of shared dram systems,” in Proceedings of the 35th Annual International Symposium on Computer Architecture, ISCA ’08, pp. 63–74.
  • [8] K. J. Nesbit, N. Aggarwal, J. Laudon, and J. E. Smith, “Fair queuing memory systems,” in Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 39, 2006, pp. 208–222.
  • [9] O. Mutlu and T. Moscibroda, “Stall-time fair memory access scheduling for chip multiprocessors,” in Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 40.   IEEE Computer Society, 2007, pp. 146–160.
  • [10] E. Ipek, O. Mutlu, J. F. Martínez, and R. Caruana, “Self-optimizing memory controllers: A reinforcement learning approach,” in Proceedings of the 35th Annual International Symposium on Computer Architecture, ISCA ’08, pp. 39–50.
  • [11] “Intel® Xeon® Processor E5-2699 v4,” http://ark.intel.com/products/91317/Intel-Xeon-Processor-E5-2699-v4-55M-Cache-2_20-GHz.
  • [12] “AMD Opteron™ 6300 Series Processors,” http://www.amd.com/en-us/products/server/opteron/6000/6300.
  • [13] T. Moscibroda and O. Mutlu, “Memory performance attacks: Denial of memory service in multi-core systems,” in Proceedings of 16th USENIX Security Symposium on USENIX Security Symposium, SS’07, pp. 18:1–18:18.
  • [14] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, 1st ed.   MIT Press, 1998.
  • [15] X. Shen, F. Song, H. Meng, S. An, and Z. Zhang, “Rbpp: A row based dram page policy for the many-core era,” in Proceedings of the 20th International Conference on Parallel and Distributed Systems, ICPADS ’14, Dec, pp. 999–1004.
  • [16] M. Awasthi, D. W. Nellans, R. Balasubramonian, and A. Davis, “Prediction based dram row-buffer management in the many-core era,” in Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques, PACT ’11, pp. 183–184.
  • [17] S. Kareenahalli, Z. Bogin, and M. Shah, “Adaptive idle timer for a memory device,” US Patent 6,910,114, Jun. 21, 2005.
  • [18] C. Teh, S. Kareenahalli, and Z. Bogin, “Dynamic update adaptive idle timer,” US Patent App. 11/394,461, Oct. 4, 2007.
  • [19] Y. Xu, A. S. Agarwal, and B. T. Davis, “Prediction in dynamic sdram controller policies,” in Proceedings of the 9th International Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation, SAMOS ’09, pp. 128–138.
  • [20] S.-I. Park and I.-C. Park, “History-based memory mode prediction for improving memory performance,” in Proceedings of the 2003 International Symposium on Circuits and Systems, ISCAS ’03, vol. 5, May 2003, pp. V–185–V–188 vol.5.
  • [21] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood, “Multifacet’s general execution-driven multiprocessor simulator (gems) toolset,” SIGARCH Comput. Archit. News, vol. 33, no. 4, pp. 92–99, Nov. 2005.
  • [22] T. F. Wenisch, R. E. Wunderlich, M. Ferdman, A. Ailamaki, B. Falsafi, and J. C. Hoe, “Simflex: Statistical sampling of computer system simulation,” IEEE Micro, vol. 26, no. 4, pp. 18–31, Jul. 2006.
  • [23] S. Rixner, “Memory controller optimizations for web servers,” in Proceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 37, 2004, pp. 355–366.
  • [24] W. Zuravleff and T. Robinson, “Controller for a synchronous dram that maximizes throughput by allowing memory requests and commands to be issued out of order,” US Patent 5,630,096, May 13, 1997.
  • [25] S. A. Przybylski, Cache and Memory Hierarchy Design: A Performance-directed Approach.   Morgan Kaufmann Publishers Inc., 1990.
  • [26] C. Sangani, M. Venkatesan, and R. Ramesh, “Phase aware memory scheduling,” Stanford University, Electrical Engineering department, Tech. Rep., 2013.
  • [27] G. Liu, X. Zhang, D. Wang, Z. Liu, and H. Wang, “Security memory system for mobile device or computer against memory attacks,” in Trustworthy Computing and Services, Communications in Computer and Information Science.   Springer Berlin Heidelberg, 2014, vol. 426, pp. 1–8.
  • [28] J. L. Henning, “Spec cpu2006 benchmark descriptions,” SIGARCH Comput. Archit. News, vol. 34, no. 4, pp. 1–17, Sep. 2006.
  • [29] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, “The splash-2 programs: Characterization and methodological considerations,” in Proceedings of the 22nd Annual International Symposium on Computer Architecture, ISCA ’95, pp. 24–36.
  • [30] Y. Kim, M. Papamichael, O. Mutlu, and M. Harchol-Balter, “Thread cluster memory scheduling: Exploiting differences in memory access behavior,” in 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, Dec 2010, pp. 65–76.
  • [31] C. Natarajan, B. Christenson, and F. Briggs, “A study of performance impact of memory controller features in multi-processor server environment,” in Proceedings of the 3rd Workshop on Memory Performance Issues: In Conjunction with the 31st International Symposium on Computer Architecture, WMPI ’04, pp. 80–87.
  • [32] L. A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese, “Piranha: A scalable architecture based on single-chip multiprocessing,” in Proceedings of the 27th Annual International Symposium on Computer Architecture, ISCA ’00, pp. 282–293.
  • [33] D. Abts, N. D. Enright Jerger, J. Kim, D. Gibson, and M. H. Lipasti, “Achieving predictable performance through better memory controller placement in many-core cmps,” in Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA ’09, pp. 451–461.
  • [34] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki, “Toward dark silicon in servers,” IEEE Micro, vol. 31, no. 4, pp. 6–15, Jul. 2011.