1 Introduction
Energy proportionality is a key design goal pursued by architects of modern multicore CPU platforms [1, 2]. One of its implications is that optimizing an application for performance also optimizes it for energy. Modern multicore CPUs, however, have many inherent complexities, which are: a) severe resource contention due to tight integration of tens of cores, organized in multiple sockets with a multilevel cache hierarchy and contending for shared on-chip resources such as the last-level cache (LLC), the interconnect (for example, Intel's QuickPath Interconnect or AMD's HyperTransport), and the DRAM controllers; b) non-uniform memory access (NUMA), where the time for memory access between a core and main memory is not uniform and where main memory is distributed between locality domains or groups called NUMA nodes; and c) dynamic power management (DPM) of multiple power domains (CPU sockets, DRAM).
These complexities were shown to result in complex (non-linear) functional relationships between performance and workload size and between dynamic energy and workload size for real-life data-parallel applications on modern multicore CPUs [3, 4, 5]. Motivated by these research findings and based on further deep exploration, we show that energy proportionality does not hold for multicore CPUs. This creates the opportunity for bi-objective optimization of applications for performance and energy on a single multicore CPU.
We now present an overview of notable state-of-the-art methods solving the bi-objective optimization problem of an application for performance and energy on multicore CPU platforms. System-level methods are introduced first, since they have dominated the landscape, followed by recent research on application-level methods. We then describe the proposed solution method solving the bi-objective optimization problem of an application for performance and energy on a single multicore CPU.
Solution methods solving the bi-objective optimization problem for performance and energy can be broadly classified into system-level and application-level categories. System-level methods aim to optimize performance and energy of the environment where the applications are executed. These methods employ application-agnostic models and hardware parameters as decision variables. They are principally deployed at the operating system (OS) level and therefore require changes to the OS. They do not involve any changes to the application. The methods can be further divided into the following prominent groups:
Thread schedulers that are contention-aware and that exploit cooperative data sharing between threads [6, 7]. The goal of such a scheduler is to find thread-to-core mappings that determine Pareto-optimal solutions for performance and energy. The schedulers operate at both user level and OS level, with those at OS level requiring changes to the OS. The thread-to-core mapping is the key decision variable. Performance monitoring counters such as the LLC miss rate and LLC access rate are used to predict performance given a thread-to-core mapping.

Dynamic private cache (L1 and L2) reconfiguration and shared cache (L3) partitioning strategies [8, 9]. The proposed solutions in this category mitigate contention for shared on-chip resources such as the last-level cache by physically partitioning it and therefore require substantial changes to the hardware or the OS [10].

Thermal management algorithms that place or migrate threads not only to alleviate thermal hotspots and temperature variations in a chip but also to reduce energy consumption during an application execution [11, 12]. Some key strategies are dynamic power management (DPM), where idle cores are switched off; dynamic voltage and frequency scaling (DVFS), which throttles the frequencies of the cores based on their utilization; and migration of threads from hot cores to colder cores.

Asymmetry-aware schedulers that exploit the asymmetry between sets of cores in a multicore platform to find thread-to-core mappings that provide Pareto-optimal solutions for performance and energy [13, 14]. Asymmetry can be explicit, with fast and slow cores, or implicit, due to non-uniform frequency scaling between different cores or performance differences introduced by manufacturing variations. The key decision variables employed here are the thread-to-core mapping and DVFS. A typical strategy is to map the most power-intensive threads to less power-hungry cores and then apply DVFS to the cores to ensure all threads complete at the same time whilst satisfying a power budget constraint.
In the second category, solution methods optimize applications rather than the executing environment. The methods use application-level decision variables and predictive models for the performance and energy consumption of applications to solve the bi-objective optimization problem. The dominant decision variables include the number of threads, loop tile size, workload distribution, etc. Following the principle of energy proportionality, a dominant class of such solution methods aims to achieve optimal energy reduction by optimizing for performance alone. Definitive examples are scientific routines offered by vendor-specific software packages that are extensively optimized for performance. For example, the Intel Math Kernel Library [15] provides extensively optimized multithreaded basic linear algebra subprograms (BLAS) and 1D, 2D, and 3D fast Fourier transform (FFT) routines for Intel processors. Open-source packages such as
[16, 17, 18] offer the same interface functions but contain portable optimizations and may exhibit better average performance than a heavily optimized vendor package [19, 20]. The optimized routines in these software packages allow the employment of one key decision variable, the number of threads. A given workload is load-balanced between the threads. In this work, we show that the optimal number of threads (and consequently the load-balanced workload distribution) maximizing the performance does not necessarily minimize the energy consumption of multicore CPUs.

State-of-the-art research on application-level optimization methods [3, 4, 5] demonstrates that, due to the aforementioned design complexities of modern multicore CPU platforms, the functional relationships between performance and workload size and between dynamic energy and workload size for real-life data-parallel applications have complex (non-linear) properties, and shows that workload distribution has become an important decision variable that can no longer be ignored. Briefly, the total energy consumption during an application execution is the sum of the dynamic and static energy consumptions. Static energy consumption is defined as the energy consumed by the platform without the application executing. Dynamic energy consumption is calculated by subtracting this static energy consumption from the total energy consumed by the platform during the application execution. The works [3, 4, 5] propose model-based data partitioning methods that take as input discrete performance and dynamic energy functions with no shape assumptions, which accurately and realistically account for the resource contention and NUMA inherent in modern multicore CPU platforms.
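The decomposition of total energy into static and dynamic components described above can be sketched as follows. This is a minimal illustration; the function name and the measurement values are hypothetical, and in practice the static (base) power comes from metering the idle platform:

```python
def dynamic_energy(total_energy_j, static_power_w, exec_time_s):
    """Dynamic energy = total energy measured during the run minus
    the static (idle) energy consumed over the same interval."""
    static_energy_j = static_power_w * exec_time_s
    return total_energy_j - static_energy_j

# Hypothetical measurements: 12000 J total over a 60 s run on a
# platform whose idle (static) power is 100 W.
e_dyn = dynamic_energy(12000.0, 100.0, 60.0)  # 12000 - 100*60 = 6000 J
```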
Using a simulation of the execution of a data-parallel matrix multiplication application based on OpenBLAS DGEMM on a homogeneous cluster of multicore CPUs, it is shown in [3] that optimizing for performance alone results in average and maximum dynamic energy reductions of 24% and 68%, but optimizing for dynamic energy alone results in performance degradations of 95% and 100%. For a 2D fast Fourier transform application based on FFTW, the average and maximum dynamic energy reductions are 29% and 55%, and the average and maximum performance degradations are both 100%. The research work [4] proposes a solution method to solve the bi-objective optimization problem of an application for performance and energy on homogeneous clusters of modern multicore CPUs. This method is shown to determine a diverse set of globally Pareto-optimal solutions, whereas existing solution methods give only one solution when the problem size and number of processors are fixed. The methods [3, 4, 5] target homogeneous high performance computing (HPC) platforms. Khaleghzadeh et al. [21] propose a solution method solving the bi-objective optimization problem on heterogeneous processors. The authors prove that, for an arbitrary number of processors with linear execution time and dynamic energy functions, the globally Pareto-optimal front is linear and contains an infinite number of solutions, of which one solution is load balanced while the rest are load imbalanced. A data partitioning algorithm is presented that takes as input discrete performance and dynamic energy functions with no shape assumptions.
Table IV: Specifications of the experimental servers.

Technical Specifications | HCLServer1 (S1) | HCLServer2 (S2) | HCLServer3 (S3) | HCLServer4 (S4)
Processor | Intel Xeon Gold 6152 | Intel Haswell E5-2670 v3 | Intel Xeon CPU E5-2699 | Intel Xeon Platinum 8180
Core(s) per socket | 22 | 12 | 18 | 28
Socket(s) | 1 | 2 | 2 | 2
L1d cache, L1i cache | 32 KB, 32 KB | 32 KB, 32 KB | 32 KB, 32 KB | 32 KB, 32 KB
L2 cache, L3 cache | 256 KB, 30720 KB | 256 KB, 30976 KB | 256 KB, 46080 KB | 1024 KB, 39424 KB
Total main memory | 96 GB | 64 GB | 256 GB | 187 GB
Power meter | WattsUp Pro | WattsUp Pro | — | Yokogawa WT310
The research works [3, 4, 5, 21] are theoretical, demonstrating performance and energy improvements based on simulations of clusters of homogeneous and heterogeneous nodes. Khokhriakov et al. [20] present two novel optimization methods to improve the average performance of FFT routines on modern multicore CPUs. The methods employ workload distribution as the decision variable and are based on parallel computing employing thread-groups. They utilize a load-imbalancing data partitioning technique that determines optimal workload distributions between the thread-groups, which may not load-balance the application in terms of execution time. The inputs to the methods are discrete 3D functions of performance against problem size of the thread-groups, and the methods can be employed as nodal optimization techniques to construct a 2D FFT routine highly optimized for a dedicated target multicore CPU. The authors employ the methods to demonstrate significant performance improvements over the basic FFTW and Intel MKL FFT 2D routines on a modern Intel Haswell multicore CPU consisting of thirty-six physical cores.
The findings in [3, 4, 5, 20, 21] motivate us to study the influence of a three-dimensional decision variable space on the bi-objective optimization of applications for performance and energy on multicore CPUs. The three decision variables are: a) the number of identical multithreaded kernels (thread-groups) involved in the parallel execution of an application; b) the number of threads in each thread-group; and c) the workload distribution between the thread-groups. We focus exclusively on the first two decision variables in this work. The number of possible workload distributions increases exponentially with the number of thread-groups employed in the execution of a data-parallel application, and reducing this complexity would require the employment of thread-group-specific performance and energy models. This is a subject of our future work.
We propose and study the first application-level method for bi-objective optimization of multithreaded data-parallel applications on a single multicore CPU for performance and energy. The method uses two decision variables: the number of identical multithreaded kernels (thread-groups) executing the application in parallel and the number of threads in each thread-group. The workload distribution is not a decision variable; it is fixed so that a given workload is always partitioned equally between the thread-groups. The method allows full reuse of highly optimized scientific codes and does not require any changes to the hardware or the OS. The first step of the method is to write a data-parallel version of the base kernel that can be executed using a variable number of thread-groups in parallel and that solves the same problem as the base kernel, which employs one thread-group.
We demonstrate our method using four multithreaded applications: a) 2D FFT using FFTW 3.3.7; b) 2D FFT using Intel MKL FFT; c) dense matrix-matrix multiplication using OpenBLAS; and d) dense matrix-matrix multiplication using Intel MKL.
Four different modern Intel multicore CPUs are used in the experiments: a) a single-socket Intel Skylake consisting of 22 physical cores; b) a dual-socket Intel Haswell consisting of 24 physical cores; c) a dual-socket Intel Haswell consisting of 36 physical cores; and d) a dual-socket Intel Skylake consisting of 56 cores. Specifications of the experimental servers S1, S2, S3, and S4 equipped with these CPUs are given in Table IV. Servers S1, S2, and S4 are equipped with power meters and fully instrumented for system-level energy measurements. Server S3 is not equipped with a power meter and is therefore not employed in the experiments for single-objective optimization for energy or bi-objective optimization for performance and energy.
Figure 1 illustrates the energy non-proportionality on S2 found by our method for the OpenBLAS DGEMM application solving workload size N = 16384. Data points in the graph represent different configurations of the multithreaded application solving exactly the same problem. Energy proportionality is signified by a monotonically increasing relationship between energy and execution time. This is clearly not the case for the relationship shown in the figure.
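Energy proportionality, in the sense used here, means that dynamic energy is monotonically non-decreasing in execution time across application configurations. A minimal sketch of this check over hypothetical (time, energy) points; the function name and sample data are illustrative, not taken from the paper's measurements:

```python
def is_energy_proportional(points):
    """points: list of (exec_time, dynamic_energy) pairs, one per
    application configuration. Returns True if energy is monotonically
    non-decreasing in execution time, i.e. the platform behaves
    energy-proportionally over these configurations."""
    ordered = sorted(points)                 # order by execution time
    energies = [e for _, e in ordered]
    return all(e1 <= e2 for e1, e2 in zip(energies, energies[1:]))

# Hypothetical configurations: a faster configuration consuming MORE
# energy than a slower one breaks proportionality.
proportional = is_energy_proportional([(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)])
broken = is_energy_proportional([(1.0, 50.0), (2.0, 20.0), (3.0, 30.0)])
```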
The average and maximum performance improvements using the number of thread-groups and the number of threads per group as decision variables for performance optimization on a single-socket multicore CPU (S1) are (7%, 26.3%), (5%, 6.5%), and (27%, 69%) for the OpenBLAS DGEMM, Intel MKL DGEMM, and Intel MKL FFT applications against their best single thread-group configurations. Along with the performance optimization, the energy improvements for OpenBLAS DGEMM and Intel MKL DGEMM are (7.9%, 30%) and (35.7%, 67%) against their best single thread-group configurations.
At the same time, optimization for performance alone results in average and maximum increases in dynamic energy consumption of (22.5%, 67%) and (87%, 89%) for the Intel MKL DGEMM and Intel MKL FFT applications in comparison with their energy-optimal configurations. Optimization for dynamic energy alone results in average and maximum performance degradations of (27%, 39%) and (19.7%, 38.2%) in comparison with their performance-optimal configurations. The average and maximum numbers of globally Pareto-optimal solutions for Intel MKL DGEMM and Intel MKL FFT are (2.3, 3) and (2.6, 3).
On the 24-core dual-socket CPU (S2), the average and maximum performance improvements are (16%, 20%) and (8%, 21%) for the OpenBLAS DGEMM and Intel MKL DGEMM applications against their best single thread-group configurations. Even higher average and maximum performance improvements of (30%, 50%) are achieved for the FFTW application on the 56-core dual-socket CPU (S4). Again, the improvements are measured against the original single thread-group basic routine employing the optimal number of threads.
At the same time, we find that optimizing the OpenBLAS DGEMM and Intel MKL DGEMM applications on S2 for performance only results in average and maximum increases in dynamic energy consumption of (15%, 35%) and (7.1%, 49%) in comparison with their energy-optimal configurations, and optimizing the Intel MKL FFT and FFTW applications on S4 for performance alone results in average and maximum increases in dynamic energy consumption of (7%, 25%) and (15%, 57%).
On S2, optimizing the OpenBLAS DGEMM and Intel MKL DGEMM applications for energy only results in average and maximum performance degradations of (2.5%, 6%) and (3.7%, 11%). On S4, the average and maximum performance degradations are (20%, 33%) and (31%, 49%) for the Intel MKL FFT and FFTW applications. The performance degradations are measured against the performance-optimal configuration.
By solving the bi-objective optimization problem on the three servers {S1, S2, S4}, the average and maximum numbers of globally Pareto-optimal solutions determined by our method are (2.7, 3), (3, 11), (2.4, 5), and (1.8, 4) for the Intel MKL FFT, FFTW, OpenBLAS DGEMM, and Intel MKL DGEMM applications. Finally, we propose a qualitative dynamic energy model based on linear regression and employing performance monitoring counters (PMCs) as parameters, which we use to explain the discovered energy non-proportionality and the Pareto-optimal solutions determined by our method.
The main contributions in this work are the following:

We show that energy proportionality does not hold for multicore CPUs, thereby affording an opportunity for bi-objective optimization for performance and energy.

We propose and study the first application-level method for bi-objective optimization of multithreaded data-parallel applications for performance and energy. The method uses two decision variables: the number of identical multithreaded kernels (thread-groups) and the number of threads in each thread-group. Using four highly optimized data-parallel applications, the proposed method is shown to determine good numbers of globally Pareto-optimal configurations of the applications, providing programmers better trade-offs between performance and energy consumption.

A qualitative dynamic energy model based on linear regression and employing performance monitoring counters (PMCs) as parameters is proposed to explain the Pareto-optimal solutions determined by our solution method for multicore CPUs. The model shows that the energy non-proportionality on our experimental platforms for the two data-parallel applications is due to disproportionately high energy consumption by data translation lookaside buffer (dTLB) activity.
The rest of the paper is organized as follows. Section 2 presents the related work. Section 3 contains a brief background on multi-objective optimization and the concept of Pareto-optimality. Section 4 describes our solution method. Section 5 describes the first step of our solution method for two data-parallel applications, 2D fast Fourier transform and matrix-matrix multiplication. Section 7 contains the experimental results. Section 7.4 presents our dynamic energy model employing PMCs as parameters to explain the cause behind the energy non-proportionality on our experimental platforms. Section 8 concludes the paper.
2 Related Work
We present an overview of single-objective optimization solution methods for performance or energy, followed by bi-objective optimization solution methods for both performance and energy on multicore CPU platforms. Energy models of computation complete the section.
2.1 Performance Optimization
There are three dominant approaches in this category. The first contains research works [22, 23] that propose contention-aware thread-level schedulers that try to minimize performance losses due to contention for on-chip shared resources.
The second category includes DRAM controller schedulers, which aim to efficiently utilize the shared DRAM memory system, and last-level cache partitioning, which physically partitions the shared resources to minimize contention. DRAM controller schedulers [24, 25] improve throughput by ordering threads and prioritizing their memory requests through the DRAM controllers. Last-level cache partitioners [26, 27] explicitly partition the cache when the default cache replacement policies (such as least-recently-used (LRU)) do not result in efficient execution of applications. These partitioners, however, must be used in conjunction with schedulers that mitigate contention for memory controllers and on-chip interconnects.
The final category includes research works that focus on thread-level schedulers that exploit data sharing between threads to co-schedule them [28, 29]. A key building block in these schedulers is performance models, based on PMCs, that can predict the performance loss due to co-scheduling or migrating threads between cores.
2.2 Energy Optimization
There are three important categories dealing with energy optimization on multicore CPU platforms. The software category contains research works that propose shared resource partitioners. The two hardware categories concern research works that employ dynamic voltage and frequency scaling (DVFS) and dynamic power management (DPM), and thermal management. Zhuravlev et al. [30] survey the prominent works in all three categories.
Research works [8, 9] propose dynamic reconfiguration of private caches and partitioning of shared caches (last level cache, for example) to reduce the energy consumption without hurting performance.
DVFS and DPM allow changing the frequencies of the cores and lowering their power states when they are idle. Considering the enormity of the literature in this category, we cover only works that take into account resource contention and thread-to-core mapping while employing DVFS. Kadayif et al. [31] exploit the heterogeneous nature of workloads executed by different processors to set their frequencies so as to reduce energy without impacting performance. Research works [32, 33] employ DVFS to reduce resource contention and energy consumption.
The main goal of thermal management algorithms is to find thread-to-core mappings (or even thread migrations) that remove drastic variations in temperature, or thermal hotspots, in the chip and at the same time reduce energy consumption without impacting performance. They employ as inputs thermal models built using temperature measurements provided by on-chip sensors [11, 12]. The algorithms are chiefly employed at the OS level.
Asymmetry-aware schedulers have been proposed for energy optimization on asymmetric multicore systems, which feature a mix of fast and slow cores, or high-power and low-power cores, that expose the same instruction-set architecture (ISA). Fedorova et al. [34] propose a system-level scheduler that assigns sequential phases of an application to fast cores and parallel phases to slow cores to maximize energy efficiency. Herbert et al. [35] employ DVFS to exploit core-to-core variations in power and performance arising from fabrication to improve the energy efficiency of the multicore platform.
2.3 Optimization for Performance and Energy
Das et al. [36] propose task mapping to optimize for energy and reliability on multiprocessor systems-on-chip (MPSoCs) with performance as a constraint. Sheikh et al. [37] propose a task scheduler employing evolutionary algorithms to optimize applications on multicore CPU platforms for performance, energy, and temperature. Abdi et al. [38] propose a multi-criteria optimization where they minimize the execution time under three constraints: reliability, power consumption, and peak temperature. DVFS is a key decision variable in all of these research works.

The following research works focus on application-level solution methods. Subramaniam et al. [39] use multi-variable regression to study the performance-energy trade-offs of the high-performance LINPACK (HPL) benchmark, using the number of threads and the number of processes as decision variables. Marszalkowski et al. [40] analyze the impact of memory hierarchies on the time-energy trade-off in parallel computations, which are represented as divisible loads. They represent execution time and energy by two linear functions of problem size, one for in-core computations and the other for out-of-core computations.
Research works [3, 5] propose data partitioning algorithms that solve single-objective optimization problems of data-parallel applications for performance or energy on homogeneous clusters of multicore CPUs. They take as input discrete performance and dynamic energy functions with no shape assumptions that accurately and realistically account for the resource contention and NUMA inherent in modern multicore CPU platforms. The research work [4] proposes a solution method to solve the bi-objective optimization problem of an application for performance and energy on homogeneous clusters of modern multicore CPUs. The authors demonstrate that the method gives a diverse set of globally Pareto-optimal solutions and that it can be combined with DVFS-based multi-objective optimization methods to give a better set of (globally Pareto-optimal) solutions. The methods target homogeneous HPC platforms. Chakraborti et al. [41] consider the effect of heterogeneous workload distribution on bi-objective optimization of data analytics applications by simulating heterogeneity on homogeneous clusters. The performance is represented by a linear function of problem size and the total energy is predicted using historical data tables. Khaleghzadeh et al. [21] propose a solution method solving the bi-objective optimization problem on heterogeneous processors, comprising two principal components. The first component is a data partitioning algorithm that takes as input discrete performance and dynamic energy functions with no shape assumptions. The second component is a novel methodology employed to build the discrete dynamic energy profiles of the individual computing devices, which are input to the algorithm.
2.4 Energy Predictive Models of Computation
Energy predictive models predominantly employ performance monitoring counters (PMCs) as parameters. Bellosa et al. [42] propose an energy model based on performance monitoring counters, such as integer operations, floating-point operations, and memory requests due to cache misses, that they believe strongly correlate with energy consumption. A linear model based on the utilization of the CPU, disk, and network is presented in [43]. A more complex power model (Mantis) [44] employs utilization metrics of the CPU, disk, and network components and hardware performance counters for memory as predictor variables.
Fan et al. [45] propose a simple linear model that correlates the power consumption of a single-core processor with its utilization. Bertran et al. [46] present a power model that provides a per-component power breakdown of a multicore CPU. Dargie et al. [47] use statistics of CPU utilization (instead of PMCs) to quantitatively model the relationship between the power consumption of a multicore processor and its workload. They demonstrate that the relationship is quadratic for a single-core processor and linear for multicore processors. Lastovetsky et al. [3] present an application-level energy model in which the dynamic energy consumption of a processor is represented by a discrete function of problem size, which is shown to be highly non-linear for data-parallel applications on modern multicore CPUs.
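A linear energy model of the kind surveyed above can be illustrated with an ordinary least-squares fit of dynamic energy against a single counter. This is a sketch only: real models of this class fit several PMCs against measured energy, and the sample data below is fabricated to be exactly linear:

```python
def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = a*x + b for one predictor,
    using the closed-form solution (no external libraries)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Hypothetical samples: a PMC reading (x, e.g. dTLB misses in millions)
# against measured dynamic energy in joules (exactly E = 10*x + 2 here).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [12.0, 22.0, 32.0, 42.0]
a, b = fit_linear(xs, ys)
```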
3 Multi-Objective Optimization: Background
A multi-objective optimization (MOP) problem may be defined as follows [48], [49]:

\[ \min \; \{ f_1(x), f_2(x), \ldots, f_k(x) \} \quad \text{subject to} \quad x \in S \]

where there are $k$ ($\geq 2$) objective functions $f_i : \mathbb{R}^n \rightarrow \mathbb{R}$. The objective is to minimize all the objective functions simultaneously. $F(x) = (f_1(x), f_2(x), \ldots, f_k(x))^T$ denotes the vector of objective functions. The decision (variable) vectors $x = (x_1, x_2, \ldots, x_n)^T$ belong to the (non-empty) feasible region (set) $S$, which is a subset of the decision variable space $\mathbb{R}^n$. We call the image of the feasible region, represented by $Z$ ($= f(S)$), the feasible objective region. It is a subset of the objective space $\mathbb{R}^k$. The elements of $Z$ are called objective (function) vectors or criterion vectors and denoted by $f(x)$ or $z = (z_1, z_2, \ldots, z_k)^T$, where $z_i = f_i(x)$ are the objective (function) values or criterion values.

If there is no conflict between the objective functions, then a solution can be found where every objective function attains its optimum [49].
However, in real-life multi-objective optimization problems, the objective functions are at least partly conflicting. Because of this conflicting nature of the objective functions, it is not possible to find a single solution that is optimal for all the objectives simultaneously. In multi-objective optimization, there is no natural ordering in the objective space because it is only partially ordered. Therefore, the concept of optimality must be treated differently from that in single-objective optimization. The generally used concept is Pareto-optimality.
Definition 1.
A decision vector $x^* \in S$ is Pareto-optimal if there does not exist another decision vector $x \in S$ such that $f_i(x) \leq f_i(x^*)$ for all $i = 1, \ldots, k$ and $f_j(x) < f_j(x^*)$ for at least one index $j$ [48].
An objective vector $z^* \in Z$ is Pareto-optimal if there does not exist another objective vector $z \in Z$ such that $z_i \leq z_i^*$ for all $i = 1, \ldots, k$ and $z_j < z_j^*$ for at least one index $j$.
Definition 2.
A decision vector $x^* \in S$ is weakly Pareto-optimal if there does not exist another decision vector $x \in S$ such that $f_i(x) < f_i(x^*)$ for all $i = 1, \ldots, k$ [48].
An objective vector $z^* \in Z$ is weakly Pareto-optimal if there does not exist another objective vector $z \in Z$ for which all the component objective values are strictly better.
Mathematically speaking, every Pareto-optimal point is an equally acceptable solution of the multi-objective optimization problem. Therefore, user preference relations (or the preferences of a decision maker) are provided as input to the solution process to select one or more points from the set of Pareto-optimal solutions [48].
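For a minimization problem with two objectives, Definition 1 translates directly into a dominance filter. A minimal sketch (the function name is ours; it keeps a point only if no other point is at least as good in both objectives and strictly better in at least one):

```python
def pareto_front(solutions):
    """Return the Pareto-optimal subset of (time, energy) pairs,
    both objectives to be minimized. A point s is dominated if some
    other point o satisfies o <= s componentwise with o != s."""
    front = []
    for s in solutions:
        dominated = any(
            o[0] <= s[0] and o[1] <= s[1] and o != s
            for o in solutions
        )
        if not dominated:
            front.append(s)
    return front

# Hypothetical objective vectors: (3, 7) is dominated by (2, 5),
# while the other three points are mutually non-dominated.
front = pareto_front([(1, 9), (2, 5), (3, 7), (4, 4)])
```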
In Figure 2, a feasible region $S$ and its image, the feasible objective region $Z$ ($= f(S)$), are shown. The thick blue line in the figure showing the objective space contains all the Pareto-optimal objective vectors. The vector $z^*$ is one of them.
In this work, we consider bi-objective optimization where performance and dynamic energy are the objectives.
4 Solution Method Solving the Bi-objective Optimization Problem on a Single Multicore CPU
In this section, we describe our solution method, BOPPETG, for solving the bi-objective optimization problem of a multithreaded data-parallel application on multicore CPUs for performance and energy (BOPPE). The method uses two decision variables: the number of identical multithreaded kernels (thread-groups) and the number of threads in each thread-group. A given workload is always partitioned equally between the thread-groups.
The bi-objective optimization problem (BOPPE) can be formulated as follows: given a multithreaded data-parallel application of workload size $n$ and a multicore CPU of $c$ cores, the problem is to find the globally Pareto-optimal front of solutions optimizing the execution time and dynamic energy consumption during the parallel execution of the workload. Each solution is an application configuration given by (number of thread-groups, number of threads per group).
The inputs to the solution method are the workload size of the multithreaded data-parallel application, $n$; the number of cores in the multicore CPU, $c$; the multithreaded base kernel, $K$; and the base power of the multicore CPU platform, $P_b$. The outputs are the globally Pareto-optimal front of objective solutions, $P$, and the optimal application configurations corresponding to these solutions, $C$. Each Pareto-optimal solution of objectives is represented by the pair $(t_{exec}, e_{dyn})$, where $t_{exec}$ is the execution time and $e_{dyn}$ is the dynamic energy. Associated with this solution is an array of application configurations, $C$, containing decision variable pairs $(g, t)$, where $g$ represents the number of thread-groups, each containing $t$ threads.
The main steps of BOPPETG are as follows:
Step 1. Parallel implementation allowing (,) configuration: Design and implement a dataparallel version of the base kernel and that can be executed using identical multithreaded kernels in parallel. Each kernel is executed by a threadgroup containing threads. The workload is divided equally between the threadgroups during the execution of the dataparallel version. The dataparallel version should essentially allow its runtime configuration using number of threadgroups and number of threads per group with the workload equally partitioned between the threadgroups.
Step 2. Initialize and : All the runtime configurations, (,), where the product, , is less than or equal to the total number of cores () in the multicore platform are considered. , . Go to Step 3.
Step 3. Determine time and dynamic energy of the (,) configuration of the application: The dataparallel version composed in Step 1 is run using the (,) configuration. Its execution time and dynamic energy consumption are determined as follows: , , where and are the starting and ending execution times and is the total energy consumption during the execution of the application. Go to Step 4.
Step 4. Update Paretooptimal front for (g, t): The solution, if Paretooptimal, is added to the globally Paretooptimal set of objective solutions, and the existing member solutions of the set that are inferior to it are removed. The optimal application configurations corresponding to the solution are stored. Go to Step 5.
Step 5. Test and increment (g, t): If the configuration (g, t + 1) still satisfies the constraint of Step 2, set t = t + 1 and go to Step 3. Otherwise, set g = g + 1 and t = 1. If (g, t) still satisfies the constraint, go to Step 3. Else return the globally Paretooptimal front and the optimal application configurations, and quit.
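The search loop of Steps 2–5 can be sketched as follows. The `measure` stub is a contrived stand-in for Step 3 (in the real method these values come from running the (g, t) configuration and reading the power meter) and is not part of the method itself:

```c
/* Sketch of the BOPPETG search over (g, t) configurations with a
   Pareto-front update; the measurement function is a hypothetical stub. */
#include <assert.h>

typedef struct { int g, t; double time, energy; } Sol;

/* Hypothetical stand-in for Step 3: contrived cost model, not real data. */
static void measure(int g, int t, double *time, double *energy) {
    int threads = g * t;
    *time = 100.0 / threads + 0.05 * g;
    *energy = 2.0 * threads + 30.0 / g;
}

/* a dominates b: no worse in both objectives, strictly better in one. */
static int dominates(const Sol *a, const Sol *b) {
    return a->time <= b->time && a->energy <= b->energy &&
           (a->time < b->time || a->energy < b->energy);
}

/* Step 4: insert s if no member dominates it; drop members it dominates. */
static int pareto_insert(Sol *front, int n, Sol s) {
    for (int i = 0; i < n; i++)
        if (dominates(&front[i], &s)) return n;   /* s is inferior */
    int m = 0;
    for (int i = 0; i < n; i++)
        if (!dominates(&s, &front[i])) front[m++] = front[i];
    front[m++] = s;
    return m;
}

/* Steps 2 and 5: enumerate all (g, t) with g * t <= cores. */
int boppetg(int cores, Sol *front) {
    int n = 0;
    for (int g = 1; g <= cores; g++)
        for (int t = 1; g * t <= cores; t++) {
            Sol s = { g, t, 0.0, 0.0 };
            measure(g, t, &s.time, &s.energy);   /* Step 3 */
            n = pareto_insert(front, n, s);      /* Step 4 */
        }
    return n;
}
```

The front returned by this loop is, by construction, mutually non-dominated: every configuration that satisfies the constraint of Step 2 has been compared against it.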
In the following section, we illustrate the first step of BOPPETG for two applications, matrixmatrix multiplication and 2D fast Fourier transform. We show in particular how BOPPETG can reuse highly optimized scientific kernels with careful design and development of parallel versions of the application.
5 Parallel MatrixMatrix Multiplication
We illustrate the first step of our solution method (BOPPETG) for implementing the dataparallel version of dense matrixmatrix multiplication (PMMTG).
The PMMTG application computes the matrix product, C = A × B, of two dense square matrices A and B of size N × N. The application is executed using p threadgroups. To simplify the exposition of the algorithms, we assume N to be divisible by p.
There are three parallel algorithmic variants of PMMTG. In PMMTGV, the matrices B and C are partitioned vertically such that each threadgroup is assigned N/p of the columns of B and C, as shown in Figure (a). Each threadgroup computes its vertical partition of C using the matrix product of A and its partition of B. In PMMTGH, the matrices A and C are partitioned horizontally such that each threadgroup is assigned N/p of the rows of A and C, as shown in Figure (b). Each threadgroup computes its horizontal partition of C using the matrix product of its partition of A and B. In PMMTGS, the threadgroups are arranged in a square grid of size √p × √p. The matrices A, B, and C are partitioned into equal squares among the threadgroups, as shown in Figure (c). In each matrix, each threadgroup is assigned a submatrix of size (N/√p) × (N/√p) and computes its square partition of C as the sum of products of the square blocks in its row of A and its column of B, where the square block of a matrix is identified by its position in the grid.
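As an illustration of the PMMTGH decomposition, the following sketch partitions the rows of A and C between p groups. The naive triple loop is a hypothetical stand-in for the optimized multithreaded DGEMM kernel each threadgroup invokes in the paper, and the groups are shown running sequentially for brevity:

```c
/* Sketch of the PMMTG-H decomposition: each of p threadgroups owns N/p
   consecutive rows of A and C. The naive triple loop stands in for the
   optimized BLAS DGEMM kernel; in the real implementation the groups run
   concurrently, each with t threads. */
#include <assert.h>
#include <math.h>

/* Compute `rows` rows of C = A * B starting at row `row0`. */
static void group_dgemm(const double *A, const double *B, double *C,
                        int N, int row0, int rows) {
    for (int i = row0; i < row0 + rows; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += A[i * N + k] * B[k * N + j];
            C[i * N + j] = s;
        }
}

void pmmtg_h(const double *A, const double *B, double *C, int N, int p) {
    int rows = N / p;                 /* N assumed divisible by p */
    for (int g = 0; g < p; g++)       /* each iteration = one threadgroup */
        group_dgemm(A, B, C, N, g * rows, rows);
}
```

Because the row blocks of C are disjoint, the groups never write to the same memory, which is what allows them to run as independent kernels.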
5.1 Implementation of PMMTGH Based on OpenBLAS DGEMM
We describe an OpenBLAS implementation of PMMTGH (Figure 5) here. The implementations of the other PMMTG algorithms employing Intel MKL and OpenBLAS are described in the supplemental.
The inputs to an implementation are: a). Matrices A, B, and C of size N × N; b). The DGEMM constants α and β; c). The number of threadgroups, g; d). The number of threads in each threadgroup, t. The output matrix, C, contains the matrix product.
The horizontal partitions of A and C assigned to the threadgroups are initialized in Lines 24–34. Then g pthreads representing the threadgroups are created, each executing a multithreaded OpenBLAS DGEMM kernel with t OpenMP threads (Lines 36–43). The threadgroups compute the matrixmatrix product (Lines 1–20). The result is gathered in the matrix C (Lines 45–56).
The implementations using Intel MKL differ from those using OpenBLAS. In Intel MKL, the matrixmatrix computation by a threadgroup is performed using an OpenMP parallel region with t threads, whereas in OpenBLAS it is performed using a pthread.
6 Parallel 2D Fast Fourier Transform
We present here the first step of our solution method (BOPPETG) to compose the dataparallel version of 2D fast Fourier transform (PFFTTG). The sequential 2D FFT algorithm is described first, followed by the two parallel algorithmic variants of 2D fast Fourier transform.
The 2DDFT of a twodimensional N × N point discrete signal M is defined as follows:

Y[k][l] = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} M[m][n] · e^{−(2πi/N)·(km+ln)}, 0 ≤ k, l < N

M is the signal matrix where each element is a complex number. The total number of complex multiplications required to compute the 2DDFT directly from this definition is O(N⁴).
The sequential rowcolumn decomposition method reduces this complexity by computing the 2DDFT using a series of 1DDFTs, which are implemented using a fast 1DFFT algorithm. The method consists of two phases called the rowtransform phase and the columntransform phase. Figure 4 depicts the method, which is mathematically summarized below:

Y[k][l] = Σ_{m=0}^{N−1} e^{−(2πi/N)·km} · ( Σ_{n=0}^{N−1} M[m][n] · e^{−(2πi/N)·ln} )

The inner sums constitute the rowtransform phase and the outer sums the columntransform phase.
It computes a series of N ordered 1DFFTs of size N on the N rows; that is, each row i (of length N) is transformed via a fast 1DFFT. The total cost of this rowtransform phase is O(N²·log₂N). Then, it computes a series of N ordered 1DFFTs on the N columns of the rowtransformed matrix, transforming each column in the same manner. The total cost of this columntransform phase is also O(N²·log₂N). Thus, by using the rowcolumn decomposition method, the complexity of the 2DFFT is reduced from O(N⁴) to O(N²·log₂N). All the FFTs that we discuss in this work are considered inplace.
The PFFTTG application employing our solution method computes the 2DDFT of the signal matrix of size N × N using p threadgroups. It is based on the sequential 2DFFT rowcolumn decomposition method. There are two parallel algorithmic variants of PFFTTG, PFFTTGH and PFFTTGV. To simplify the exposition of the algorithms, we assume N to be divisible by p.
6.1 PFFTTGH: Using Horizontal Decomposition of Signal Matrix
The parallel 2DFFT algorithm, PFFTTGH, consists of four steps:
Step 1. 1DFFTs on rows: Threadgroup i, 0 ≤ i < p, executes N/p sequential 1DFFTs on rows i·(N/p), …, (i+1)·(N/p) − 1.
Step 2. Matrix Transposition: Transpose the matrix M.
Step 3. 1DFFTs on rows: Threadgroup i executes N/p sequential 1DFFTs on rows i·(N/p), …, (i+1)·(N/p) − 1 of the transposed matrix.
Step 4. Matrix Transposition: Transpose the matrix M.
The computational complexity of Steps 1 and 3 is O((N²/p)·log₂N) per threadgroup. The computational complexity of Steps 2 and 4 is O(N²). Therefore, the total computational complexity of PFFTTGH is O((N²/p)·log₂N + N²).
The algorithm is illustrated in Figure 4.
6.2 PFFTTGV: Using Vertical Decomposition of Signal Matrix
The parallel 2DFFT algorithm, PFFTTGV, consists of four steps:
Step 1. 1DFFTs on columns: Threadgroup i, 0 ≤ i < p, executes N/p sequential 1DFFTs on columns i·(N/p), …, (i+1)·(N/p) − 1.
Step 2. Matrix Transposition: Transpose the matrix M.
Step 3. 1DFFTs on columns: Threadgroup i executes N/p sequential 1DFFTs on columns i·(N/p), …, (i+1)·(N/p) − 1 of the transposed matrix.
Step 4. Matrix Transposition: Transpose the matrix M.
The computational complexity of Steps 1 and 3 is O((N²/p)·log₂N) per threadgroup. The computational complexity of Steps 2 and 4 is O(N²). Therefore, the total computational complexity of PFFTTGV is O((N²/p)·log₂N + N²).
The algorithm is illustrated in Figure 4.
6.3 Implementation of PFFTTGH Based on FFTW
Figure 6 illustrates the FFTW implementation of PFFTTGH. The shared memory implementations of other PFFTTG algorithms based on Intel MKL and FFTW are described in the supplemental.
The inputs to an implementation are: a). Signal matrix M of size N × N; b). The number of threadgroups, g; c). The number of threads in each threadgroup, t. The output is the transformed signal matrix M (considering that we are performing an inplace FFT).
Lines 17–18 show the initialization of the FFTW multithreaded runtime. Lines 19–25 show the creation of the FFT plans, each plan executed by a threadgroup of t threads. Lines 1–11 illustrate the creation of a plan using the fftw_plan_many_dft routine. Lines 26–39 show the execution and destruction of the plans (1DFFTs on rows) by the threadgroups. This is followed by the transpose of the signal matrix (Line 40). Lines 41–46 contain the creation of the FFT plans (1DFFTs on rows) followed by their execution by the threadgroups. Finally, the signal matrix is transposed again (Line 61). The FFTW runtime is then destroyed (Line 62).
The implementations based on Intel MKL differ from those employing FFTW. In FFTW, only plan execution (fftw_execute) and plan destruction (fftw_destroy_plan) are threadsafe and can be called in an OpenMP parallel region, whereas plan creation is not.
7 Experimental Results and Discussion
In this section, we present our experimental results for matrixmatrix multiplication (PMMTG) and 2D fast Fourier transform (PFFTTG) employing our solution method.
To make sure the experimental results are reliable, we follow a statistical methodology described in the supplemental. Briefly, for every data point in the functions, the automation software executes the application repeatedly until the sample mean lies in the 95% confidence interval with a precision of 0.025 (2.5%). For this purpose, Student's t-test is used, assuming that the individual observations are independent and their population follows the normal distribution. The validity of these assumptions is verified by plotting the distributions of observations and using Pearson's chi-squared test. The speed/time/energy values shown in the graphical plots are the sample means.
Four multicore CPUs shown in the Table IV and described earlier are used in the experiments. Three platforms {S1, S2, S4} have a power meter installed between their input power sockets and the wall A/C outlets. S1 and S2 are connected with a Watts Up Pro power meter; S4 is connected with a Yokogawa WT310 power meter. S3 is not equipped with a power meter and therefore is not employed in the experiments for singleobjective optimization for energy and biobjective optimization for performance and energy.
The power meter provides the total power consumption of the server. It has a data cable connected to one USB port of the server. A script written in Perl collects the data from the power meter using the serial USB interface. The execution of the script is nonintrusive and consumes insignificant power. WattsUp Pro power meters are periodically calibrated using the ANSI C12.20 revenuegrade power meter, Yokogawa WT310. The maximum sampling speed of WattsUp Pro power meters is one sample every second. The accuracy specified in the datasheets is . The minimum measurable power is 0.5 watts. The accuracy at 0.5 watts is watts. The accuracy of Yokogawa WT310 is 0.1% and the sampling rate is 100k samples per second.
HCLWattsUp API [50] is used to gather the readings from the power meter to determine the dynamic energy consumption during the execution of PMMTG and PFFTTG applications. HCLWattsUp has no extra overhead and therefore does not influence the energy consumption of the application execution.
Fans are significant contributors to energy consumption. On our platform, fans are controlled in two zones: a) zone 0: CPU or System fans, b) zone 1: Peripheral zone fans. There are 4 levels to control the speed of fans:

Standard: BMC control of both fan zones, with CPU zone based on CPU temp (target speed 50%) and Peripheral zone based on PCH temp (target speed 50%)

Optimal: BMC control of the CPU zone (target speed 30%), with Peripheral zone fixed at low speed (fixed 30%)

Heavy IO: BMC control of CPU zone (target speed 50%), Peripheral zone fixed at 75%

Full: all fans running at 100%
To rule out the contribution of fans to dynamic energy consumption, we set the fans at full speed before executing the applications. When set at full speed, the fans run constantly at full rpm until they are set to a different speed level. In this way, energy consumption due to fans is included only in the static power consumption of the platform. The temperature of our platform and the speeds of the fans (with the Full setting) are monitored with the help of Intelligent Platform Management Interface (IPMI) sensors, both with and without the application running. An insignificant difference in the speeds of the fans is found in both scenarios.
7.1 Parallel MatrixMatrix Multiplication Using OpenBLAS DGEMM and Intel MKL DGEMM
7.1.1 Performance Optimization on a Single Socket Multicore CPU
Figure 7 shows the execution times of PMMTG using OpenBLAS DGEMM for different threadgroup combinations on a singlesocket CPU (S1). The base version corresponds to the application configuration employing one threadgroup with the optimal number of threads, which is 44 threads. The best combination is (g, t) = (22, 1) for all three workload sizes. It outperforms the base combination by 20% for N=29696 and N=35328, and by about 11% for N=30720. Furthermore, the average performance improvement over the base combination for the 41 tested workload sizes is 7%. The starting problem size of 5120 is chosen to ensure that the workload size exceeds the last level cache.
Figure 8 shows the execution times of PMMTG using Intel MKL DGEMM. The best combinations (g, t) are {(4,11), (2,22)} for all three workload sizes. They outperform the base combination by 6%. The average performance improvement over the base combination for the 21 tested workload sizes is 5%.
7.1.2 Performance Optimization on a Dualsocket Multicore CPU
Figure 9 shows the comparison between the base and best combinations for OpenBLAS DGEMM and Intel MKL DGEMM on S3. The base version corresponds to the application configuration employing one threadgroup with the optimal number of threads.
Unlike the base version, the best combinations for OpenBLAS DGEMM and Intel MKL DGEMM do not exhibit any performance variations (drops). The best combination for Intel MKL DGEMM is 18 threadgroups with 2 threads each. It outperforms the base version by 8% on average and the next best combination, 12 threadgroups with 2 threads each, by 2.5%. Our solution method removed noticeable drops in performance for workload sizes 16384, 20480, and 24576, with performance improvements of 36.5%, 14.5%, and 21.5%, respectively.
7.1.3 Energy Optimization on a Single Socket Multicore CPU
Figure 10 shows the dynamic energy consumption of PMMTG using OpenBLAS DGEMM for different threadgroup combinations on a singlesocket CPU (S1). The base version corresponds to the application configuration employing one threadgroup with the optimal number of threads, which is 44 threads. The best combination for sizes N=29696 and N=30720 is (g, t) = (22, 1). It outperforms the base combination by 20%. The best combination for N=35328 is (g, t) = (1, 22), which outperforms the base combination by 23%. Furthermore, the average improvement (or energy savings) over the base combination for the 41 tested workload sizes is 8%.
Figure 11 shows the dynamic energy consumption of PMMTG using Intel MKL DGEMM. There are three best combinations for each problem size, (g, t) = {(11,4), (22,2), (44,1)}. They outperform the base combination by 35%. Furthermore, the average improvement over the base combination for the 21 tested workload sizes is 35.7%.
7.1.4 Energy Optimization on a Dualsocket Multicore CPU
Figure 12 shows the results for PMMTG based on OpenBLAS DGEMM on S2 with three different workload sizes. There are four best combinations minimizing the dynamic energy consumption for workload size 16384, (g, t) = {(2,24), (3,16), (6,8), (24,2)}. The energy savings for these combinations compared with the best base combination, (g, t) = (1, 24), are around 21%. For the workload sizes 17408 and 18432, the best combinations are (12,4) and (4,12), respectively. The energy savings in comparison with the best base combinations, (g, t) = (1, 24) for 17408 and (g, t) = (1, 44) for 18432, are 15% and 18%, respectively. Furthermore, the average improvement over the best base combination for the 19 tested workload sizes is 10%.
Figure 13 shows the results for PMMTG based on Intel MKL DGEMM on S2. The best combination minimizing the dynamic energy consumption for workload size 28672 involves 12 threadgroups with 2 threads each. The energy saving for this combination compared with the best base combination, (1, 24), is 10.5%. For the workload sizes 30720 and 31616, the best combinations are (12,4) and (12,2), respectively. The energy savings in comparison with the best base combination are 4% and 7%. Furthermore, the average improvement over the best base combination for the 19 tested workload sizes is 13%.
7.2 Parallel 2D Fast Fourier Transform Using FFTW and Intel MKL FFT
In this section, we use 2D fast Fourier transform routines from two packages, FFTW 3.3.7 and Intel MKL. The packages are installed with multithreading, SSE/SSE2, AVX2, and FMA (fused multiplyadd) optimizations enabled. For Intel MKL FFT, no special environment variables are used. Three planner flags, {FFTW_ESTIMATE, FFTW_MEASURE, FFTW_PATIENT}, were tested. The execution times for the flags {FFTW_MEASURE, FFTW_PATIENT} are high compared to those for FFTW_ESTIMATE. The long execution times are due to the lengthy times needed to create the plans: FFTW_MEASURE tries to find an optimized plan by computing many FFTs, whereas FFTW_PATIENT considers a wider range of algorithms to find a more optimal plan.
7.2.1 Performance Optimization on a Single Socket Multicore CPU
Figure 14 shows the results for PFFTTG employing FFTW on a singlesocket CPU (S1). The best combination, (g, t) = (4, 11), is the same for workload sizes N=31936 and N=32704. The improvements over the base combination, (g, t) = (1, 44), are 55% and 57%, respectively. For matrix dimension N=35648, the base combination is the best and outperforms the next best combination, (g, t) = (2, 22), by 5%.
Figure 15 shows the results for PFFTTG employing Intel MKL FFT. There are three best combinations, (g, t) = {(2,22), (2,11), (4,11)}, for all three workload sizes, whose performances differ from each other by less than 5%. Their improvement over the base combination, (g, t) = (1, 44), for N=18432 is 8%. For workload sizes N=30720 and N=31616, the performance improvements are 25% and 26%, respectively. Furthermore, the average performance improvement over the best base combination for the 23 tested workload sizes is 27%.
7.2.2 Performance Optimization on Dualsocket Multicore CPUs
All results in this section are represented by 3D surfaces with axes for performance or energy, the number of threadgroups (g), and the number of threads in each threadgroup (t). The location of the minimum in each surface is shown by a red dot.
Figure (a) shows the results of PFFTTG using FFTW 3.3.7 on S4 for matrix dimension N=30976. The area with minimum execution time is located in the region containing {4,7,8} threadgroups with 10 threads in each group. The minimum is achieved for the combination (g, t) = (7, 10) with an execution time of 8 seconds. The speedup is around 100% in comparison with the best combination of threads for one group, (g, t) = (1, 10), whose execution time is 16 seconds.
Figure (b) presents the results of PFFTTG using FFTW 3.3.7 on S3 for matrix dimension N=17728. The minimum is centred around threadgroup counts of {4,7,8}. The minimum is achieved for the combination (g, t) = (4, 16). The performance improvement is 80% in comparison with (g, t) = (1, 72), which is the best combination for one group.
7.2.3 Energy Optimization on a Single Socket Multicore CPU
Figure 17 shows the dynamic energy comparison for PFFTTG employing FFTW between the base and best combinations for workload sizes 31936, 32704, and 35648 on a singlesocket CPU (S1). The best combination, (g, t) = (4, 11), is the same for workload sizes 31936 and 32704. The reductions in dynamic energy consumption in comparison with the base combination, (g, t) = (1, 44), are 41% and 65%, respectively. For workload size 35648, the base combination is the best and outperforms the next best combination, (g, t) = (2, 22), by 5%. For Intel MKL FFT, the base combination, (g, t) = (1, 44), is the best.
7.2.4 Energy Optimization on a Dualsocket Multicore CPU
Figures (a) and (b) show the results for PFFTTG employing FFTW on S4 for matrix sizes N=30464 and N=32192. The minimum for dynamic energy is located at {4,7,8} threadgroups, with 14 threads in each threadgroup for workload size N=32192 and 12 threads in each threadgroup for workload size N=30464. The minimum for the workload size 30464 is achieved for the combination (g, t) = (8, 12). The dynamic energy consumption for this combination is 661 Joules. The energy saving is around 30% in comparison with the best combination of threads for one group, (g, t) = (1, 45), whose dynamic energy consumption is 918 Joules. The minimum for the workload size N=32192 is achieved for the combination (g, t) = (4, 14). The saving is around 35% in comparison with (g, t) = (1, 16), where the dynamic energy is 2197 Joules.
7.3 BiObjective Optimization for Performance and Dynamic Energy
7.3.1 Single Socket Multicore CPU
Figure (a) shows the globally Paretooptimal front for PMMTG employing Intel MKL DGEMM on S1 for workload size 32768. Optimizing for dynamic energy consumption alone degrades performance by 27%, and optimizing for performance alone increases dynamic energy consumption by 30%. The average and maximum sizes of the Paretooptimal fronts for Intel MKL DGEMM are 2.3 and 3, respectively.
Figure (b) shows the globally Paretooptimal front for PFFTTG based on Intel MKL FFT on S1 for workload size 31744. There are two globally Paretooptimal solutions. Optimizing for dynamic energy consumption alone degrades performance by around 31%, and optimizing for performance alone increases dynamic energy consumption by 87%. The average and maximum sizes of the Paretooptimal fronts for Intel MKL FFT are 2.6 and 3, respectively.
No biobjective tradeoffs were observed for the FFTW and OpenBLAS applications. We will investigate two lines of research in our future work: one is the influence of workload distribution; the other is to explain, using a dynamic energy predictive model, the absence of biobjective tradeoffs for opensource packages such as FFTW and OpenBLAS.
7.3.2 Dualsocket Multicore CPUs
In this section, we will focus on biobjective optimization on dualsocket CPUs, S2 and S4.
Figure (a) shows the globally Paretooptimal front for PFFTTG employing FFTW on S4 for workload size N=30464. The maximum number of globally Paretooptimal solutions is 11. Optimizing for dynamic energy consumption alone degrades performance by 49%, and optimizing for performance alone increases dynamic energy consumption by 35%.
Figure (b) shows the globally Paretooptimal front for PFFTTG employing Intel MKL FFT on S2 for workload size N=22208. Optimizing for dynamic energy consumption alone degrades performance by 33%, and optimizing for performance alone increases dynamic energy consumption by 10%. The average and maximum sizes of the Paretooptimal fronts for FFTW and Intel MKL FFT are (3, 11) and (2.7, 3), respectively.
Figure (a) shows the globally Paretooptimal front for PMMTG employing Intel MKL DGEMM on S2 for workload size N=17408. Optimizing for dynamic energy consumption alone degrades performance by 5.5%, and optimizing for performance alone increases dynamic energy consumption by 50.7%. The average and maximum sizes of the Paretooptimal fronts are 1.8 and 4, respectively.
Figure (b) shows the globally Paretooptimal front for PMMTG based on OpenBLAS DGEMM on S2 for workload size N=17408. There are six globally Paretooptimal solutions. Optimizing for dynamic energy consumption alone degrades performance by around 5%, and optimizing for performance alone increases dynamic energy consumption by 20%. The average and maximum sizes of the Paretooptimal fronts are 2.4 and 5.
The execution time of building the fourdimensional discrete graph of the two objectives (performance and dynamic energy) against the two decision variables can be costprohibitive for its employment in dynamic schedulers and selfadaptable dataparallel applications. We will explore approaches to reduce this time in our future work.
7.4 Analysis Using Performance and Dynamic Energy Models
In this section, we propose a qualitative dynamic energy model employing performance monitoring counters (PMCs) as parameters. The model reveals the cause behind the energy nonproportionality in modern multicore CPUs. The model along with the execution time of the application is used to analyze the Paretooptimal front determined by our solution method on a dualsocket multicore platform.
PMCs are specialpurpose registers provided in modern microprocessors to store the counts of software and hardware activities. The term PMCs covers software events, which are pure kernellevel counters such as pagefaults and contextswitches, as well as microarchitectural events originating from the processor and its performance monitoring unit, called hardware events, such as cachemisses and branchinstructions. Software energy predictive modeling based on PMCs is one of the leading methods of measuring the energy consumption of an application [51].
The experimental platform S2 and the application OpenBLAS DGEMM are employed for the analysis. The Likwid tool [52] is used to obtain the PMCs. On this platform, it offers 164 PMCs, which are divided into 28 groups (L2CACHE, L3CACHE, NUMA, etc.). The groups are listed in the supplemental. All the PMCs for each workload size executed using different application configurations, (#threadgroups (g), #threads_per_group (t)), are collected. Each PMC value is the average over all 24 physical cores. We analyzed the data to identify the performance groups that are most highly correlated with the dynamic energy consumption. The highest correlation is contained in the data provided by the TLB_DATA performance group. This group provides data activity, such as the load miss rate, the store miss rate, and the page walk duration, in the L1 data translation lookaside buffer (dTLB), a small specialized cache of recent page address translations. If a dTLB miss occurs, the page tables are walked. If there is a miss from the page walk, a page fault occurs, resulting in the OS retrieving the corresponding page from memory. The duration of the page walk has the highest positive correlation with dynamic energy consumption based on our experiments.
Table II. PMC data for the application configurations of workload size 16384, sorted in increasing order of execution time.

Combination (g, t) | Dynamic Energy (J) | Time (sec) | L1 dTLB load miss duration (Cyc) | L1 dTLB store miss duration (Cyc)
(1,48) | 824.2743 | 14.112 | 108.373 | 124.326
(4,12) | 740.0211 | 14.177 | 113.515 | 105.363
(8,6) | 729.1005 | 14.244 | 104.564 | 89.3753
(2,24) | 802.6687 | 14.314 | 105.328 | 82.5185
(16,3) | 750.6159 | 14.615 | 100.924 | 90.2733
(3,16) | 631.3098 | 14.772 | 97.9180 | 76.1889
(6,8) | 667.4856 | 14.818 | 96.8957 | 58.0210
(12,4) | 528.0411 | 15.057 | 97.0492 | 52.8966
(24,2) | 1352.141 | 15.875 | 100.106 | 82.7514
(48,1) | 1719.012 | 18.685 | 111.902 | 85.9282
Table III. PMC data for the application configurations of workload size 17408, sorted in increasing order of execution time.

Combination (g, t) | Dynamic Energy (J) | Time (sec) | L1 dTLB load miss duration (Cyc) | L1 dTLB store miss duration (Cyc)
(4,12) | 1320.0702 | 16.2478 | 105.961 | 122.191
(1,48) | 1271.5506 | 16.3034 | 99.5398 | 63.7090
(8,6) | 1266.3294 | 16.3166 | 95.7896 | 58.9096
(2,24) | 1287.6882 | 16.4498 | 98.2180 | 74.6859
(16,3) | 1250.5616 | 16.6824 | 95.2988 | 58.3551
(6,8) | 1130.2412 | 16.9668 | 93.4336 | 47.9097
(3,16) | 1052.0283 | 17.0187 | 90.5275 | 45.7483
(24,2) | 1824.5795 | 18.0755 | 106.804 | 55.5686
(12,4) | 1795.7680 | 20.5520 | 93.6595 | 46.5541
(48,1) | 2164.1212 | 20.9868 | 96.6999 | 71.4943
Nonnegative multivariate regression is employed to construct our model of dynamic energy consumption based on the PMC data from the dTLB. The model is shown below:

E_D = β₀ + β₁ × (u × T) + β₂ × T_pwl + β₃ × T_pws     (1)

where u is the average CPU utilization, T is the execution time of the application, T_pwl is the time of the page walk caused by load misses in the dTLB, T_pws is the time of the page walk caused by store misses in the dTLB, and β₀, β₁, β₂, β₃ are the regression coefficients for the PMC data. The coefficients of the model are forced to be nonnegative to avoid erroneous cases where large negative values would give rise to negative predicted dynamic energy consumption, violating the fundamental energy conservation law of computing.
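Fitting such nonnegative coefficients can be sketched with projected gradient descent on synthetic data; the feature layout and the values below are hypothetical, not the measured PMC data:

```c
/* Minimal sketch of nonnegative multivariate regression via projected
   gradient descent: step along the negative gradient of the squared
   error, then clip coefficients at zero. Illustrative stand-in. */
#include <assert.h>
#include <math.h>

#define N_OBS 6
#define N_FEA 3   /* e.g. u*T, load page-walk time, store page-walk time */

void nnls_fit(const double X[N_OBS][N_FEA], const double y[N_OBS],
              double beta[N_FEA], int iters, double lr) {
    for (int it = 0; it < iters; it++) {
        double grad[N_FEA] = {0};
        for (int i = 0; i < N_OBS; i++) {
            double pred = 0.0;
            for (int j = 0; j < N_FEA; j++) pred += X[i][j] * beta[j];
            double err = pred - y[i];
            for (int j = 0; j < N_FEA; j++) grad[j] += 2.0 * err * X[i][j];
        }
        for (int j = 0; j < N_FEA; j++) {
            beta[j] -= lr * grad[j] / N_OBS;
            if (beta[j] < 0.0) beta[j] = 0.0;  /* nonnegativity projection */
        }
    }
}
```

The projection step is what enforces the constraint discussed above: a coefficient that would go negative is clamped to zero instead of contributing a negative energy term.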
To test this model, we use two workload sizes, 16384 and 17408. The PMC data obtained for these sizes and used to train the model is shown in Tables II and III. The rows of the tables are sorted in increasing order of execution time. The blue colour in the tables marks the rows that are in the Paretooptimal front. The time of the page walk (the last two columns, 4 and 5) is measured in cycles. As can be seen from the tables, the dynamic energy decreases as the number of cycles decreases. There is, however, a tradeoff between the execution time of the application and the page walk time. For a Paretooptimal solution, a longer execution time corresponds to a smaller number of load and store cycles and thereby less dynamic energy consumption.
Two dynamic energy models, one for workload size 16384 (Table II) and one for workload size 17408 (Table III), were constructed, each with its own fitted coefficients. We then predict the dynamic energy consumption using the models and compare it with the dynamic energy measured using HCLWattsUp. Figures (a) and (b) illustrate the comparison. The xaxis represents the row number in Tables II and III. The modeled dynamic energy demonstrates the same trend as the dynamic energy measured using HCLWattsUp.
TLB activity has been the focus of research in [53, 54, 55], where the authors state that address translation using the TLB consumes as much as 16% of the chip power on some processors. The authors propose different strategies to improve the reuse of TLB caches. Our solution method employing threadgroups (or grouping using multithreaded kernels) fills the page tables more evenly and reduces the duration of the page walk, resulting in less dynamic energy consumption.
To summarize, our proposed dynamic energy model based on parameters reflecting TLB activity (the duration of the page walk) shows that the energy nonproportionality on our experimental platforms for the dataparallel applications is due to the activity of the data translation lookaside buffer (dTLB), which is disproportionately energy expensive. This finding may encourage chip design architects to investigate and remove the nonproportionality in these platforms. There may be other causes behind the lack of energy proportionality as the range of applications and platforms is broadened, which we will explore in our future research.
8 Conclusion
Energy proportionality is the key design goal followed by architects of modern multicore CPUs. One of its implications is that optimization of an application for performance will also optimize it for energy. However, due to the inherent complexities of resource contention for shared onchip resources, NUMA, and dynamic power management in multicore CPUs, stateoftheart applicationlevel optimization methods for performance and energy [3, 4, 5, 21] demonstrate that the functional relationships between performance and workload size and between dynamic energy and workload size for reallife dataparallel applications have complex (nonlinear) properties, and show that workload distribution has become an important decision variable.
This motivated us to explore indepth the influence of threedimensional decision variable space on biobjective optimization of applications for performance and energy on multicore CPUs. The three decision variables are: a). The number of identical multithreaded kernels (threadgroups) involved in the parallel execution of an application; b). The number of threads in each threadgroup; and c). The workload distribution between the threadgroups. We focused exclusively on the first two decision variables in this work.
By experimenting with these decision variables, we discovered that energy proportionality does not hold true for modern multicore CPUs. Based on this finding, we proposed the first applicationlevel optimization method for biobjective optimization of multithreaded dataparallel applications for performance and energy on a single multicore CPU. The method uses two decision variables, the number of identical multithreaded kernels (threadgroups) and the number of threads in each threadgroup. A given workload is partitioned equally between the threadgroups.
We demonstrated our method using four highly optimized multithreaded dataparallel applications, 2D fast Fourier transform based on FFTW and Intel MKL and dense matrixmatrix multiplication based on OpenBLAS DGEMM and Intel MKL DGEMM, on four modern multicore CPUs, one of which is a singlesocket multicore CPU while the other three are dualsocket with increasing numbers of physical cores per socket. We showed in particular that optimizing for performance alone results in a significant increase in dynamic energy consumption, that optimizing for dynamic energy alone results in considerable performance degradation, and that our method determined a good number of globally Paretooptimal solutions.
Finally, we proposed a qualitative dynamic energy model employing performance monitoring counters (PMCs) as parameters, which we used to explain the Paretooptimal solutions determined for modern multicore CPUs. The model showed that the energy nonproportionality on our experimental platforms for the two dataparallel applications is caused by disproportionately high energy consumption by the data translation lookaside buffer (dTLB) activity.
Acknowledgments
This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number 14/IA/2474. We thank Roman Wyrzykowski and Lukasz Szustak for allowing us to use their Intel servers, HCLServer03 and HCLServer04.
9 Supplementary Material
9.1 Rationale Behind Using Dynamic Energy Consumption Instead of Total Energy Consumption
There are two types of energy consumption, static energy and dynamic energy. We define the static energy consumption as the energy consumed by the platform without the given application executing. Dynamic energy consumption is calculated by subtracting this static energy consumption from the total energy consumption of the platform during the given application's execution. The static energy consumption is calculated by multiplying the idle power of the platform (without application execution) by the execution time of the application. That is, if P_S is the static power consumption of the platform and E_T is the total energy consumption of the platform during the execution of an application that takes T_E seconds, then the dynamic energy E_D can be calculated as

E_D = E_T − (P_S × T_E)    (2)
We consider only the dynamic energy consumption in our work for the following reasons:

Static energy consumption is a constant (or an inherent property) of a platform that cannot be optimized. It does not depend on the application configuration.

Although static energy consumption is a major concern in embedded systems, it is becoming less significant compared with dynamic energy consumption in HPC systems owing to advancements in hardware architecture design.

We target applications and platforms where dynamic energy consumption is the dominating energy dissipator.

Finally, we believe its inclusion can underestimate the true worth of an optimization technique that minimizes the dynamic energy consumption. We elucidate this using two examples from published results.

In our first example, consider a model that reports the predicted and measured total energy consumptions of a system to be 16500 J and 18000 J. It would report a prediction error of 8.3%. If it is known that the static energy consumption of the system is 9000 J, then the actual prediction error (based on dynamic energy consumptions only) would be 16.6% instead.

In our second example, consider two different energy prediction models with the same prediction error of 5% for an application execution on two different machines with the same total energy consumption of 10000 J. One would consider both models to be equally accurate. But suppose it is known that the dynamic energy proportions of the two machines are 30% and 60%. The true prediction errors (using dynamic energy consumptions only) for the models would then be 16.6% and 8.3%, respectively. Therefore, the second model should be considered more accurate than the first.

9.2 Shared Memory Implementations of PMMTG Algorithms
The shared memory implementations of the PMMTG algorithms using Intel MKL and OpenBLAS are described here. The inputs to an implementation are: a) the matrices A, B, and C; b) the two scalar constants of the DGEMM operation; c) the number of abstract processors (groups); and d) the number of threads in each abstract processor (group). The output matrix, C, contains the matrix product. Each abstract processor is a group of threads.
The implementations using Intel MKL differ from those using OpenBLAS. In Intel MKL, the matrix-matrix computation specific to a partition is performed in an OpenMP parallel region, whereas in OpenBLAS the same computation is performed by a pthread.
9.2.1 Intel MKL implementation of PMMTGV
Figure 23 shows the implementation of PMMTGV using Intel MKL.
9.2.2 OpenBLAS implementation of PMMTGV
Figure 24 shows the implementation of PMMTGV using OpenBLAS.
9.2.3 Intel MKL implementation of PMMTGS
Figure 25 shows the implementation of PMMTGS using Intel MKL.
9.2.4 OpenBLAS implementation of PMMTGS
Figure 26 shows the implementation of PMMTGS using OpenBLAS.
9.2.5 Intel MKL implementation of PMMTGH
Figure 27 shows the implementation of PMMTGH using Intel MKL.
9.3 Shared Memory Implementations of PFFT Algorithms
The inputs to an implementation are: a) the signal matrix; b) the number of abstract processors (groups); and c) the number of threads in each abstract processor (group). The output is the transformed signal matrix (the FFT is performed in place). Each abstract processor is a group of threads.
The implementations using Intel MKL differ from those using FFTW. In FFTW, only plan execution (fftw_plan_many_dft) and plan destruction (fftw_destroy_plan) are thread-safe and can be called in an OpenMP parallel region.
9.3.1 Intel MKL implementation of PFFTTGH
Figure 28 shows the implementation of PFFTTGH using Intel MKL.
9.4 Transpose Routine Invoked in PFFT Algorithms
The routine hcl_transpose_block, shown in Figure 29, performs an in-place transpose of a complex 2D square matrix. We use a block size of 64 in our experiments since it was found to be optimal.
9.5 Application Programming Interface (API) for Measurements Using External Power Meter Interfaces (HCLWattsUp)
HCLServer01, HCLServer02, and HCLServer03 each have a dedicated power meter installed between their input power sockets and the wall A/C outlets. The power meter captures the total power consumption of the node and has a data cable connected to the USB port of the node. A Perl script collects the data from the power meter using the serial USB interface. The execution of this script is non-intrusive and consumes insignificant power.
We use the HCLWattsUp API, which gathers the readings from the power meters to determine the average power and energy consumption during the execution of an application on a given platform. The HCLWattsUp API can provide the following four types of measures during the execution of an application:

TIME—The execution time (seconds).

DPOWER—The average dynamic power (watts).

TENERGY—The total energy consumption (joules).

DENERGY—The dynamic energy consumption (joules).
We confirm that the overhead due to the API is minimal and has no noticeable influence on the main measurements. It is important to note that the power meter readings are processed only if the measure is not hcl::TIME. Therefore, for each measurement we have two runs: one for measuring the execution time, and the other for the energy consumption. The following example illustrates the use of statistical methods to measure the dynamic energy consumption during the execution of an application.
The API is confined to the hcl namespace. Lines 10–12 construct the Wattsup object. The inputs to the constructor are the paths to the scripts, and their arguments, that read the USB serial devices containing the readings of the power meters.
The principal method of the Wattsup class is execute. The inputs to this method are the type of measure, the path to the executable (executablePath), the arguments to the executable (executableArgs), and the statistical thresholds (pIn). The outputs are the achieved statistical confidence (pOut) and the estimators, the sample mean (sampleMean) and the standard deviation (sd), calculated during the execution of the executable. The execute method repeatedly invokes the executable until one of the following conditions is satisfied:

The maximum number of repetitions specified in