Optimization of hybrid parallel application execution in heterogeneous high performance computing systems considering execution time and power consumption

by   Paweł Rościszewski, et al.

Many important computational problems require high performance computing (HPC) systems that consist of multi-level structures combining ever larger numbers of devices with various characteristics. Utilizing the full power of such systems requires programming parallel applications that are hybrid in two meanings: they can utilize parallelism on multiple levels at the same time, and they combine programming interfaces specific to various types of computing devices. The main goal of parallel processing is increasing the processing performance, and therefore decreasing the application execution time. The international HPC community is targeting development of "Exascale" supercomputers (able to sustain 10^18 floating point operations per second) by the year 2020. One of the main obstacles to achieving this goal is the power consumption of the computing systems, which exceeds the energy supply limits. New programming models and algorithms that consider this criterion are one of the key areas where significant progress is necessary in order to achieve the goal. The goal of the dissertation is to extract a general model of hybrid parallel application execution in heterogeneous HPC systems that is a synthesis of existing specific approaches, and to develop an optimization methodology for such execution that aims to minimize the conflicting objectives of application execution time and power consumption of the utilized computing hardware. Both meanings of application hybridity result in a multiplicity of execution parameters with nontrivial interdependences and influence on the considered optimization criteria. The mapping of the application processes to computing devices also has a significant impact on these criteria.




1.1 Motivations

The multiplicity of the possible configurations of hybrid parallel application execution stems from various parameters of the execution and feasible allocations of hardware to individual parts of the application. An example of an execution parameter connected with the utilized hardware is the number of threads executed in parallel. Figure 1.1 presents the influence of the number of used threads on the execution times of a parallel regular expression matching application (see Section 4.1.2) and a geospatial interpolation application (see Section 4.1.3), each run on a different computing device. The first device is a GTX 480 GPU belonging to the HPC system described in Section 4.2.1 and the second device is an Intel Xeon Phi 7120P computing accelerator belonging to the HPC system described in Section 4.2.4.


(a) Regular expression matching on a GPU


(b) Geospatial interpolation on Intel Xeon Phi
Figure 1.1: Influence of the number of used threads on execution time depending on application and device

The charts show that the optimal value of the same execution parameter can differ completely depending on the specific application and utilized computing device: for the GPU the optimal value is on the order of thousands of threads, and for the Xeon Phi accelerator, dozens of threads. Efficient execution of a multi-level application running in a heterogeneous HPC system may require individual tuning of the execution parameters for each utilized device. Finding a single optimal value may also be non-trivial due to the presence of local optima.
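The effect of local optima on tuning can be illustrated with a small sketch. The cost function below is synthetic and purely hypothetical (it does not reproduce the measurements of Figure 1.1), and the function names are assumptions made for illustration. It only shows why a greedy neighbour search over thread counts may stop at a local optimum that an exhaustive sweep avoids.

```python
def synthetic_time(threads: int) -> float:
    """Hypothetical execution time model, NOT measured data.

    The smooth part has a single minimum; the penalty for thread counts
    that are not a multiple of 8 (an assumed hardware granularity)
    introduces many local optima.
    """
    smooth = 1000.0 / threads + 0.5 * threads
    penalty = 6.0 if threads % 8 else 0.0
    return smooth + penalty


def hill_climb(cost, start: int, lo: int, hi: int) -> int:
    """Greedy search: move to the better neighbour until none improves."""
    current = start
    while True:
        neighbours = [t for t in (current - 1, current + 1) if lo <= t <= hi]
        best = min(neighbours, key=cost)
        if cost(best) >= cost(current):
            return current
        current = best


def exhaustive(cost, lo: int, hi: int) -> int:
    """Evaluate every thread count in [lo, hi] and return the best one."""
    return min(range(lo, hi + 1), key=cost)


local_best = hill_climb(synthetic_time, start=20, lo=1, hi=128)
global_best = exhaustive(synthetic_time, lo=1, hi=128)
print(local_best, global_best)
```

In this synthetic setting the greedy search stops at a multiple of 8 that is not the best one, while the sweep finds the global optimum at the cost of evaluating every candidate.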

The multiple parameters of the execution may also originate from the application domain. An example of such a parameter is the size of the packets into which the input data should be partitioned for parallel execution. The application described in Section 4.1.4, given a number of points in a high-dimensional space, computes similarity measures between the points in parallel by distributing data packets among the available computing devices. The influence of the number of points in such a data packet on the execution time on the HPC system described in Section 4.2.1 for a space of 200 000 dimensions is shown in Figure 1.2.


Figure 1.2: Influence of data partitioning on the execution time depending on problem size [10]

The chart shows that there is a certain optimal value of this parameter, which differs depending on the problem size. In the case of 1000 data points, the best partitioning is 30 points per data packet, but in the case of 2000 data points it is 50. There can be dozens of such parameters in one application execution, and their optimal values may depend on each other. Identifying such parameters for specific applications in various fields and examining their interdependence and influence on the execution time and power consumption of the application is an important research direction. The multiple possibilities of mapping parts of the application to computing devices make the space of configurations even more complex.

Many existing approaches focus on tuning specific execution parameters or solving a specific scheduling problem, with regard to both execution time and power consumption optimization goals, which are often formulated using analytical formulas. Defining a specific formula for execution time and power consumption is often required for each individual application, which is complicated and requires full a priori knowledge about the components of the application and their behavior. In the light of contemporary heterogeneous and multi-level systems, there is a need to analyse these approaches, to experiment with practical applications from various fields, and to extract a general execution model and optimization methodology that includes all the important aspects and takes into account the opposing objectives of execution time and power consumption.

In many existing approaches, evaluating an execution configuration requires running an actual execution with this configuration. Considering high dimensionality of the configuration space and prohibitively long execution times, there is a need for a method for fast and accurate evaluation of configurations. A modeling and simulation method could be used for this purpose. Research in the field of modeling and simulation of parallel application execution and experiments involving simulation of real applications is needed to verify if a simulation approach can provide fast and accurate estimations of execution time and power consumption and, thus, support optimization of practical applications.

1.2 Problem Formulation

The process optimized within this thesis is execution of a hybrid parallel application in a heterogeneous high performance computing system. The primary components of a hybrid parallel application are computation and communication operations:

Definition 1.

Let $\mathit{compset}$ denote the set of all possible computation operations, where $\mathit{comp} \in \mathit{compset}$ is a computation operation.

Definition 2.

Let $\mathit{commset}$ denote the set of all possible communication operations, where $\mathit{comm} \in \mathit{commset}$ is a communication operation.

A computation operation utilizes a certain computing device in order to perform computations, while a communication operation utilizes a certain network link to perform communication between two processes. Operations are characterized by operation parameters, for example computational complexity of a computation operation or data size for a communication operation.

The operations are executed within processes:

Definition 3.

Let $P$ denote the set of all possible processes, where a parallel process $p \in P$ is a sequence of computation and communication operations $(o_1, o_2, \ldots, o_n)$ where $o_i \in \mathit{compset} \cup \mathit{commset}$.

Parallelism of a hybrid parallel application can be taken into account by the proposed model on two levels. First, there might be many processes in an application which are executed simultaneously on different hardware. Second, the process itself may execute the computations in parallel. For example, an application consisting of many MPI processes which execute parallel codes on GPUs is parallel at both levels.

A hybrid parallel application consists of one or more process implementations, which define the sequence of operations in each process of the application:

Definition 4.

Let $I_{\mathit{app}} \subset P$ denote the set of process implementations provided by a hybrid parallel application $\mathit{app}$, where $P$ is the set of all possible processes.

A process implementation results from the code of the application, but can also be represented as an arbitrary schedule of operations, a computer program or a log from a completed application. The operation sequences of two processes $p_1$ and $p_2$ may also depend on each other if at some moment of execution communication occurs between these two processes.

An application has certain process requirements:

Definition 5.

Let vectors $\mathit{minreq}$ and $\mathit{maxreq}$ denote the process requirements of an application, where $\mathit{minreq}_p$ and $\mathit{maxreq}_p$ are the minimal and maximal numbers of instances of process $p$ that are required by the application to be executed.

The process requirements may for example depend on the parallel paradigm of the application. For instance, a master/slave application would require exactly one master ($\mathit{minreq}_{\mathit{master}} = \mathit{maxreq}_{\mathit{master}} = 1$). It would also require at least one slave ($\mathit{minreq}_{\mathit{slave}} = 1$), and the maximal number of required slaves could be a value depending on data partitioning limitations or none (infinity).

Summarizing, a hybrid parallel application is defined as follows:

Definition 6.

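Let $\mathit{app} = (I_{\mathit{app}}, \mathit{minreq}, \mathit{maxreq})$ denote a hybrid parallel application, where $I_{\mathit{app}}$ is the set of process implementations of the application and $\mathit{minreq}$ and $\mathit{maxreq}$ are the minimal and maximal process requirements of the application.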

It should be noted that a set of independent hybrid parallel applications can be modeled, where processes of each application never communicate with processes of another application. Thus, we can still treat such a set of applications as one hybrid parallel application.
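The application model of Definitions 1 to 6 can be written down as plain data structures. The following is a minimal sketch, not the representation used in the thesis: all class and field names are hypothetical, and the operation parameters (`flops`, `data_size`) and their values are chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple, Union


@dataclass(frozen=True)
class Computation:
    flops: float  # operation parameter: computational complexity


@dataclass(frozen=True)
class Communication:
    data_size: int  # operation parameter: bytes transferred
    peer: str       # name of the process implementation on the other side


Operation = Union[Computation, Communication]


@dataclass(frozen=True)
class ProcessImplementation:
    name: str
    operations: Tuple[Operation, ...]  # ordered sequence of operations


@dataclass(frozen=True)
class Application:
    implementations: Dict[str, ProcessImplementation]
    min_req: Dict[str, int]
    max_req: Dict[str, Optional[int]]  # None stands for "no upper limit"

    def __post_init__(self) -> None:
        # Reject requirement vectors that can never be satisfied.
        for name in self.implementations:
            lo, hi = self.min_req[name], self.max_req[name]
            if lo < 0 or (hi is not None and hi < lo):
                raise ValueError(f"inconsistent requirements for {name!r}")


# Master/slave example: exactly one master, at least one slave.
master = ProcessImplementation("master", (Communication(1 << 20, "slave"),))
slave = ProcessImplementation(
    "slave",
    (Communication(1 << 20, "master"), Computation(5e9), Communication(4096, "master")),
)
app = Application(
    implementations={"master": master, "slave": slave},
    min_req={"master": 1, "slave": 1},
    max_req={"master": 1, "slave": None},
)
```

A set of independent applications could be represented in the same way by merging their process implementations into one `Application` object, since their processes never communicate with each other.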

Hybrid parallel applications are executed on heterogeneous HPC systems, also called heterogeneous computing systems (HCS):

Definition 7.

Let a graph $\mathit{system} = (D, L)$ denote a heterogeneous HPC system, where $D$ is the set of computing devices and $L$ is the set of network links between computing devices ($l_{ij} \in L$ is a network link between devices $d_i$ and $d_j$).

The computing devices represented by $D$ are hardware devices capable of performing computations. Depending on the considered granularity they might represent computing cores, processing units, computer nodes, groups of nodes, high performance clusters, voluntary computing systems etc. Analogously, a network link can represent a system bus, local area network (LAN), wide area network (WAN) etc. Computing devices and network links are characterized by hardware parameters, for example computational performance of a computing device or bandwidth of a network link. Utilizing hardware with different hardware parameters makes a system heterogeneous.

Each computing device has certain hardware capabilities, which represent the technical possibility to execute certain processes (depending on software stack possibilities, available computing capabilities etc.):

Definition 8.

Let function $\mathit{capabilities}(d, p)$ denote hardware capabilities defining how many instances of process $p$ can be executed on device $d$.
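The system graph and the hardware capabilities function can be sketched in the same spirit. The class below is a hypothetical illustration, not an API of any framework mentioned in this thesis; device names, performance figures and the bandwidth value are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Tuple


@dataclass(frozen=True)
class Device:
    name: str
    gflops: float  # hardware parameter: computational performance


@dataclass
class System:
    devices: Dict[str, Device] = field(default_factory=dict)
    # Undirected network links keyed by the pair of device names;
    # the value is the hardware parameter "bandwidth" in GB/s.
    links: Dict[FrozenSet[str], float] = field(default_factory=dict)
    # (device name, process name) -> maximal number of instances.
    capabilities: Dict[Tuple[str, str], int] = field(default_factory=dict)

    def add_device(self, device: Device, **max_instances: int) -> None:
        self.devices[device.name] = device
        for process, count in max_instances.items():
            self.capabilities[(device.name, process)] = count

    def add_link(self, a: str, b: str, bandwidth_gbps: float) -> None:
        if a not in self.devices or b not in self.devices:
            raise KeyError("both endpoints must be devices of the system")
        self.links[frozenset((a, b))] = bandwidth_gbps

    def capability(self, device: str, process: str) -> int:
        # A missing entry means the device cannot run the process at all.
        return self.capabilities.get((device, process), 0)


system = System()
system.add_device(Device("cpu0", 200.0), master=1, slave=4)
system.add_device(Device("gpu0", 1300.0), slave=1)
system.add_link("cpu0", "gpu0", bandwidth_gbps=16.0)
```

Returning 0 for unknown pairs encodes the case where the software stack of a device does not support a given process implementation.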

Execution of an application depends also on application execution parameters:

Definition 9.

Let $\mathcal{V}_{\mathit{app},\mathit{system}}$ denote the space of application execution parameters feasible for a hybrid parallel application $\mathit{app}$ and a heterogeneous HPC system $\mathit{system}$.

The space of application execution parameters can potentially be highly dimensional. Multiple examples of execution parameters are described in Chapter 2. The application execution parameters consist of two groups:

  • application parameters - related to the algorithms used in the application, parallel paradigm, data partitioning, assumed problem constraints, buffer sizes, etc.;

  • execution parameters - related to possible configurations of the hardware used at multiple parallelization levels, including numbers of threads, thread affinity, GPU grid configurations, DVFS modes etc.

For a specific execution of a hybrid parallel application in a heterogeneous HPC system, specific values of the application execution parameters have to be set:

Definition 10.

Let $\mathit{executionparameters} \in \mathcal{V}_{\mathit{app},\mathit{system}}$ denote a vector of application execution parameters of a hybrid parallel application.
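How quickly the space of application execution parameters grows can be seen by enumerating it as a Cartesian product. The sketch below assumes a small discrete space; the parameter names and value ranges are hypothetical examples of the two groups listed above, not values used in the experiments.

```python
from itertools import product
from typing import Dict, Iterator, Sequence


def enumerate_configurations(space: Dict[str, Sequence]) -> Iterator[Dict[str, object]]:
    """Yield every vector of parameter values from a discrete parameter space."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical parameter space: one application parameter (packet size)
# and three execution parameters.
space = {
    "packet_size": (10, 30, 50),
    "cpu_threads": (1, 2, 4, 8),
    "gpu_block_size": (64, 128, 256),
    "dvfs_mode": ("powersave", "performance"),
}

configurations = list(enumerate_configurations(space))
print(len(configurations))  # 3 * 4 * 3 * 2 combinations
```

Even this toy space has 72 points, and every additional parameter multiplies the count, which is why evaluating each configuration by a real execution quickly becomes impractical.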

Also, the processes have to be mapped to specific computing devices:

Definition 11.

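Let function $\mathit{mapping}(p, d)$ denote the process mapping of a hybrid parallel application to a heterogeneous HPC system, where $\mathit{mapping}(p, d)$ defines how many instances of process $p$ should be run on a computing device $d$ during a certain application execution.

It should be noted that, given a hybrid parallel application $\mathit{app}$ and a heterogeneous HPC system $\mathit{system}$, a feasible process mapping function belongs to the following feasible set of process mapping functions $\mathcal{M}_{\mathit{app},\mathit{system}}$:

$$\mathcal{M}_{\mathit{app},\mathit{system}} = \Big\{ \mathit{mapping} \;:\; \forall p\, \forall d:\ \mathit{mapping}(p, d) \le \mathit{capabilities}(d, p) \;\wedge\; \forall p:\ \mathit{minreq}_p \le \sum_{d} \mathit{mapping}(p, d) \le \mathit{maxreq}_p \Big\}$$

where $p$ ranges over the process implementations of $\mathit{app}$, $d$ ranges over the computing devices of $\mathit{system}$, $\mathit{capabilities}(d, p)$ denotes the hardware capabilities (Definition 8) and $\mathit{minreq}_p$, $\mathit{maxreq}_p$ denote the process requirements (Definition 5).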
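Membership in the feasible set of process mapping functions can be checked mechanically. The sketch below assumes that a mapping is feasible when it respects the hardware capabilities of every device and the minimal and maximal process requirements; this reading follows Definitions 5, 8 and 11, and the function and argument names are hypothetical.

```python
from typing import Dict, Optional, Tuple

# (process name, device name) -> number of instances
Mapping = Dict[Tuple[str, str], int]


def is_feasible(
    mapping: Mapping,
    capabilities: Dict[Tuple[str, str], int],  # (device, process) -> max instances
    min_req: Dict[str, int],
    max_req: Dict[str, Optional[int]],          # None = no upper limit
) -> bool:
    totals = {process: 0 for process in min_req}
    for (process, device), count in mapping.items():
        if process not in totals:
            return False  # process implementation unknown to the application
        if count < 0 or count > capabilities.get((device, process), 0):
            return False  # device cannot host that many instances
        totals[process] += count
    for process, total in totals.items():
        upper = max_req.get(process)
        if total < min_req[process] or (upper is not None and total > upper):
            return False  # process requirements violated
    return True


capabilities = {("cpu0", "master"): 1, ("cpu0", "slave"): 4, ("gpu0", "slave"): 1}
min_req = {"master": 1, "slave": 1}
max_req = {"master": 1, "slave": None}

good = {("master", "cpu0"): 1, ("slave", "gpu0"): 1, ("slave", "cpu0"): 3}
no_master = {("slave", "gpu0"): 1}
overloaded = {("master", "cpu0"): 1, ("slave", "gpu0"): 2}
```

Here `good` is feasible, `no_master` violates the minimal requirement of one master, and `overloaded` exceeds the capability of `gpu0`.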

The primary outcomes of executing a hybrid parallel application are the results that depend on the purpose of the application. The optimization problem considered in this thesis assumes that these primary results are correct and focuses on the secondary results of the execution, related to execution time and power consumption:

Definition 12.

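Let function $\mathit{executiontime}(\mathit{app}, \mathit{system}, \mathit{mapping}, \mathit{executionparameters})$ denote the execution time of a hybrid parallel application $\mathit{app}$ on a heterogeneous HPC system $\mathit{system}$ with process mapping function $\mathit{mapping}$ and vector of application execution parameters $\mathit{executionparameters}$. Specifically, the execution time is defined as the time from the start of the application execution (which involves starting all process instances in the application) until finishing of all process instances on the devices assigned by the process mapping function:

$$\mathit{executiontime}(\mathit{app}, \mathit{system}, \mathit{mapping}, \mathit{executionparameters}) = \max_{\mathit{process},\, \mathit{device}:\ \mathit{mapping}(\mathit{process}, \mathit{device}) > 0} \mathit{processexecutiontime}(\mathit{process}, \mathit{device})$$

where $\mathit{processexecutiontime}(\mathit{process}, \mathit{device})$ is the execution time of a parallel process $\mathit{process}$ on device $\mathit{device}$.

Definition 13.

Let function $\mathit{powerconsumption}(\mathit{app}, \mathit{system}, \mathit{mapping}, \mathit{executionparameters})$ denote the average power consumption of all computing devices of a heterogeneous HPC system $\mathit{system}$ during the execution of a hybrid parallel application $\mathit{app}$ with process mapping function $\mathit{mapping}$ and vector of application execution parameters $\mathit{executionparameters}$. Specifically, the average power consumption is defined as the total energy consumption (including while idle) of all computing devices in the system throughout the application execution divided by the execution time:

$$\mathit{powerconsumption}(\mathit{app}, \mathit{system}, \mathit{mapping}, \mathit{executionparameters}) = \frac{\sum_{\mathit{device}} \mathit{deviceenergyconsumption}(\mathit{device}, \mathit{app})}{\mathit{executiontime}(\mathit{app}, \mathit{system}, \mathit{mapping}, \mathit{executionparameters})}$$

where the sum runs over all computing devices of $\mathit{system}$ and $\mathit{deviceenergyconsumption}(\mathit{device}, \mathit{app})$ is the total energy consumption of a computing device $\mathit{device}$ during the execution of a hybrid parallel application $\mathit{app}$.

Definition 14.

Let function $\mathit{maxpowerconsumption}(\mathit{app}, \mathit{system}, \mathit{mapping}, \mathit{executionparameters})$ denote the maximum power consumption of all computing devices of a heterogeneous HPC system $\mathit{system}$ during the execution of a hybrid parallel application $\mathit{app}$ with process mapping function $\mathit{mapping}$ and vector of application execution parameters $\mathit{executionparameters}$:

$$\mathit{maxpowerconsumption}(\mathit{app}, \mathit{system}, \mathit{mapping}, \mathit{executionparameters}) = \max_{t} \sum_{\mathit{device}} \mathit{powercons}(\mathit{device}, t)$$

where $t$ ranges over the duration of the application execution, the sum runs over all computing devices of $\mathit{system}$ and $\mathit{powercons}(\mathit{device}, t)$ is the power consumption of a computing device $\mathit{device}$ at time $t$.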
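The three quantities of Definitions 12 to 14 can be computed from per-process times, per-device energy totals and sampled power traces. The sketch below is an illustration under these assumptions; the function names and all numbers are hypothetical, and the maximum power is approximated over discrete samples taken at common timestamps.

```python
from typing import Dict, Sequence, Tuple


def execution_time(
    process_times: Dict[Tuple[str, str], float],  # (process, device) -> seconds
    mapping: Dict[Tuple[str, str], int],           # (process, device) -> instances
) -> float:
    """Time until the last mapped process instance finishes."""
    return max(
        process_times[pair] for pair, count in mapping.items() if count > 0
    )


def average_power(device_energy_j: Dict[str, float], exec_time_s: float) -> float:
    """Total energy of all devices (idle ones included) divided by execution time."""
    return sum(device_energy_j.values()) / exec_time_s


def maximum_power(power_traces_w: Dict[str, Sequence[float]]) -> float:
    """Largest total power over all sampling instants (same timestamps assumed)."""
    return max(sum(samples) for samples in zip(*power_traces_w.values()))


# Hypothetical measurements for one configuration.
process_times = {("master", "cpu0"): 12.0, ("slave", "gpu0"): 10.0, ("slave", "gpu1"): 15.0}
mapping = {("master", "cpu0"): 1, ("slave", "gpu0"): 2, ("slave", "gpu1"): 0}
energy = {"cpu0": 600.0, "gpu0": 1800.0, "gpu1": 240.0}  # gpu1 is idle but still counted
traces = {"cpu0": [40, 60, 50], "gpu0": [100, 180, 150], "gpu1": [20, 20, 20]}

t = execution_time(process_times, mapping)
p_avg = average_power(energy, t)
p_max = maximum_power(traces)
```

Note that the unused device `gpu1` does not affect the execution time but does contribute its idle energy to the average power, as required by Definition 13.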

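The following multi-objective optimization problem is solved within this thesis, with the objective space consisting of execution time $\mathit{executiontime}$ and average power consumption $\mathit{powerconsumption}$. Given a hybrid parallel application $\mathit{app}$ and a heterogeneous high performance computing system $\mathit{system}$:

$$\min_{\mathit{mapping},\ \mathit{executionparameters}} \big( \mathit{executiontime}(\mathit{app}, \mathit{system}, \mathit{mapping}, \mathit{executionparameters}),\ \mathit{powerconsumption}(\mathit{app}, \mathit{system}, \mathit{mapping}, \mathit{executionparameters}) \big) \qquad (1.1)$$

subject to

$$\mathit{mapping} \in \mathcal{M}_{\mathit{app},\mathit{system}}, \qquad \mathit{executionparameters} \in \mathcal{V}_{\mathit{app},\mathit{system}}$$

The Pareto method is used for multi-objective optimization, where the expected solution to the optimization problem is a set of Pareto-optimal points from the search space. The search space consists of the feasible set of process mapping functions $\mathcal{M}_{\mathit{app},\mathit{system}}$ and the space of feasible application execution parameters $\mathcal{V}_{\mathit{app},\mathit{system}}$. A point is Pareto-optimal if no other point in the search space is at least as good in both execution time and average power consumption and strictly better in at least one of them. The Pareto method is described in more detail in Section 3.1.1.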
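Selecting the Pareto-optimal points from a set of evaluated configurations can be sketched as follows. This is a generic quadratic-time filter given for illustration, not the procedure of Section 3.1.1; each point is assumed to be a pair (execution time, average power consumption) and the sample values are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (execution time, average power consumption)


def dominates(a: Point, b: Point) -> bool:
    """a dominates b: not worse in both objectives, strictly better in one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def pareto_front(points: Sequence[Point]) -> List[Point]:
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical evaluations of six configurations.
evaluated = [(10, 300), (12, 250), (15, 200), (13, 260), (15, 220), (10, 320)]
front = pareto_front(evaluated)
print(front)
```

The three remaining points represent different trade-offs: none of them can be made faster without consuming more power, so the final choice among them is left to the user.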

In this thesis we assume that the HPC system system is given and cannot be modified, although optimization that considers introducing new, currently unavailable hardware to the system can be considered as future work. Changes to the processes of the application app are possible, but limited. The computational goals of the application have to be maintained, so the optimization of the processes can only be performed manually by a specialist. Such optimizations are included in the execution steps proposed in this thesis as an optional first step of preliminary process optimization, after which the application app cannot be modified. The remaining parameters of the execution can be optimized automatically and define the search space of the considered process mapping and parameter tuning problem.

The following related problems are also considered in the thesis:

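  • Optimization of execution time under power consumption constraints - given a hybrid parallel application $\mathit{app}$, a heterogeneous high performance computing system $\mathit{system}$ and a power consumption limit $\mathit{powerlimit}$:

    $$\min_{\mathit{mapping},\ \mathit{executionparameters}} \mathit{executiontime}(\mathit{app}, \mathit{system}, \mathit{mapping}, \mathit{executionparameters}) \qquad (1.2)$$

    subject to

    $$\mathit{maxpowerconsumption}(\mathit{app}, \mathit{system}, \mathit{mapping}, \mathit{executionparameters}) \le \mathit{powerlimit}, \qquad \mathit{mapping} \in \mathcal{M}_{\mathit{app},\mathit{system}}, \qquad \mathit{executionparameters} \in \mathcal{V}_{\mathit{app},\mathit{system}}$$

  • Optimization of execution time - given a hybrid parallel application $\mathit{app}$ and a heterogeneous high performance computing system $\mathit{system}$:

    $$\min_{\mathit{mapping},\ \mathit{executionparameters}} \mathit{executiontime}(\mathit{app}, \mathit{system}, \mathit{mapping}, \mathit{executionparameters}) \qquad (1.3)$$

    subject to

    $$\mathit{mapping} \in \mathcal{M}_{\mathit{app},\mathit{system}}, \qquad \mathit{executionparameters} \in \mathcal{V}_{\mathit{app},\mathit{system}}$$

In both problems $\mathcal{M}_{\mathit{app},\mathit{system}}$ is the feasible set of process mapping functions and $\mathcal{V}_{\mathit{app},\mathit{system}}$ is the space of feasible application execution parameters.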
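Once a set of configurations has been evaluated, the power-constrained variant reduces to filtering by the limit and taking the fastest remaining configuration. The sketch below assumes that the constraint is applied to the maximum power consumption of Definition 14; the record type, labels and numbers are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class Evaluation(NamedTuple):
    label: str          # human-readable description of the configuration
    time_s: float       # execution time
    max_power_w: float  # maximum power consumption during the execution


def fastest_within_limit(
    evaluations: Sequence[Evaluation], power_limit_w: float
) -> Optional[Evaluation]:
    """Return the fastest configuration that respects the power limit, if any."""
    admissible = [e for e in evaluations if e.max_power_w <= power_limit_w]
    return min(admissible, key=lambda e: e.time_s, default=None)


evaluations = [
    Evaluation("2 GPUs", 10.0, 520.0),
    Evaluation("1 GPU", 17.0, 300.0),
    Evaluation("CPU only", 40.0, 150.0),
]

choice = fastest_within_limit(evaluations, power_limit_w=350.0)
```

With an unlimited power budget the same function solves the plain execution time optimization problem, because no configuration is filtered out.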

The model of hybrid parallel application execution in a heterogeneous HPC system proposed in this section is referenced by Claim 1 of this dissertation.

1.3 Scope of the Dissertation

The work described in this thesis is related to six main areas:

  • specific hybrid parallel applications and their implementations;

  • solutions for executing hybrid parallel applications in heterogeneous HPC systems;

  • modeling and simulation of parallel applications in heterogeneous HPC systems;

  • multi-objective optimization of parallel applications;

  • energy-aware resource management in heterogeneous HPC systems;

  • parameter auto-tuning in parallel applications.

Figure 1.3 presents to what extent the contributed papers related to this thesis are relevant to each of these main areas.


Figure 1.3: Map of contributed papers and their relevant fields

In the area of specific hybrid parallel applications, the scope of this thesis includes an overview of existing applications in Section 2.1.2 and the applications described in Section 4.1, used in the experiments. In the most application-oriented paper [11] we proposed a regular expression matching application with configurable data intensity for testing heterogeneous HPC systems. Certain descriptions and analyses of the corresponding considered applications have also been provided in papers [10, 12, 13, 14, 15, 16].

Chosen existing approaches to executing parallel applications in heterogeneous HPC systems have been described in Section 2.2. The framework for multi-level high performance computing using clusters and workstations with CPUs and GPUs introduced in [13] is proposed in this thesis as a software solution for executing the proposed optimization methodology. Execution of a parallel deep neural network training application on a cluster with multiple GPUs is described in [15]. An approach to using distributed databases for data management in heterogeneous HPC systems is proposed in [17]. Certain aspects regarding execution of the considered applications have also been discussed in papers [11, 18, 19, 20].

An important contribution of this thesis is the method for modeling and simulation of hybrid parallel application execution on heterogeneous HPC systems described in Section 5.2. Section 2.3 provides an overview of existing approaches to simulation and modeling in this field. In [21] we discussed the existing simulation systems and provided motivations for a new discrete-event simulation environment introduced in [14]. An example of execution time modeling using this environment has been described in [10], and an example of modeling energy consumption in [12]. In [16] we proposed an approach, described also in Section 6.2.2, to exploring the power/time trade-off of a deep neural network training application using this simulation environment. Additionally, simulation has been used in the approach to optimization of execution time under power consumption constraints proposed in [19] and described in Section 6.1.2 to determine optimal data partitioning. In [20], a method of configuring process labels in the system model has been proposed that simplifies executing simulations with fine granularity.

The problem solved within this thesis, defined in Section 1.2, is a multi-objective optimization problem. In Section 3.1 we discuss approaches to multi-objective optimization of parallel applications and choose the Pareto method for the approach proposed in this thesis. In [19] we focused on the problem of execution time optimization under power consumption constraints defined in Equation 1.2. The Pareto method has been used for exploring the power/time trade-off of a parallel deep neural network training application described in [16] and in Section 6.2.2.

The task mapping aspect of the optimization problem solved within this thesis, connected with finding the optimal process mapping function mapping, lies in the field of resource management. The approach proposed in this thesis is compared to chosen approaches to energy-aware resource management in heterogeneous HPC systems in Section 3.2. The task mapping problem has been considered in the context of optimization of execution time under power consumption constraints in [19]. Resource management has also been considered in [17] in the context of network-aware scheduling and in [18] in the context of computing system monitoring. Finding the optimal number of used computing devices in [16] is also connected with resource management.

The aspect of the considered optimization problem connected with finding the optimal set of application execution variables executionparameters lies in the field of parameter auto-tuning. Chosen approaches to parameter auto-tuning in parallel applications are described in Section 3.3. Auto-tuning of execution variables such as thread numbers, GPU grid configurations and data partitioning is an important part of the optimization methodology proposed in [13] and the generalized version in Chapter 5. Data partitioning has also been tuned in [19] and [10]. In [16] the tuned variable is connected with resource mapping, namely the number of devices used for computations.

The problem solved in this thesis combines the following problems from the fields of multi-objective optimization, resource management and parameter auto-tuning:

  • multi-objective optimization of execution time and power consumption problem;

  • suboptimal-approximate energy-aware global static task mapping problem;

  • offline auto-tuning of system parameters problem

into one bi-objective Pareto optimization problem of execution time and power consumption.

1.4 Main Contributions of the Dissertation

This work contains the following original contributions made by the author:

  • proposition of a new model of hybrid parallel application execution in heterogeneous HPC systems described in Section 1.2 that focuses on execution time and power consumption of processes consisting of computation and communication operations and considers process mapping and application execution parameters;

  • implementations connected with specific hybrid parallel applications, namely: heterogeneous OpenCL implementation of the regular expression matching application described in Section 4.1.2; multi-level heterogeneous OpenCL + MPI implementation of the application described in Section 4.1.3 and integration of the application with the KernelHive framework (this air pollution interpolation application was used in the SmartCity system developed for the local government of the city of Gdańsk, Poland); implementation of the model of the large vector similarity measure computation application described in Section 4.1.4 in the MERPSYS environment; extension of the existing deep neural network training application described in Section 4.1.5 by MPI message passing for multi-level execution and implementation of the model of this application in the MERPSYS environment (the application was used for acoustic model development by VoiceLab.ai, a company based in Gdańsk, Poland);

  • proposition of specific execution steps for hybrid parallel applications in heterogeneous HPC systems described in Section 5.1 consisting of preliminary process optimization, process mapping, parameter tuning and actual execution, and implementation of these steps in the KernelHive framework;

  • co-design of the simulation method of hybrid parallel application execution in heterogeneous HPC systems described in Section 5.2 and proposition of using this method for fast evaluation of process mappings and application execution parameter values in the multi-objective execution time and power consumption optimization. Co-implementation of this method within the MERPSYS simulation environment. Specifically, implementation of the scheduler mechanism, framework for executing multiple parallel simulations, power consumption computation and multiple improvements of the simulator;

  • demonstration of the proposed execution steps on the example of multi-level task farming applications described in Section 6.1 including computation and communication overlapping, network-aware and power constrained scheduling, tuning of grid configurations and data partitioning and execution in heterogeneous HPC systems with CPUs and GPUs using the KernelHive framework;

  • demonstration of the proposed optimization methodology as a whole including the execution steps and simulation method on the example of a deep neural network training application described in Section 6.2 including overlapping of training and data preprocessing, power-aware device selection and execution on a professional cluster of workstations with GPUs using the contributed multi-level implementation.

1.5 Claims of the Dissertation

  1. The execution steps specified in the context of the proposed model, including preliminary process optimization, process mapping, parameter tuning and actual execution, make it possible to optimize the execution time of hybrid parallel applications in heterogeneous high performance computing systems. Empirical proofs for this claim are provided in Chapter 6. Specifically, proofs related to preliminary process optimization are presented in Sections 6.1.1 and 6.2.1, process mapping in Sections 6.1.2 and 6.2.2, parameter tuning in Sections 6.1.3 and 6.2.2, and actual execution in Sections 6.1.4 and 6.2.3.

  2. The proposed modeling and simulation method allows for fast and accurate identification of the set of Pareto-optimal solutions to the problem of multi-objective execution time and power consumption optimization of hybrid parallel applications in heterogeneous high performance computing systems. Empirical proofs for this claim are provided in Section 6.2.2.

1.6 Overview of the Dissertation

The remainder of this thesis is organized as follows. A state of the art review is provided in Chapters 2 and 3. Chapter 2 discusses hybrid parallel applications, examples of applications from various fields, software solutions for executing them in heterogeneous HPC systems and approaches to modeling and simulation of their execution. Chapter 3 covers approaches to optimization of parallel application execution in the contexts of multi-objective execution time and power consumption optimization, energy-aware resource management and parameter auto-tuning.

Chapter 4 describes the specific applications and systems that have been used in the experiments within the thesis. The main contribution of this thesis is the optimization methodology proposed in Chapter 5. An empirical evaluation of the methodology, described in Chapter 6, is based on case studies involving optimization of specific executions of hybrid parallel applications in heterogeneous HPC systems. Finally, conclusions and a discussion of possible future work are provided in Chapter 7.

2.1 Hybrid Parallel Applications

We characterize the common denominator shared by parallel applications which could benefit from the methodology proposed in this thesis by naming them hybrid parallel applications. In Section 2.1.1 we explain our interpretation of the term hybrid in the context of parallel applications by comparing it to existing works. In Section 2.1.2 we provide examples of specific applications in the fields similar to the applications considered in our experiments. Possible application parameters are emphasized.

2.1.1 Hybridity of the Applications

The term hybrid in the context of parallel applications appears in two main meanings. First, mixing different types of computing devices [22, 23, 24, 25], for example CPU + GPU. Second, mixing programming APIs on different parallelism levels [26, 27, 28], for example MPI + OpenMP. In both cases the aim is utilizing more computing devices in order to achieve better performance or performance/power consumption ratio. The first meaning implies the heterogeneity of the utilized HPC system, and the second meaning implies that the system is multi-level. It is worth noting that, in essence, both these meanings are related to the properties of the HPC system, namely that it is heterogeneous and multi-level. Examples of such systems are discussed in Section 2.2.3.

Having said that, the term hybrid may seem redundant in the context of parallel applications themselves. However, we decided to emphasize this word due to another meaning. A crucial factor of the model of parallel application execution proposed in this thesis is the set of parameters which influence its execution. In a heterogeneous multi-level system these parameters may include various aspects of execution, often related to certain types of computing devices or parallelization levels. The term hybrid is gaining importance in the cases where changing one of these application parameters influences the optimal value of another.

For example, the term hybrid has been used in the context of a hybrid Xeon/Xeon Phi system in [24]. Within the paper, a parallel application for computing similarity measures between large vectors is optimized for scalability in a system consisting of Intel Xeon CPUs along with Intel Xeon Phi coprocessors. Proposed optimizations include load balancing, loop tiling, overlapping and thread affinity setting. The system is hybrid in the sense of utilizing different types of computing devices, which makes it a heterogeneous computing system. What is more significant, executing computations on the Intel Xeon Phi influences the optimal number of cores used on the host processor for computations: one core should be left free, so it can be efficiently used for handling the accelerator.

Situations in which the optimal values of execution parameters depend on each other often occur in practical approaches. In the above example, changing one execution parameter (whether or not to use the coprocessor) influences the optimal value of another execution parameter (the number of host processor cores used for the computations). Such relationships also concern application-specific parameters. For instance, when the authors of [29] fixed one of three application parameters of a GS2 physics application and plotted the performance as a function of two other application parameters, it turned out that the optimization surface was not smooth and contained multiple local minima.

The mutually dependent parameters also include meta-parameters resulting from code optimizations. One example would be an approach to optimizing parallel application execution using a combination of compiler and runtime techniques proposed in [30]. In this approach, regions of the tuned applications are subject to source-to-source transformations. Parameters of these transformations belong to the set of tuned variables. The authors observe that optimal transformation parameter values or even distinct transformation sequences depend on another execution parameter, namely the number of threads. This observation is a motivation for the multi-versioning approach proposed in the paper, where a set of optimal solutions is encoded by the compiler into one executable and the runtime system dynamically chooses between the versions depending on changing circumstances and objectives.

Another example is simultaneous loop tiling and unrolling program transformations for minimizing execution time of selected numeric benchmarks in [31]. The authors claim that combining the best tiling transformation with the best unrolling factor does not necessarily give the best overall transformation. What is more, it is shown that small deviations from optimal tile sizes and unroll factors can increase the execution time so much that it becomes even higher than in the original program. Previously used static cost models, which attempt to give an analytical expression for the execution time, were vulnerable to such varying optimization spaces. Instead, an approach called iterative compilation is proposed, where many versions of the program are generated and executed for determining optimal values of the optimization parameters.

Although the two latter approaches concern programs that are neither multi-level nor executed on a heterogeneous platform, they illustrate non-trivial dependencies between application execution parameters, which cause a need for specific optimization algorithms. In contemporary multi-level hpc applications executed in heterogeneous hpc systems, dependencies of this type are increasingly likely, and it is getting harder and harder to describe them using analytical models.

Summarizing, in the sense of this work, a parallel application is hybrid if it is multi-level or executed in a heterogeneous hpc system, but more importantly when there are non-trivial dependencies between the application execution parameters and the optimization objectives, such as execution time and power consumption. This deeper meaning of the term hybrid is essential in the model and optimization approach proposed in this thesis, because it means that instead of finding optimal parameter values separately, in many cases there is a need to take into account the whole space of decision parameters.
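The difference between tuning parameters separately and searching the whole decision space can be illustrated with a minimal sketch. The execution times below are invented for illustration and only mimic the Xeon/Xeon Phi situation described above (one host core left free when the coprocessor is used); the names exec_time, best_joint and best_separate are hypothetical.

```python
from itertools import product

# Hypothetical execution times (seconds) for two interdependent parameters:
# (use_coprocessor, number_of_host_cores_used). Values are invented.
exec_time = {
    (False, 3): 12.0, (False, 4): 10.0,
    (True, 3): 6.0,   (True, 4): 7.5,
}

use_accel_values = [False, True]
host_cores_values = [3, 4]

def best_joint():
    """Search the whole parameter space at once."""
    return min(product(use_accel_values, host_cores_values),
               key=lambda cfg: exec_time[cfg])

def best_separate(default=(False, 3)):
    """Tune each parameter on its own, holding the other at its default."""
    accel = min(use_accel_values, key=lambda a: exec_time[(a, default[1])])
    cores = min(host_cores_values, key=lambda c: exec_time[(default[0], c)])
    return (accel, cores)

print(best_joint(), exec_time[best_joint()])
print(best_separate(), exec_time[best_separate()])
```

With these assumed numbers, separate tuning selects all four host cores together with the coprocessor, while the joint search finds the faster configuration that leaves one core free.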

2.1.2 Examples of Hybrid Parallel Applications in Selected Fields

In order to provide examples of what the discussed class of applications can be useful for, in this section we describe chosen applications from fields similar to those of the applications developed and used in the experiments within this thesis, described in Section 4.1.

The sample application described in Section 4.1.1 is distributed MD5 hash breaking using a cluster with cpus and gpus. A similar approach to distributed password cracking has been proposed in [32], where a cluster of up to 88 gpus has been used for password recovery through brute-force attack, achieving good scalability thanks to the proposed efficient password distribution scheme. The solution is hybrid in the multi-level sense, but the computing system is homogeneous (cpus are not used for computations). In the field of cryptography, parallel applications that are hybrid in the sense of device heterogeneity are also used. For example in [33], a cluster system integrating cpu and gpu devices from various vendors is used for efficient encryption and decryption of large amounts of data. Application parameters in password cracking may be related to the method of dividing workload across the used workers, namely the number of passwords to be checked by each worker in an iteration and expected minimum/maximum password length which influence the average time of recovering the password.

The regular expression matching application described in Section 4.1.2 belongs to the field of text pattern matching. Pattern matching algorithms are widely used in signature-based network intrusion detection systems (NIDS) [34]. The objective of such systems is to examine if incoming network packet payloads contain malicious content defined as "signatures" or "patterns" and generate alert messages for system administrators. Examples of application parameters in this problem are maximum length of a signature and size of slices into which the traffic is partitioned. Hybrid cpu/gpu pattern matching for deep packet inspection has been proposed in [25], where the incoming packets are pre-filtered using cpu and suspicious packets are sent to the gpu for complete matching. The gpu workload is reduced thanks to the cpu pre-filtering stage.

Geostatistical interpolation is a fundamental task in geographic information science and is used for prediction of environmental phenomena at non-observed locations. Computational cost of the used algorithms grows with the number of data points from the observed locations and the number of locations for which the interpolated values are needed. The contemporary interpolation workloads are critical, for example in weather forecast systems, and efficient implementations of geostatistical interpolation algorithms are needed. The same algorithm as the one described in Section 4.1.3 has been adapted to massively parallel computing environments in [35] and [36]. An efficient gpu implementation of another popular geostatistical interpolation method called kriging has been proposed in [37]. An exemplary application parameter in the geostatistical interpolation problem could be related to the used strategy of data point partitioning. For example, quad trees have been used in the approach to another geoscientific problem: constructing digital elevation models from high resolution point clouds acquired using LIDAR technology [27]. A hybrid mpi/gpu implementation for solving this problem has been proposed. The application is hybrid in the multi-level sense, with multiple gpu-equipped hosts independently interpolating a portion of data and assembling the final model from partial results with balancing of I/O, computation and communication in mind.

The application described in Section 4.1.4 concerns large-vector similarity measure computation. This task is a crucial part of clustering, which means grouping a set of objects into classes of similar objects. Algorithms using pattern similarity have been successfully applied to large data sets in DNA microarray analysis, e-commerce applications, such as collaborative filtering [38], as well as real-time searching for similar short messages for the purposes of a social networking service with a dataset of over a billion messages [39]. Faiss [28] is a recent open-source library for efficient similarity search and clustering of dense vectors. A key problem addressed by this approach is to, given a large database of objects, construct a k-NN graph - a directed graph whose nodes represent objects from the database and edges connect them to k nearest neighbors, according to one of the supported distance functions. The solution is capable of using multiple gpus on one server for constructing a high accuracy k-NN graph. For example, construction of a graph connecting 1 billion vectors in less than 12 hours on 4 NVIDIA Maxwell Titan X gpus has been reported. The presented application of the library is constructing a k-NN graph for a database of 95 million images and finding a path in this graph, resulting in a sequence of smooth transitions from a given first to a given last image. Application parameters of the large-vector similarity measure computation problem include the maximum dimensionality of an object and size of packages in which the objects are transferred to the computing devices.

Finally, there are multiple applications of parallel deep neural network training, the field of the application described in Section 4.1.5. Many of them are hybrid in the multi-level sense, because often multiple computing devices are used for training. For example, a neural network for classifying positions in the Go game according to archival expert moves was trained using asynchronous stochastic gradient descent on 50 gpus in [40]. The training took around three weeks, because 340 million training steps were needed, and it contributed to winning 99.8% of games against other Go programs and defeating the human European Go champion by 5 games to 0. There have also been hybrid cpu/gpu approaches to deep neural network training. A version of the popular deep learning framework Caffe proposed in [41] allows using both cpus and gpus for training a deep neural network, which on a single convolutional layer achieves 20% higher throughput than only on a gpu. A hybrid cpu/gpu implementation [42] has been also proposed for A3C, a parallel method for reinforcement learning, which can for example learn to successfully play an Atari game only from raw screen inputs. The proposed hybrid implementation generates and consumes training data for large deep neural networks up to 45 times faster than its cpu counterpart. An application parameter in the field of parallel deep neural network training can for example be the frequency of model synchronization between many training workers.

2.2 Executing Parallel Applications in Heterogeneous HPC Systems

Depending on the hardware utilized by a parallel application, various software solutions are used for its execution. In this section we describe selected tools, frameworks and APIs used for executing parallel applications. We put particular emphasis on related execution parameters, which may belong to a set of decision variables in the optimization problem solved in this thesis. Section 2.2.1 is devoted to systems, where multiple threads can run in parallel and communicate and synchronize through shared memory. Distributed memory systems, where each process has its own private memory and some form of interconnection is needed for communication are described in Section 2.2.2. Finally, Section 2.2.3 focuses on systems that allow executing applications that are hybrid in two meanings described in Section 2.1.1 – with multiple levels of parallelization and with heterogeneous computing devices.

2.2.1 Shared memory systems

From the viewpoint of a parallel application programmer, a program running in a shared memory system typically consists of one or more threads - sequences of programmed instructions, which are executed concurrently. POSIX threads [43] is a popular parallel execution model that defines C-language functions for thread management and synchronization, implementations of which are available for many operating systems. Analogous mechanisms are available for many popular programming languages, for example the threading library in Python, Java threads [44] etc. If the utilized runtime supports it, some of the concurrent threads may be executed in parallel, resulting in reduction of application execution time.

Parallel applications that utilize the threading execution model are very often executed on cpus with multiple cores (multi-core). An evident execution parameter of such applications is the number of used threads. The optimal number of threads should allow to efficiently utilize available cpu cores, which does not necessarily mean that the optimal number is equal to the number of cpu cores. The capacity of utilizing many cores in parallel depends on the algorithms used by the application. Additionally, modern cpus support hardware multithreading techniques such as simultaneous multithreading (SMT), which allow multiple threads to be executed simultaneously on one core. Although the threads are completely separated from each other, running them on one core influences the computation performance. Apart from the proper number of used threads, in such non-uniform computing architectures, efficient utilization of the multi-core computing devices requires taking into account how the application threads are mapped to the available cores. This can be achieved by configuring thread affinity, which allows binding and unbinding certain threads to certain cpu cores and is another example of a parallel application execution parameter.

Modern high performance computing accelerators are equipped with dozens of physical computing cores. For example, the Knights Landing architecture used by the second generation of Intel MIC accelerators is built from up to 72 cores with possibility to run four threads per core. Shared-memory multiprocessing APIs such as OpenMP are often used to develop parallel programs for such multi-core architectures. Multithreading can be achieved using OpenMP by extending sequential C/C++ and Fortran codes with compiler directives that take care of thread creation, data sharing, synchronization, scheduling etc. For example, a C/OpenMP implementation of a parallel large vector similarity measure computation application was used in [24] to test various thread affinities, but also allocating memory in large pages for improved data transfer rate. The latter is an example of an execution parameter specific for the used device - in this case the Intel Xeon Phi coprocessor.

Arguably, the most popular computing accelerators recently are gpus (graphics processing units). They consist of several streaming multiprocessors, which in turn consist of multiple processing elements known as CUDA cores. The name is connected with the CUDA parallel computing platform and API created by NVIDIA. In this execution model, the code written in a form of a kernel is executed by multiple threads at once. For example, the NVIDIA Tesla V100 data center gpu is equipped with 5120 CUDA cores and allows to run 163840 threads simultaneously. The threads are logically aligned in a hierarchy called grid that consists of a number of blocks constructed from a number of threads. Numbers of blocks in a grid and threads in a block can be arranged by the programmer in up to three dimensions. This setting is called grid configuration and its optimal values may depend on the gpu device model, but also on the application, its computation to communication ratio, code branches etc. Finding the optimal values often requires tuning through testing the application performance for multiple combinations of the grid size parameters. The gpu grid configuration is another example of a parallel application execution parameter.
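The arithmetic behind the grid configuration can be sketched as follows. The per-multiprocessor figures (80 streaming multiprocessors, 64 CUDA cores and at most 2048 resident threads each) are assumptions taken from the publicly documented V100 specification rather than from this text, and the candidate block sizes and the helper blocks_needed are hypothetical. An actual tuning run would measure kernel execution time for each candidate.

```python
def blocks_needed(n_items, threads_per_block):
    """Smallest number of blocks so that every item gets its own thread."""
    return (n_items + threads_per_block - 1) // threads_per_block

# Assumed V100 figures (not stated in the text above).
SM_COUNT = 80
CORES_PER_SM = 64
MAX_THREADS_PER_SM = 2048

cuda_cores = SM_COUNT * CORES_PER_SM              # 5120
resident_threads = SM_COUNT * MAX_THREADS_PER_SM  # 163840

# Candidate one-dimensional grid configurations (blocks, threads per block)
# for a hypothetical problem of one million independent items.
n_items = 1_000_000
candidates = [(blocks_needed(n_items, tpb), tpb)
              for tpb in (64, 128, 256, 512, 1024)]

for blocks, tpb in candidates:
    print(f"{blocks} blocks x {tpb} threads")
```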

A similar hierarchical application structure can be found in the OpenCL framework [45], which allows writing programs that can be executed across heterogeneous platforms, including gpus. Here, the equivalent of a grid is called NDRange and consists of work groups, which in turn consist of work items. The actual mapping of this structure to the computing device architecture depends on the chosen installable client driver (ICD). Multiple OpenCL implementations offered by different vendors support various computing devices. Different NDRange configurations can be optimal depending on the OpenCL implementation and the utilized computing device, which makes the selection of appropriate execution parameters even harder if the code is executed in a heterogeneous system.

Heterogeneous systems can be also programmed using OpenACC, a programming standard aimed for simplification of programming of cpu/gpu systems. Similarly to OpenMP, C/C++ and Fortran codes can be annotated using compiler directives responsible for parallelization, with a particular emphasis on parallelizing loops. The API defines also runtime functions responsible for device and memory management. An important aspect of programming with OpenACC is optimizing data locality by providing the compiler with additional information about the data location. This allows to reuse data on the same device and minimize data transfer, which can be particularly beneficial on systems where used devices have separate memories. Selecting the appropriate strategies of data creation, copying and address management is another example of a parallel application execution parameter.

An execution parameter crucial in terms of power efficiency is setting the "gear" of a computing device using the Dynamic Voltage and Frequency Scaling (dvfs) technique. Often the most efficient setting in terms of compute capacity is not the most power efficient, because modern computing devices often have asymmetric power characteristics. For example Volta, the latest NVIDIA gpu architecture at the time of this writing, is claimed by its vendor to achieve up to 80% of the peak performance at half the power consumption. Performance vs energy trade-offs can be found on both modern cpus and gpus [46]. The effects of scaling core voltage and frequency depend not only on the computing device architecture, but also on the application characteristics [47], which makes finding the optimal setting non-trivial.
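The vendor claim quoted above can be turned into a rough energy estimate. The sketch below assumes a fixed amount of work, execution time inversely proportional to performance, and constant power draw during the run. It ignores idle power of the rest of the system, so it is an illustration rather than a measurement.

```python
def relative_energy(relative_performance, relative_power):
    """Energy for a fixed amount of work, relative to the peak setting.

    Time scales as 1 / performance and energy = power * time.
    """
    relative_time = 1.0 / relative_performance
    return relative_power * relative_time

# 80% of peak performance at half the power (the "up to" figure quoted above).
reduced = relative_energy(0.8, 0.5)
print(f"relative time:   {1.0 / 0.8:.3f}")
print(f"relative energy: {reduced:.3f}")
```

Under these assumptions the computation takes 25% longer but consumes about 62.5% of the energy, which shows why execution time and energy consumption are conflicting objectives.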

Summarizing, the optimal values of execution parameters of parallel applications executed in shared memory systems are hard to find, because they often depend on each other, on algorithms used in the application, its input/output characteristics as well as utilized hardware. Finding the appropriate value often comes down to empirical verification and tuning.

2.2.2 Distributed memory systems

Computing systems in which each computing device has its own private memory are called distributed memory systems. Programming applications for such systems requires taking care of not only process synchronization, but also data transmission between the processes. Message Passing Interface (MPI) [48] is the de facto standard communication protocol for point-to-point and collective communication, with several well-tested and efficient implementations. The MPI interface provides the essential virtual topology, synchronization and communication functionality between a set of processes. In order to achieve the best possible performance of an MPI-based application, certain runtime parameters of the used MPI implementation should be optimized for the target platform, including the cross-over point for point-to-point operations between the eager and the rendezvous protocol, network specific values such as internal buffer sizes, and algorithms to be used for collective operations. OpenMPI, a popular MPI implementation, has been equipped with the Open Tool for Parameter Optimization (OTPO) [49], which systematically tests large numbers of combinations of OpenMPI runtime tunable parameters to determine the best set for the given platform, based on selected benchmarks.

Message passing interfaces such as mpi allow to implement applications employing many well known parallel processing paradigms [50]. The task-farming paradigm, also known as master/slave consists of two kinds of processes. The first one, master, is responsible for decomposing the problem into small tasks and iteratively distributing them across a farm of processes of the second kind, called slaves. The slave processes perform a simple cycle: get a message with the task, process the task, send the results back to the master. Paradigm-related application parameters in this case include the number of slaves used at each iteration, task allocation and load balancing strategies as well as granularity of problem decomposition which may influence message sizes.
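The task-farming cycle described above can be sketched in a few lines. This is a simplified illustration in which threads stand in for the slave processes of a message-passing implementation; the function names and the toy workload (a sum of squares) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_tasks(data, task_size):
    """Master: decompose the problem into tasks of a given granularity."""
    return [data[i:i + task_size] for i in range(0, len(data), task_size)]

def process_task(task):
    """Slave: process one task and return the result to the master."""
    return sum(x * x for x in task)

def task_farm(data, num_slaves, task_size):
    tasks = split_into_tasks(data, task_size)
    # The pool hands the next task to whichever worker becomes free,
    # which gives a simple form of dynamic load balancing.
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        partial_results = list(pool.map(process_task, tasks))
    # Master: combine the partial results.
    return sum(partial_results)

print(task_farm(list(range(10)), num_slaves=3, task_size=4))
```

Here num_slaves and task_size correspond to the number of slaves and the granularity of problem decomposition mentioned above as paradigm-related application parameters.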

In the Single-Program Multiple-Data (spmd) paradigm, data required by the application is split among available processors and each process executes the same code on a different part of the data. This paradigm is especially useful for problems with a geometric structure, where communication between the processes is spatially limited, for example physical problems and simulations. Efficient execution of spmd programs in multi-core environments often requires managing communication heterogeneities that cause unbalanced workload. For example, the methodology evaluated on an mpi spmd application in [51] allowed for 43% efficiency improvement through scheduling policies that determine the number of tasks to be assigned to each core, which enables overlapping of computation and edge communication as well as optimal load balancing. The number of tasks per each core is another example of an application parameter.

Another, more fine-grained paradigm which also reflects data dependencies of the application is data pipelining. Functional decomposition of the application allows organizing a pipeline of processes, each corresponding to a given stage of the algorithm. The efficiency of this paradigm depends on the capability of balancing the load across the stages of the pipeline. Efficient implementation of such pipelines becomes even more challenging in the case of real-time or near-real-time systems with small messages, due to the trade-off between the system throughput and latency. Processing the messages directly as they occur results in very frequent, small communication operations which significantly limit the throughput. Batching techniques can be used to improve the throughput at the cost of the latency. For example, incurring a small latency to group small messages together allows improving throughput in Kafka [52], a real-time publish-subscribe system that can handle more than 10 billion message writes each day. The incurred latency value can be tuned at the application level and is another example of an application parameter whose optimal value may depend on the characteristics of the target system.
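The throughput/latency trade-off of batching can be illustrated with a simple cost model. The model is an assumption made for this sketch and is not taken from [52]: every communication operation costs a fixed overhead plus a per-message cost, and messages arrive at a constant rate. All constants are invented.

```python
def throughput(batch_size, overhead=1.0e-3, per_message=1.0e-5):
    """Messages per second when each send costs overhead + per-message time."""
    return batch_size / (overhead + batch_size * per_message)

def added_latency(batch_size, arrival_rate=10_000.0):
    """Time the first message of a batch waits for the batch to fill."""
    return (batch_size - 1) / arrival_rate

for b in (1, 10, 100, 1000):
    print(f"batch={b:5d}  throughput={throughput(b):10.0f} msg/s  "
          f"added latency={added_latency(b) * 1e3:7.2f} ms")
```

In this model larger batches amortize the fixed overhead and raise the throughput, while the waiting time of the first message in each batch grows linearly with the batch size.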

Divide and Conquer (dac) is a well known approach in algorithm development which can be also used as a parallel processing paradigm. In this approach a problem is decomposed into subproblems which can be solved independently and their results are combined to give the final result. The problems can be decomposed recursively, which results in an execution tree that defines a hierarchy of split, compute and join computational operations. Given appropriate complexity proportions between the splitting and joining operations compared to the computing operations, performance of the dac approach can benefit from parallel execution. For example, a multi-threaded framework based on OpenMP proposed in [53] allowed to obtain speed-ups around 90 for an irregular adaptive integration code executed on an Intel Xeon Phi accelerator. The framework allows developing parallel dac applications by coding only basic dac constructs such as data partitioning, computations and result integration, while parallelization is handled by the contributed runtime layer. The degree of parallelism can be controlled by setting two application parameters: k_max which specifies the maximum depth of the execution tree and max_thread_count which is the total number of threads that can be run at any given time.
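The roles of the two parameters can be illustrated with a simplified sketch. It is not the API of the framework from [53]: the function names and the toy problem (summing a range of integers) are hypothetical, and the execution tree is first expanded to depth k_max, after which the leaves are computed by at most max_thread_count threads.

```python
from concurrent.futures import ThreadPoolExecutor

def split(lo, hi, depth, k_max):
    """Recursively halve [lo, hi) until the execution tree reaches depth k_max."""
    if depth >= k_max or hi - lo <= 1:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return split(lo, mid, depth + 1, k_max) + split(mid, hi, depth + 1, k_max)

def compute(bounds):
    """Solve one leaf subproblem independently."""
    lo, hi = bounds
    return sum(range(lo, hi))

def dac_sum(n, k_max, max_thread_count):
    leaves = split(0, n, 0, k_max)            # split phase
    with ThreadPoolExecutor(max_workers=max_thread_count) as pool:
        partial = list(pool.map(compute, leaves))  # compute phase
    return sum(partial)                       # join phase

print(len(split(0, 1000, 0, 3)), dac_sum(1000, k_max=3, max_thread_count=4))
```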

Frameworks for automatic parallelization of computations are very much needed, for instance in the field of big data processing. For example MapReduce [54] is a programming model similar to Divide and Conquer where the programmer specifies two functions: map and reduce. Given a single key/value pair, the map function generates a set of intermediate key/value pairs. Intermediate values with the same key are merged by the reduce function. This simple model allows to express many real world tasks, especially in the field of big data processing. Due to the availability of tools for automatic parallelization of programs written in the MapReduce paradigm such as Hadoop [55], programmers can relatively easily utilize the resources of a large distributed system without any experience in parallel computing. Although input data partitioning, scheduling, failure and communication management are handled by the runtime system, certain optional execution parameters can be tuned, including maximum number of used machines and memory limits for each map and reduce task.
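The programming model can be illustrated with the classic word-count example. The sketch below runs sequentially in a single process and only shows the contract between the map and reduce functions; it is not the Hadoop API, and the helper map_reduce is hypothetical. A runtime such as Hadoop would additionally partition the input and distribute the work.

```python
from collections import defaultdict

def map_fn(key, value):
    """Emit (word, 1) for every word; key is the document name (unused)."""
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    """Merge all intermediate values that share the same key."""
    return sum(values)

def map_reduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)   # group intermediate values by key
    return {ikey: reduce_fn(ikey, ivalues) for ikey, ivalues in groups.items()}

documents = [("doc1", "x y x"), ("doc2", "y z")]
print(map_reduce(documents, map_fn, reduce_fn))
```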

One limitation of the dac-based computing paradigms is that they support only acyclic data flow, while in many practical applications a working set of data needs to be reused across multiple parallel operations. Spark [56] is an example of a popular framework that supports these applications by introducing data abstraction called resilient distributed datasets - read-only collections of objects partitioned across a set of machines. An operation invoked on such a dataset results in creating a task to process each partition of the dataset and sending these tasks to preferred worker nodes using a technique called delay scheduling [57]: when the job that should be scheduled next according to fairness cannot launch a local task, it waits for a small amount of time, letting other jobs launch tasks instead. This allows for significant performance improvements for chosen applications, especially those with iterative and interactive computations. Certain execution parameters can be tuned in Spark, including job scheduling policies, data serialization, memory management policies, data locality and level of parallelism.

2.2.3 Multi-level and Heterogeneous Systems

In paper [58] hybrid parallel programming has been narrowed down to combining distributed memory parallelization on a node inter-connect with shared memory parallelization inside of each node. The presented experiments included results from executions of applications written using message passing (namely mpi) on the distributed memory level and directive-based parallelization (OpenMP) on the shared memory level. Various schemes of mpi + OpenMP programming models have been discussed and compared based on benchmarks executed on several computing platforms. The paper underlines the importance of the communication and computation overlapping optimization technique. In our work this method falls into the preliminary process optimization step, one of the execution steps proposed in Section 5.1 and was used in the experiments described in Section 6.1.2.

Authors of [59] argue that although using mpi for internode communication and another shared-memory programming model for managing intranode parallelism has become the dominant approach for utilizing multi-level systems, the significant downside of this approach is complexity of using two APIs in the same application. They propose a shared-memory programming model which is incorporated into mpi as an extension to its communication interface. The implementation allows to automatically adjust the shared-memory mapping, which for several provided use-cases resulted in improved performance.

Utilization of large-scale multi-level computing architectures using a single API is possible in the partitioned global address space (pgas) [60] parallel programming model. In this model, a number of parallel processes jointly execute an algorithm by communicating with each other via a single shared memory address space. The memory is conceptually shared (technically realized by several interconnected memories), which aims to improve programmer productivity. At the same time, the pgas model provides additional abstractions to distinguish between local and remote data accesses, in order to allow implementations aiming for high performance. About a dozen languages exist that adhere to the pgas model [60]. For example, Java programs can use the PCJ library [61], which works on multi-node multi-core systems and hides the details of inter- and intra-node communication. Experiments involving running a set of hpc applications show good performance scalability of the PCJ library compared to native implementations of the same algorithms in C++ with mpi and Java 8 with the parallel streams mechanism, while maintaining the usability typical for the pgas model implementations.

Many solutions for executing applications utilizing a single parallelization API on multi-level computing systems are also aimed at utilizing heterogeneous devices [62, 63, 64, 65, 66, 67]. For example, Many GPUs Package (MGP) [62] provides C++ and OpenMP APIs that allow extending existing applications executed on one hosting-node, so that they can transparently utilize cluster-wide devices. What is more, the package includes also an implementation of OpenCL specifications that allows executing OpenCL code on a cluster with many cpu and/or gpu devices without any modifications of the code. This reduces the complexity of programming and running parallel applications on a cluster, especially since MGP provides an API for scatter-gather computations and takes care of dependencies between the split sub-tasks, queuing and scheduling them as well as managing buffers.

Frameworks for multi-level parallel computing on heterogeneous systems are also developed with specific type of application in mind. For example, TensorFlow [3] is an interface for expressing machine learning algorithms and an implementation for their execution. TensorFlow applications are constructed from operations which constitute a dataflow graph, where each node has zero or more inputs and zero or more outputs. Many implementations can be provided for each operation that can be run on a particular type of device. This way, various device types can be used by one application in order to efficiently utilize a heterogeneous system. Various kinds of parallelism can be expressed through replication and parallelization of the graph, including model parallelism and data parallelism. Additionally, an operation implementation can be parallel itself, using for example the parallel capabilities of a gpu, which means that TensorFlow applications can be multi-level.

There are solutions for executing parallel applications in hybrid multi-level systems that take into account energy efficiency. For example the authors of [68] argue that high-end gpu-based accelerators feature a considerable energy consumption, so a solution is needed that would enable each node in a cluster to run efficient computations on a gpu while avoiding attaching a gpu to every node of the cluster. The proposed rCUDA framework for remote gpu acceleration allows this, introducing only a small overhead for chosen sample applications from CUDA SDK.

Energy efficiency has been considered also in the context of hybrid cpu/gpu architectures in [9] where GreenGPU, a framework for gpu/cpu heterogeneous architectures was proposed. The solution provides load balancing between cpu and gpu by dynamic splitting and distributing workloads based on its characteristics, so that idle energy consumption is minimized. Additionally, gpu core and memory frequency as well as cpu frequency and voltage are dynamically throttled based on their utilization. The holistic approach that includes workload division and frequency scaling achieves an average energy saving of 21.04% and only 1.7% longer execution for the kmeans and hotspot workloads from the Rodinia benchmark [69].

2.3 Modeling and Simulation of Parallel Applications in Heterogeneous HPC Systems

Formulating the optimization problems in Chapter 1 and describing the proposed optimization methodology in Chapter 5 required defining a formal model of hybrid parallel application execution in a heterogeneous hpc system. The model is also necessary for performing simulation which is proposed as a method for fast evaluation of process mappings and application execution parameters in Section 5.2.

The proposed model assumes that the a priori knowledge about the behavior of processes in the system is limited and precise analytical models of execution time and power consumption of the application are hard to formulate. Instead, a general execution model of hybrid parallel applications in heterogeneous hpc systems is proposed, that allows to express many practical applications. In this section we describe chosen existing approaches to modeling and simulation of parallel application execution in heterogeneous hpc systems. In Section 2.3.1 we focus on the system models. We divide the considerations of application modeling to execution time models described in Section 2.3.2 and power consumption models described in Section 2.3.3. Finally, in Section 2.3.4 we describe chosen simulation approaches.

2.3.1 Modeling Heterogeneous HPC Systems

A heterogeneous hpc system is basically a set of computing devices of various types connected by a network. In numerous approaches to optimization of parallel application execution, the system is modeled just as a set of computing devices with certain characteristics. For example, in [70] a heterogeneous computing system is modeled as a set of heterogeneous processors with certain characteristics, including voltage and frequency. The heterogeneity of the system is accounted for by including these hardware characteristics in the considered objective functions such as performance and power consumption. The formulas used to compute the objective functions are usually defined within the system model, aimed at a certain optimization problem setup. Similarly, the model considered in this thesis assumes that execution time functions are defined for all computation and communication operations, and that idle and stress power consumption functions and core numbers are defined for all computing devices. The model allows for different specifications of these functions and hardware characteristics, which makes it possible to express various specific problem formulations within the proposed framework.

Treating an hpc system simply as a set of computing devices does not take into account the network which connects the devices and its topology; however, it is sufficient for many optimization problems. In the cases when network topology is considered, it is usually modeled as a graph. For example, in [71] the topology of the computing platform is denoted as a host graph, where each vertex denotes a processor or a switch and each edge represents a direct link or cost of communication. Although we do not put particular emphasis on network topology in our work, we decided to adopt a graph-based system model, in order to emphasize the importance of the network in hpc systems and to express the network-aware optimizations described in Section 6.1.2.

2.3.2 Modeling Execution Time of Parallel Applications

According to the nomenclature introduced in [72], there are three basic approaches to modeling an application understood as a set of modules that make up the computation. In the first one, the modules execute independently and there is no communication between them. This is not the case in our work, because we assume possible communication between the processes.

The second approach is the task-based model, where the modules are called tasks and are arranged in a directed acyclic graph (dag). The model is commonly used by researchers interested in scheduling problems, for example [73, 74, 75, 76, 77, 71]. The application is modeled as an acyclic graph with nodes representing tasks and edges representing the precedence relationship between them. Weights are associated with nodes, representing task execution time, as well as with edges, denoting communication time between the connected tasks. There are various forms of this model. For example, in the macro-dataflow model [78], there is a limited number of computing resources to execute the tasks. Given that one task is a predecessor of another, if both tasks are assigned to the same processor, no communication overhead is paid. On the contrary, if the two tasks are assigned to different processors, a communication delay is paid, which depends upon both tasks and both processors. Another example is the data stream graph (DSG) model [77], which emphasizes data streams, modeled as directed edges in the task precedence graph. Each edge is characterized by a tuple containing an identifier and the communication cost of this edge. In our work we do not employ the task-based model, because we assume no a priori knowledge about the precedence relationships between the operations.
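The macro-dataflow rule described above can be illustrated with a minimal Python sketch. The function name `finish_times`, the task names and all weights are hypothetical, and the sketch makes two simplifying assumptions that are not taken from [78]: the communication delay depends only on the edge (not on the pair of processors), and each processor executes its tasks one at a time in topological order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def finish_times(exec_time, edges, assignment):
    """Compute task finishing times for a DAG under a fixed mapping.

    exec_time:  task -> execution time
    edges:      (predecessor, successor) -> communication delay
    assignment: task -> processor
    """
    preds = {task: [] for task in exec_time}
    for pred, succ in edges:
        preds[succ].append(pred)

    proc_free = {}  # processor -> time at which it becomes free
    finish = {}
    for task in TopologicalSorter(preds).static_order():
        proc = assignment[task]
        ready = proc_free.get(proc, 0.0)
        for pred in preds[task]:
            # No communication cost when both tasks share a processor.
            delay = 0.0 if assignment[pred] == proc else edges[(pred, task)]
            ready = max(ready, finish[pred] + delay)
        finish[task] = ready + exec_time[task]
        proc_free[proc] = finish[task]
    return finish


# Hypothetical diamond-shaped task graph: A -> {B, C} -> D.
exec_time = {"A": 2, "B": 3, "C": 4, "D": 1}
edges = {("A", "B"): 1, ("A", "C"): 1, ("B", "D"): 2, ("C", "D"): 1}
one_proc = finish_times(exec_time, edges, dict.fromkeys(exec_time, "P0"))
two_procs = finish_times(exec_time, edges,
                         {"A": "P0", "B": "P0", "C": "P1", "D": "P0"})
print(max(one_proc.values()), max(two_procs.values()))
```

In this toy instance, moving task C to a second processor shortens the makespan even though two communication delays are paid, which is the kind of trade-off a scheduler operating on this model has to resolve.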

The third approach is the process-based model, where processes are arranged in undirected graphs in which an edge represents the volume of communication between the processes. Despite the coincidence of names, the processes considered in this work are different. In this work, the equivalent of such processes are operations. The sequences of these operations are not known a priori and depend, in turn, on the processes defined in this work. It should be noted that all three approaches to modeling parallel applications can be expressed using the model proposed in this thesis by proper implementation of the computation and communication sequences.

Computation and communication operations have been distinguished in the model used in [79]. The considered problem is mapping a set of independent applications onto a set of heterogeneous processors including cpus and gpus. Each application consists of several computational kernels, which are chained together according to some computation logic. However, unlike in our approach, these dependencies between kernels are known and modeled as edges in the computation graph. The execution time of each kernel on each processor is estimated a priori using the sampling functionality of the StarPU [80] task scheduling platform.

One apparent way to create a model of an existing parallel application in the framework proposed in this thesis is to analyze its source code and indicate which fragments can be simplified to computation and communication operations. Then, simulation can be performed in order to establish the exact sequence of operations resulting from the code logic and communication between the processes. The researcher responsible for the modeling process has to decide how detailed the model should be, bearing in mind that the more detailed the model, the more computationally costly it is to perform simulation of the application. It should be noted that distinguishing the appropriate code regions could be done automatically. For example, in the PATUS framework [81] a strategy mechanism is used to separate the parts of the code to be optimized through code generation (stencil computations) from the parts related to parallelization and bandwidth optimization.

Apart from the dependencies between certain parts of an application, the application model should define execution time of each part. The general model proposed in this thesis in Section 1.2 delegates this task to the computation time function which can adopt different forms depending on the specific model. For example, the computation time functions for models of applications described in Section 4.1.4 and Section 4.1.5 are defined as functions of the number of floating point operations of the computation operation (which in turn depends on the input data size) and the floating point operation performance of the used processor.

Many optimization approaches are based on an assumption that a matrix is given with estimated times to compute (etc) each task on each processor (see also Section 3.2.3). Some application modeling approaches propose how to construct this matrix based on the task and processor characteristics. For example in [70] the application is modeled as BoT (Batch of Tasks or Bag of Tasks [82]), meaning a set of independent tasks that belong to different applications and can be run simultaneously. The etc matrix is derived from the number of cycles of a task and frequency of the processor for each task-processor pair. A similar derivation can be used within the approach proposed in this thesis, through proper implementation of the comptime modeling function (see Section 5.2). On the other hand, the approach proposed in this thesis is different from the BoT approach, because there are communication dependencies between the operations (which are the equivalent of tasks).
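As a minimal sketch of the derivation mentioned above, an etc matrix can be built by dividing the number of cycles of each task by the frequency of each processor. The function name `etc_matrix` and the sample values are hypothetical, and the sketch assumes that one cycle takes exactly one clock period on every processor.

```python
def etc_matrix(task_cycles, proc_freq_hz):
    """Return etc[i][j]: estimated time (s) to compute task i on processor j."""
    return [[cycles / freq for freq in proc_freq_hz] for cycles in task_cycles]


# Two hypothetical tasks (2e9 and 4e9 cycles) on 1 GHz and 2 GHz processors.
etc = etc_matrix([2e9, 4e9], [1e9, 2e9])
for row in etc:
    print(row)
```

A more detailed model would replace the division with any other function of task and processor characteristics, while the rest of an optimization algorithm that consumes the matrix stays unchanged.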

2.3.3 Modeling Power Consumption of Parallel Applications

According to [82], among memory, disks, networks, fans, cooling system and other components of a heterogeneous computing system, a significant portion of energy is consumed by its processors. As in that paper, in this thesis only energy consumption of the computing devices available in the system is considered.

Many scheduling optimization approaches focus on dvfs-enabled hpc systems [82, 83, 77, 84, 85], where processor cores can operate in discrete performance states. This capability could be included in the system model. For example, a performance predictor for managed multithreaded applications proposed in [40] can accurately predict the impact of scaling voltage and frequency of a processor. The authors of [86] claim that even dynamic processor shutdown should be considered as an option for reducing power consumption of embedded multiprocessors, due to the trend of significantly increasing static power consumption.

Analogously to the etc matrix used for execution time modeling, some optimization solutions assume that an average power consumption (apc) matrix is given that defines average power consumption of a certain type of task on a certain type of machine. The authors of [87] indicate that these matrices are usually obtained from historical data in real environments. Some approaches are based on assumptions strictly connected with the nature of their considerations, like for example in the topology-aware task mapping approach [71], where energy required for data transmission between processors depends directly on the total traffic load of the interconnect network. In [70] energy consumption of executing a certain task on a certain processor depends on two components: dissipation power and dynamic power. The first one is a static property of the processor. The second one depends on another static property, namely physical capacitance, as well as two values decided at runtime: supply voltage and frequency of operation. This approach resembles the one adopted in this thesis, described in Section 5.2, where idle power consumption is the static component and the dynamic component is derived from current runtime parameters.
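The two-component energy model described above can be sketched as follows. The formula for dynamic power (capacitance times squared voltage times frequency) is the common CMOS approximation and is used here as an assumption; it is not claimed to be the exact formula of [70]. The function name `task_energy` and all numeric values are hypothetical.

```python
def task_energy(cycles, freq_hz, voltage_v, capacitance_f, static_power_w):
    """Energy (J) of one task: (static + dynamic power) * execution time."""
    exec_time_s = cycles / freq_hz
    dynamic_power_w = capacitance_f * voltage_v ** 2 * freq_hz
    return (static_power_w + dynamic_power_w) * exec_time_s


# The same hypothetical task at a fast and at a slow voltage/frequency setting.
fast = task_energy(2e9, 1.0e9, 1.0, 1e-9, 0.1)
slow = task_energy(2e9, 0.5e9, 0.8, 1e-9, 0.1)
print(fast, slow)
```

With these sample values the slower setting takes twice as long but uses less energy, because the drop in dynamic power outweighs the static power paid over the longer execution. With a larger static component the relation can reverse.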

An important decision that must be made in modeling energy consumption of parallel applications is whether the idle energy consumption of the computing devices should be included in the total energy consumption. In some approaches the idle energy consumption is not taken into account. For example, in the approach to energy-efficient scheduling of batch-of-tasks applications on heterogeneous computing systems proposed in [70], the overall energy consumption is computed as a sum over all task executions, assuming that due to static scheduling the idle slots are negligible. In the experiments presented in [87] all the systems have zero idle power consumption, but setting power consumption to 10% of the mean power for each machine type is used to model the case when the server is powered off but a management controller is still responsive to remote power on signals.

Some approaches to multi-objective time, power and energy optimization of hpc applications focus on finding a set of Pareto-optimal solutions, in order to give the system administrator or programmer the chance to choose the most appropriate solution according to their needs. The authors of [88] propose to take into account a measurement error margin, which enlarges the set of optimal solutions so that no potentially important solution is overlooked.

Analogously to execution time of computation and communication operations, the simulation method proposed in this thesis in Section 5.2 delegates the task of power consumption modeling to the idle and stress power consumption functions, which can adopt different forms depending on the specific model. Although these aspects were not considered in the experiments within this thesis, the proposed model allows including the dvfs states and processor shutdown in the energy consumption function and passing the chosen state from the application model as a parameter. The same applies to using an apc matrix, focusing only on the network traffic or considering physical parameters of the processors.

The model used in this thesis considers power consumption in idle state of the used hardware as well as additional power consumption under stress. In this approach the idle power consumption can be configured to a certain value or a percentage of the stress power consumption. In particular, experiments that assume no idle power consumption can be configured by setting this value to zero.
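A minimal sketch of such a configuration is given below. It is not the implementation used in this thesis: the function names `idle_power` and `device_energy` are hypothetical, and the sketch assumes that the stress power value denotes the total power drawn by a device while it is busy.

```python
def idle_power(stress_power_w, value_w=None, fraction=None):
    """Idle power given as an absolute value or as a fraction of stress power."""
    if value_w is not None:
        return value_w
    if fraction is not None:
        return fraction * stress_power_w
    return 0.0  # experiments that assume no idle power consumption


def device_energy(idle_w, stress_w, busy_s, total_s):
    """Energy (J) of a device that is busy for busy_s out of total_s seconds."""
    return stress_w * busy_s + idle_w * (total_s - busy_s)


idle_w = idle_power(200.0, fraction=0.25)
print(device_energy(idle_w, 200.0, busy_s=6.0, total_s=10.0))
print(device_energy(idle_power(200.0), 200.0, busy_s=6.0, total_s=10.0))
```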

2.3.4 Simulation of Parallel Application Execution

In computer simulation, the role of a simulator is to compute consecutive states of a model which is a simplified representation of a certain real object. Models used for simulation of parallel application execution in heterogeneous hpc systems have been discussed in Sections 2.3.1, 2.3.2 and 2.3.3. In this section we provide examples of how such models can be used by a simulator.

A crucial aspect of computer simulation is choosing the succeeding points in time for which the simulator should calculate consecutive states of the modeled system. The majority of approaches to simulation of parallel application execution are based on discrete-event simulation [14], which means that the consecutive states are computed for consecutive events that occur in a discrete sequence in time. An event might represent the end of processing of a certain task by a computing device or transmission of a certain message between two devices. The sequence of events depends on the used application model. The approach used for simulation in this thesis is also based on discrete-event simulation.
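The principle of discrete-event simulation can be sketched with a priority queue of pending events ordered by time. The snippet below is only an illustration, not the simulator used in this thesis: the function name `simulate` and the device names are hypothetical, each device is assumed to execute a fixed list of operations back to back, and the only event type is the end of an operation.

```python
import heapq


def simulate(device_ops):
    """device_ops: device name -> list of operation durations.

    Returns the list of processed events as (time, device, operation index).
    """
    queue = []  # pending events: (event time, device, finished operation index)
    for device, ops in device_ops.items():
        if ops:
            heapq.heappush(queue, (ops[0], device, 0))

    log = []
    while queue:
        now, device, index = heapq.heappop(queue)  # jump to the next event
        log.append((now, device, index))
        ops = device_ops[device]
        if index + 1 < len(ops):
            # Finishing one operation schedules the end of the next one.
            heapq.heappush(queue, (now + ops[index + 1], device, index + 1))
    return log


events = simulate({"cpu": [3, 3], "gpu": [1, 1, 1]})
print(events)
print("simulated execution time:", events[-1][0])
```

The simulated clock advances directly from one event to the next, so the cost of a simulation depends on the number of events rather than on the length of the simulated time span.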

SimJava [89], a layered discrete-event simulation framework based on the Java programming language, is a foundation for many simulation environments aimed at specific aspects of computing systems. For example, GridSim [90] is a toolkit for modeling and simulation of distributed resource management and scheduling, focusing on components of grid systems. It allows for modeling and simulation of heterogeneous grid resources, users and application models and provides primitives for creating application tasks and mapping tasks to resources. Another discrete-event simulator is OMNeT++ [91], a C++ simulation library and framework focusing on computer networks and other distributed systems. It is mainly used for building network simulators, but the MARS [92] framework based on it also contains modules modeling computing components. In that case, the events in simulation depend on replaying traces of mpi calls.

The approaches to simulation based on traces of previous application executions are called trace-driven and can combine different types of traces with different application models. For example, SST/macro [93] uses an application model based on mpi along with mpi trace files in two formats: Open Trace Format and DUMPI. A trace-driven approach has also been proposed for assessing the performance of hierarchical dataflow applications in RADA [94]. The simulation environment used for simulations in this thesis is not trace-driven, but tuning of the model to the results of real application executions is suggested. There are approaches to simulated evaluation of parallel application execution where the application model is tuned to some existing combinations of execution parameters. For example, this is the case in [95], where curve fitting is used to obtain model parameters for the purpose of evaluating different scheduling policies.

The authors of [94] in further work [96] point out that the performance of simulations is often limited, because synchronization of a large number of processes/threads is required. To overcome the limited scalability, they proposed to use virtualization based on Linux Containers. In the environment used for simulations in this thesis, the number of simulation threads is unlikely to become prohibitively large for one simulation, because instances of processes with the same process implementation running on the same computing device are aggregated and handled by one simulation thread. The simulator is also designed for performing multiple concurrent simulations by utilizing threads and distributed processes connected to a queue of simulation instances, which allows for high scalability.

Machine learning along with traces of previous application executions can be used for evaluating application execution time instead of an application model. For example, a formulation-based predictor is used for evaluating Hadoop MapReduce application execution time in order to find approximately optimal solutions by tuning the application parameters in [97]. The predictor fits to logs of previous Hadoop MapReduce jobs and uses a 2-level machine learning ensemble model consisting of random forest, extra trees and gradient boosting regressors to predict the median, standard deviation and wave of the application execution time.

Established toolkits for studying the operation of large-scale distributed computing systems, such as SimGrid [98], provide a good basis for implementation and simulation of a wide range of algorithms. However, the authors of [99] point out that in most cases the experiments have to be developed from scratch, using just the basic functionality of the toolkit, and the experiments are rarely useful for other researchers. To address this issue, the authors propose the Grid Scheduling Simulator (GSSIM), a comprehensive simulation tool for distributed computing problems. GSSIM includes an advanced web portal for remote management and execution of simulation experiments, which allows many researchers to share workloads, algorithms and results. The MERPSYS environment co-developed by the author of this thesis and described in Section 5.2 also allows users to share application and system models as well as simulation results, thanks to the provided simulation database and Web application.

3.1 Multi-objective Optimization of Parallel Applications

The outline of this section is as follows: In Section 3.1.1 we provide a formulation of the multi-objective optimization problem, discuss the possible approaches to taking many objectives into account in the optimization process and choose the Pareto method for consideration in this work. In Section 3.1.2 we describe chosen existing solutions to the problem of optimization of execution time under power consumption constraints, which is investigated in Section 6.1.2 of this thesis. In Section 3.1.3 we discuss examples of existing work which show that a trade-off exists between execution time and power consumption in various hardware configurations and executed applications. Finally, in Section 3.1.4 we provide examples of existing solutions considering Pareto optimization in the field of parallel computing.

3.1.1 Multi-objective Optimization Problem Formulation

In general, the goal of solving an optimization problem is to find a point in a given decision space, for which the value of a given objective function is minimal.

Definition 15.

Let $f: \mathcal{D} \rightarrow \mathbb{R}$ denote an objective function of an optimization problem, where $\mathcal{D}$ is the decision space.

For example, in our work the objectives are application execution time and power consumption. Usually there are certain constraints which have to be fulfilled in order for the decision point to be a feasible solution to the problem. In this work the constraints may be related to numbers of available computing devices, possible sizes of data chunks, numbers of processes required by the application etc. The constraints are often expressed as mathematical functions; however, a more general statement would be that the decision point should belong to a feasible set, which results from the constraints.

Definition 16.

Let $X \subseteq \mathcal{D}$ denote the feasible set of an optimization problem.

Having defined the objective function $f$ and the feasible set $X$, a mono-objective optimization problem can be stated as follows:

$$\min_{x} f(x)$$

subject to

$$x \in X$$

The convention in optimization problems assumes minimization of a function; however, the definition can be easily used for maximizing (for instance real-time application throughput), by using the function $-f$ as an objective. Because there is only one objective, the definition of optimum is straightforward:

Definition 17.

A point $x^* \in X$ is an optimum of an optimization problem if and only if:

$$\forall x \in X: \; f(x^*) \le f(x)$$

In the case of multi-objective optimization problems, defining the optimal solution is less straightforward, because the objective consists of $k$ objective functions:

Definition 18.

Let vector $F(x) = [f_1(x), f_2(x), \ldots, f_k(x)]^T$ denote a vector of objective functions, where $k$ is the number of simultaneously optimized objectives.

Then, following the train of thought proposed in [102], the optimization problem can be written as:

$$\text{``min''}_{x} \; F(x)$$

subject to

$$x \in X$$

The quotation marks in the "min" notation are used for a reason. In the case of multiple objectives, different interpretations can be associated with the minimization operator. Various interpretations have been discussed in [102]. In the field of energy aware scheduling, a taxonomy of optimization approaches has been presented in [75]. The multi-objective approaches are divided into three classes: aggregation, lexicographic and Pareto.

The traditional lexicographic approach assumes some preference order of the objectives. A set of mono-objective problems is solved sequentially for the consecutive objectives until a unique solution is found.

In the case of aggregation methods, a function is used to transform the multi-objective optimization problem into a mono-objective one by combining the objective functions into a single aggregation function. For example, the authors of [70] propose algorithms which optimize two objectives (minimization of makespan and energy consumption) at the same time. The objectives are combined into one weighted aggregation cost function. A parameter denoting the weight of each of the objectives is set to the value 0.5, so that both objectives are equally prioritized.
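A weighted aggregation of two objectives can be sketched as follows. The exact cost function of [70] is not reproduced here: the function name `aggregate_cost`, the normalization by the maximum observed value of each objective and the sample candidates are assumptions made for illustration.

```python
def aggregate_cost(makespan, energy, max_makespan, max_energy, weight=0.5):
    """Weighted sum of normalized makespan and energy (lower is better)."""
    return (weight * makespan / max_makespan
            + (1.0 - weight) * energy / max_energy)


# Hypothetical candidate solutions as (makespan in s, energy in J).
candidates = [(10, 100), (20, 50), (12, 60)]
max_makespan = max(m for m, _ in candidates)
max_energy = max(e for _, e in candidates)

best = min(candidates,
           key=lambda c: aggregate_cost(c[0], c[1], max_makespan, max_energy))
print(best)
```

Normalization matters because makespan and energy are expressed in different units and value ranges; without it, the objective with the larger numeric range would dominate the sum regardless of the chosen weight.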

In this work we focus on the third approach: the Pareto method. In this case, instead of one point, the solution to the optimization problem is a set of Pareto-optimal points:

Definition 19.

We say that $x \prec y$ ($x$ dominates $y$) if $f_i(x) \le f_i(y)$ for all $i \in \{1, \ldots, k\}$ and $f_j(x) < f_j(y)$ for at least one $j$, where $F(x) = [f_1(x), \ldots, f_k(x)]^T$ is the vector of objective functions. Point $x^* \in X$ is Pareto-optimal if there is no $x \in X$ such that $x \prec x^*$, where $X$ is the feasible set. Let $P^*$, the set of all Pareto-optimal points, denote the Pareto set.

In other words, the solution to the optimization problem is the set $P^*$ of all such points $x^*$ that no other feasible point is at least as good in every objective and strictly better in at least one. This means that in the case of the two objectives considered in this work, if a point $x^*$ is in the Pareto set $P^*$, then for every other feasible point with different objective values the point $x^*$ has lower execution time or lower power consumption (or both).

Additionally, it is important to distinguish the notion of Pareto front:

Definition 20.

Let $F(P^*) = \{F(x^*) : x^* \in P^*\}$, the set of objective function values of all Pareto-optimal points, denote the Pareto front.
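For a finite set of evaluated points, the Pareto front can be extracted by a direct pairwise dominance check, as in the minimal sketch below. The function names `dominates` and `pareto_front` and the sample values are hypothetical; the check uses the standard definition of dominance for minimized objectives (no worse in every objective and strictly better in at least one) and has quadratic complexity in the number of points.

```python
def dominates(a, b):
    """True if objective vector a dominates b (all objectives minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Return the objective vectors that are not dominated by any other one."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (execution time in s, power consumption in W) measurements.
measured = [(10, 200), (12, 150), (15, 150), (20, 100), (11, 210)]
print(pareto_front(measured))
```

In this example the point (15, 150) is dropped because (12, 150) is faster at the same power, and (11, 210) is dropped because (10, 200) is better in both objectives.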

As indicated in [88], the Pareto front contains significantly richer information than one obtains from single-objective formulations. The Pareto front provides not only solutions globally optimal in terms of the consecutive objectives, but also fully illustrates the trade-off between the objectives. The authors notice that a multi-objective formulation is needed for the auto-tuning problem. In the context of trade-offs between time, energy and resource usage, the authors of [103] state that the multi-criteria scenario "requires a further development of auto-tuners, which must be able to capture these trade-offs and offer the user either the whole Pareto set or a solution within it".

Indeed, the authors of [104] state that multi-objective approaches have not been extensively researched in the past and stress the advantages of generating the whole Pareto front for the problem of multi-objective auto-tuning of programs. Firstly, the user can visually explore the Pareto set and select the solution which fits their interest best. Secondly, some kind of aforementioned aggregation method can be used for automatic selection of an optimal solution. Having access to the whole Pareto front before constructing an aggregation function allows for normalizing the function in order to avoid drawbacks resulting from different value ranges of the objective functions. Finally, the authors notice that computing the Pareto front does not necessarily require additional computational effort compared to a single-objective approach.

Summing up, in this work we focus on the multi-objective Pareto optimization problem of execution time and power consumption defined in Equation 1.1 and two related mono-objective optimization problems defined in Equations 1.2 and 1.3. The difference between the two latter lies in the feasible set: the first one is constrained by the power consumption limit.

3.1.2 Optimization of Execution Time Under Power Consumption Constraints

A particular type of multi-objective optimization problem is when there are strict constraints imposed on one of the objectives. In the case of the bi-objective problem formulation considered in our work this would mean a strict deadline for the execution or limit of power consumption of the computing system utilized by an application. For example, multi-objective optimization using Particle Swarm Optimization (PSO) algorithm with energy-aware cost function and task deadlines has been proposed in [105] for partitioning tasks on heterogeneous multiprocessor platforms. In both cases of deadlines (energy or performance), although there are two objectives taken into account, a solution to the problem can be a single point like in the single-objective problem in Definition 17, provided that the constraint on the other objective is fulfilled.
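For a finite set of evaluated configurations, the constrained formulation reduces to filtering the feasible points and minimizing the remaining objective, as in this minimal sketch. The function name `fastest_under_power_limit` and the sample values are hypothetical.

```python
def fastest_under_power_limit(configs, power_limit_w):
    """configs: list of (execution time in s, power in W) tuples.

    Returns the fastest configuration within the power limit, or None
    if no configuration is feasible.
    """
    feasible = [c for c in configs if c[1] <= power_limit_w]
    return min(feasible, key=lambda c: c[0]) if feasible else None


configs = [(10, 200), (12, 150), (20, 100)]
print(fastest_under_power_limit(configs, 160))
print(fastest_under_power_limit(configs, 50))
```

The power consumption still influences the result, but only through the feasible set, so the answer is a single point rather than a set of trade-offs.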

Optimization of throughput of single power-constrained gpus has been investigated in [106]. The authors notice that although throughput is proportional to the product of the number of operating cores and their frequency, because of limited parallelism of some applications, it might be beneficial to scale these parameters. A technique has been proposed for dynamic scaling of the number of operating cores, the voltage/frequency of cores and bandwidth of on-chip interconnects/caches and off-chip memory depending on application characteristics. Experimental results of executing 15 CUDA applications from GPGPU-Sim [107], Rodinia [69] and ERCBench [108] benchmark suites show that the proposed technique can provide on average 20% higher throughput than the baseline gpu under the same power constraint. Although the benefits from applying the proposed technique have been presented through extensive exploration of arbitrarily chosen combinations in the search space, the work leaves open questions in the field of optimization: how to automatically find the optimal values using an optimization algorithm and simulations or real executions.

Optimization of hpc system throughput with power constraints has also been considered on a larger scale of parallelization in data centers [95]. The problem of maximizing throughput of hpc data centers under a strict power budget has been formulated as an Integer Linear Programming (ilp) problem. The proposed online resource manager uses overprovisioning [109], power capping through the RAPL interface [110] and moldable/malleable job scheduling to achieve high job throughput of power-constrained data centers. Both real experiments on a 38-node Dell PowerEdge R260 cluster and simulations of large scale executions show improvements in job throughput compared to the well-established power-unaware SLURM [111] scheduling policy.

Because taking into account all variables for start and end time of jobs would make the ilp problem "computationally very intensive and thus impractical in many online scheduling cases", the authors proposed a greedy objective function that maximizes the sum of the so-called "power-aware speedup" for all jobs that are ready for execution. This "power-aware speedup" is a value resulting from the Power Aware Strong Scaling (pass) model contributed in the paper, which is a power-aware extension of the model for speedup of parallel programs proposed in [112]. Using the single objective function made the ilp problem computationally adequate for online scheduling.

A strict constraint on execution time has been imposed on multi-threaded applications optimized in terms of energy consumption due to off-chip memory accesses in [113]. In the proposed dvfs-based algorithm, the throughput-constrained energy minimization problem has been formulated as a multiple-choice knapsack problem (MCKP). A cost function is defined for binary variables which denote whether a certain frequency level should be assigned to a given process. The algorithm uses a performance model based on regression of data points reported by hardware performance counters and a power model that focuses on numbers of floating point instructions, branch instructions, L1 and L2 data cache references and L2 cache misses. Similarly to the scheduling algorithms described in Section 3.2.3, the proposed optimization algorithm assumes strong a priori knowledge about the application behavior.
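The structure of a multiple-choice knapsack problem can be illustrated by the sketch below, which picks exactly one item from each class so that the total cost is minimal and the total weight does not exceed a capacity. It is a generic exhaustive search, not the algorithm of [113]: the function name `solve_mckp` and the sample data are hypothetical, and the mapping of classes to processes, items to frequency levels, weights to execution times and costs to energy is an assumption made for illustration.

```python
from itertools import product


def solve_mckp(classes, capacity):
    """classes: list of classes, each a list of (weight, cost) items.

    Returns (total cost, chosen item index per class) for the cheapest
    feasible choice, or None if no choice fits within the capacity.
    """
    best = None
    for choice in product(*(range(len(items)) for items in classes)):
        weight = sum(classes[k][i][0] for k, i in enumerate(choice))
        cost = sum(classes[k][i][1] for k, i in enumerate(choice))
        if weight <= capacity and (best is None or cost < best[0]):
            best = (cost, choice)
    return best


# Two hypothetical processes, three frequency levels each: (time, energy).
levels = [
    [(2, 10), (3, 7), (5, 4)],
    [(1, 6), (2, 4), (4, 2)],
]
print(solve_mckp(levels, capacity=6))
```

Exhaustive enumeration grows exponentially with the number of classes, so it is only suitable for small instances; practical solvers rely on dynamic programming or heuristics.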

However, in order to relax this assumption, the authors also propose P-DVFS, a predictive online version of the algorithm which, similarly to the scheduling algorithms described in Section 3.2.4, does not require a priori knowledge about the application. The proposed prediction technique relies on the similarity between present and future mpi distributions. Experimental results from 11 chosen benchmark applications from the SPEC2000 [114] and ALPBench [115] benchmark suites executed on a Pentium Dual Core processor showed around 10% average power reduction as compared to the most advanced related work.

Dynamic Core and Frequency Scaling (dcfs) [116], a technique for optimizing the power/performance trade-off for multi-threaded applications, is an extension to dvfs which, apart from cpu frequency, also adjusts core counts (core throttling). The adjustments are made dynamically during the application execution, in order to optimize performance under a certain power consumption constraint. During the training phase, the application is executed for a short period of time for chosen combinations of core numbers and cpu frequencies. The optimal configuration is chosen based on measured IPS (instructions per second). The proposed technique dynamically reacts to the changes in application behavior: the IPS values are measured during the execution phase and if the value changes by a given threshold, the application is switched to the training phase for a certain period.
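The two decisions described above, selecting a configuration after the training phase and detecting when to retrain, can be sketched as follows. This is a simplified illustration and not the implementation of [116]: the function names, the relative threshold of 10% and all measurement values are hypothetical.

```python
def choose_configuration(measurements, power_cap_w):
    """measurements: (cores, frequency in GHz) -> (measured IPS, power in W).

    Returns the configuration with the highest IPS within the power cap,
    or None if no measured configuration satisfies the cap.
    """
    feasible = {cfg: ips for cfg, (ips, power) in measurements.items()
                if power <= power_cap_w}
    return max(feasible, key=feasible.get) if feasible else None


def needs_retraining(reference_ips, current_ips, threshold=0.1):
    """True if IPS deviates from the reference by more than the threshold."""
    return abs(current_ips - reference_ips) / reference_ips > threshold


training = {
    (4, 2.0): (8e9, 90.0),
    (8, 2.0): (12e9, 130.0),
    (8, 1.5): (10e9, 100.0),
    (2, 2.5): (5e9, 70.0),
}
print(choose_configuration(training, power_cap_w=100.0))
print(needs_retraining(10e9, 8.5e9))
```

In this example the configuration with the highest IPS overall exceeds the power cap, so a configuration with more cores at a lower frequency is selected instead.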

The proposed technique has been evaluated on ten benchmark applications from PARSEC [117] executed on AMD Opteron and Intel Xeon processors. Although the average performance improvement across the benchmark applications is 6%, the majority of this score depends on one poorly scalable application from the domain of enterprise storage, for which the performance improvement is 35%. There is barely any performance improvement in the cases of other applications, even for two poorly scalable ones, because they are the most memory-bound. These results mean that the proposed technique is suitable only for a specific kind of application. One of the reasons for the low performance is the overhead of the training phase. The authors consider various lengths of the training phase period, as well as IPS change thresholds. The dcfs method could benefit from a computationally cheaper way to adjust the execution parameters, for example a simplified application model and/or simulation scheme.

3.1.3 Energy/Time and Power/Time Trade-offs in Parallel Applications

A multi-objective approach to optimization would not be needed if the considered objectives were not contradicting in some cases. Existence of a trade-off between energy consumption and execution time of an application may seem counter-intuitive. One could argue that the shorter a program runs, the less energy it consumes. This was indeed the case for a discrete Fourier transform application executed on an Intel Pentium M microprocessor investigated in [118], as long as the application was optimized only via an algorithm selection software technique. As expected, the authors state that for a given voltage-frequency setting of the microprocessor, the fastest algorithm was also the lowest energy algorithm. However, in cases when voltage-frequency scaling was also considered, trade-offs between energy consumption and execution time have been reported. Namely, for some problem configurations it was possible to decrease overall energy consumption at the cost of increasing the execution time by executing the program on a lower microprocessor gear.

Energy/time trade-offs have been also shown in [119] on the example of executing programs from the Numerical Aerodynamic Simulation benchmark [120] on a power-scalable cluster using mpi. The hardware in the cluster allowed energy saving by scaling down the cpus. The decision space concerned choosing among available gears at which the processors were operating, as well as the number of processors used for execution. In the cases when the number of processors utilized by the application was fixed, there were usually single or few Pareto-optimal points in the decision space. However, when the number of used processors ranging from one to nine was a degree of freedom in the optimization, energy/time trade-offs were reported for several applications. The trade-offs vastly depended on the scalability of the programs. In some cases energy and time could be saved by executing a program on more nodes at a slower gear rather than on fewer nodes at the fastest gear. This shows that non-trivial trade-offs between energy consumption and execution time can appear during parallel execution.

These trade-offs may be even more complicated if the computing devices available in high performance computing systems are heterogeneous, the application consists of multiple tasks and execution time of a particular task can differ depending on the assigned device. Such a model has been studied in [121], where the Coefficient of Variation (COV) method [122] has been used to model a heterogeneous set of machine types and task types. Different resource allocations resulted in different makespan and energy consumption values, giving a wide set of Pareto-optimal options to choose from. Examples of Pareto fronts for a system modeled in such a way have been also presented in [87] for various numbers of computing machine types.

The authors of [121] focused on discovering the Pareto front using the NSGA-II evolutionary algorithm, but also investigated in detail how solutions in the Pareto front differ from one another. For this purpose, the authors analyzed the individual finishing times and energy consumptions for chosen points in the Pareto front. Apparently, the location of a point in the Pareto front depends on balancing of the individual tasks both in terms of execution time and energy consumption. In the case of the points with extremely low execution time, the finishing times of individual jobs were fairly balanced and relatively low, while the energy consumption values were uneven with the highest values on a relatively high level. On the other hand in the case of points with extremely low energy consumption, the task completion times were uneven, while the energy consumption levels were balanced. This observation helps to understand that in the case of highly heterogeneous systems, the trade-off between execution time and energy consumption can be partly explained as a trade-off between fair balancing of the completion times and energy consumption values of individual tasks of the application.

The relationships between execution time, power consumption and energy consumption of hpc codes have been studied in [88]. Pareto fronts for energy/time and power/time trade-offs are shown for a set of chosen hpc applications executed on chosen parallel architectures. The results include measurements from real executions of:

  • finite-difference time domain [123], sparse matrix multiplication [124] and quick sort [125] on an Intel Xeon Phi coprocessor;

  • finite-difference time domain, bi-conjugate gradient and Jacobi computation [123] on an Intel Xeon E5530 processor;

  • finite element mini-application miniFE [126] on a Vesta IBM Blue Gene/Q 10-petaflop supercomputer.

In all presented power/time charts the Pareto front consisted of several points, which proved that the power/time trade-off exists for the selected problem configurations. However, energy/time trade-off was reported only for the miniFE application, due to considering the parameter of the number of used nodes. In other experiments there was only one point in the energy/time Pareto front. This means that in these experiments there was no energy/time trade-off, because there was one solution that was optimal for both objectives.

The authors notice that since power corresponds to a rate of energy, the problem of bi-objective optimization of execution time and power consumption is clearly related to the problem of bi-objective optimization of execution time and energy consumption. Moreover, the authors prove that all points on the energy/time Pareto front have a corresponding point on the power/time Pareto front. Specifically, denoting the time, power and energy objectives by T, P and E respectively, the authors prove that $\mathcal{X}^{*}_{(T,E)} \subseteq \mathcal{X}^{*}_{(T,P)}$, where $\mathcal{X}^{*}_{(T,E)}$ is the set of Pareto-optimal points for the vector of objective functions $(T, E)$ and $\mathcal{X}^{*}_{(T,P)}$ is the set of Pareto-optimal points for the vector of objective functions $(T, P)$.
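The inclusion can be motivated by a short argument. The sketch below assumes that P denotes the average power drawn during execution, so that $E = P \cdot T$ with $T > 0$ and $P > 0$, and it writes $\mathcal{X}^{*}_{(T,E)}$ and $\mathcal{X}^{*}_{(T,P)}$ for the two Pareto sets. It is an illustration under these assumptions and not necessarily the proof given in [88].

```latex
Let $x \in \mathcal{X}^{*}_{(T,E)}$ and suppose, for contradiction, that some
configuration $y$ dominates $x$ with respect to $(T, P)$:
\[
  T(y) \le T(x), \qquad P(y) \le P(x),
  \qquad \text{with at least one inequality strict.}
\]
Since $E = P \cdot T$ and all quantities are positive, the strict inequality
carries over to the product:
\[
  E(y) = P(y)\,T(y) < P(x)\,T(x) = E(x).
\]
Hence $T(y) \le T(x)$ and $E(y) < E(x)$, so $y$ dominates $x$ with respect to
$(T, E)$, which contradicts $x \in \mathcal{X}^{*}_{(T,E)}$. Therefore
$x \in \mathcal{X}^{*}_{(T,P)}$, that is,
$\mathcal{X}^{*}_{(T,E)} \subseteq \mathcal{X}^{*}_{(T,P)}$.
```

The converse need not hold: a slow configuration with very low power draw can be Pareto-optimal for $(T, P)$ while consuming more energy than a faster one.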

The conclusion from these findings is that exploring the power/time trade-off gives richer information about the potentially favorable execution configurations, including those in the energy/time Pareto set. For this reason in this work we focus on bi-objective optimization of execution time and power consumption.
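The relation between the two Pareto sets can be illustrated numerically. The following is a minimal sketch, assuming that energy is the product of execution time and average power; the configuration values are hypothetical and are not taken from [88].

```python
from typing import List, Tuple

Point = Tuple[float, float]  # two objectives, both minimized


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b in both objectives and strictly better in one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def pareto_indices(points: List[Point]) -> List[int]:
    """Indices of points that are not dominated by any other point."""
    return [
        i for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Hypothetical measurements: (execution time [s], average power [W]).
configs = [(10.0, 200.0), (12.0, 150.0), (16.0, 120.0), (25.0, 110.0), (14.0, 210.0)]

time_power = configs
time_energy = [(t, t * p) for t, p in configs]  # assumed model: E = P * T [J]

power_front = pareto_indices(time_power)
energy_front = pareto_indices(time_energy)
```

In this toy data set the power/time front contains four configurations, while the energy/time front contains only two of them, so the former carries strictly more options for a decision maker.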

3.1.4 Pareto Optimization of Parallel Application Execution

One method of finding optimal values of decision variables in an optimization problem is exhaustive search [127], which means evaluating the values of objective functions for all possible combinations of decision variable values. This approach may be infeasible in many cases of optimization of execution time and power consumption in hpc. First, precise evaluation of the objective functions may involve actual execution of the application, which is often extremely costly. In this work we propose using the simulation method described in Section 5.2 to evaluate the objective functions at low cost.

Secondly, the decision space of the optimization problem may be high-dimensional and, thus, exhaustive search may require vast numbers of evaluations. Even in the case of using a low cost model or simulator, the number of evaluations often makes the exhaustive search method infeasible. One approach to solve this problem is to use derivative-based optimization techniques, such as gradient descent. However, this approach requires the objective functions to be differentiable. Some models of parallel application execution provide differentiable formulas for execution time and power consumption, but this is rarely the case given the complexity of contemporary parallel applications and systems.
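The growth of the number of evaluations can be seen in a small example. The sketch below enumerates a hypothetical three-dimensional decision space; the variable names, their values and the toy `evaluate` model are assumptions made only for illustration and stand in for a real execution or a simulator.

```python
from itertools import product
from math import prod

# Hypothetical decision variables and their admissible values.
decision_space = {
    "threads": [1, 2, 4, 8, 16],
    "frequency_ghz": [1.2, 1.6, 2.0, 2.4],
    "nodes": [1, 2, 4, 8],
}


def evaluate(threads: int, frequency_ghz: float, nodes: int) -> float:
    """Toy execution-time model standing in for a real run or a simulator."""
    work = 1000.0
    compute = work / (threads * nodes * frequency_ghz)
    communication = 2.0 * (nodes - 1)  # assumed linear communication overhead
    return compute + communication


names = list(decision_space)
# Exhaustive search needs one evaluation per combination of values.
num_evaluations = prod(len(values) for values in decision_space.values())

best = min(
    product(*decision_space.values()),
    key=lambda combo: evaluate(**dict(zip(names, combo))),
)
best_config = dict(zip(names, best))
```

Here only 80 evaluations are needed, but the count is the product of the numbers of admissible values, so ten variables with ten values each would already require 10^10 evaluations.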

In other cases derivative-free approaches are needed, such as genetic algorithms, particle swarm optimization etc. Their assumption is that the explicit mathematical formula behind the objective functions is unknown and that evaluations are possible only for certain points in the decision space. The number of evaluations is treated as the main computational cost of the algorithm. A simulator for evaluating the objective functions, such as the one proposed in this work, can be used as the evaluation function in derivative-free algorithms.

For example, the framework for multi-objective auto-tuning proposed in [30] (described more broadly in Section 3.3.3) allows computing the Pareto set for the trade-off between execution time and percentage of hardware utilization of parallel codes. The optimizer uses a combination of compiler and runtime techniques and the decision parameters include tile sizes in loop tiling, thread counts and choosing between code versions. The authors claim that in the considered testbed, the Pareto set is prohibitively large, making exhaustive search impossible. A simple genetic algorithm is proposed, but still the number of required steps is too large to represent a viable option. Finally, the differential evolution gde3 [128] algorithm is used in the optimization phase. At runtime, the system can choose an optimal configuration from the computed Pareto set using weights provided for each optimization goal.
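The runtime selection step can be sketched as follows. The text above does not specify how the framework in [30] combines the weights, so the weighted sum over min-max normalized objectives used here is an assumption, and the function name `select_configuration` and the sample Pareto set are hypothetical.

```python
from typing import Dict, List, Sequence


def select_configuration(
    pareto_set: List[Dict[str, float]],
    objectives: Sequence[str],
    weights: Sequence[float],
) -> Dict[str, float]:
    """Pick the Pareto-optimal entry with the lowest weighted, normalized score."""
    lows = {o: min(c[o] for c in pareto_set) for o in objectives}
    highs = {o: max(c[o] for c in pareto_set) for o in objectives}

    def score(config: Dict[str, float]) -> float:
        total = 0.0
        for objective, weight in zip(objectives, weights):
            span = highs[objective] - lows[objective]
            # Min-max normalization; a constant objective contributes nothing.
            normalized = (config[objective] - lows[objective]) / span if span else 0.0
            total += weight * normalized
        return total

    return min(pareto_set, key=score)


# Hypothetical precomputed Pareto set: execution time [s] vs. hardware utilization [%].
pareto_set = [
    {"threads": 32, "time": 10.0, "utilization": 100.0},
    {"threads": 16, "time": 14.0, "utilization": 60.0},
    {"threads": 8, "time": 24.0, "utilization": 35.0},
    {"threads": 4, "time": 45.0, "utilization": 20.0},
]

fast = select_configuration(pareto_set, ("time", "utilization"), (0.9, 0.1))
frugal = select_configuration(pareto_set, ("time", "utilization"), (0.1, 0.9))
```

Because the Pareto set is computed beforehand, changing the weights at runtime only requires re-scoring a handful of stored configurations rather than repeating the optimization.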

Pareto fronts have also been explored to trade off energy and performance in a heterogeneous hpc system where multiple optimal solutions resulted from different task assignments [129]. The estimated time to complete each task on each machine was assumed to be given in an etc matrix (generated randomly for the experiments). The finishing times and energy consumption were modeled by strict mathematical formulas, so evaluating one solution was relatively cheap computationally. Still, in the testbed with 1100 tasks of 30 types and 36 machines of 9 types, exhaustive search was infeasible for the scheduling process, which needs to be fast in order not to add a significant overhead to the overall processing time.
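Evaluating one assignment in such a matrix-based model can be sketched in a few lines. The matrices below are hypothetical, and the energy formula (execution time multiplied by an assumed per-task power draw, ignoring idle power) is a simplification chosen for illustration rather than the exact model of [129].

```python
from typing import Sequence, Tuple

# Hypothetical matrices: rows are tasks, columns are machines.
ETC = [  # estimated time to complete [s]
    [4.0, 8.0],
    [3.0, 5.0],
    [6.0, 9.0],
]
APC = [  # assumed average power draw while running the task [W]
    [100.0, 40.0],
    [120.0, 50.0],
    [90.0, 45.0],
]


def evaluate(assignment: Sequence[int]) -> Tuple[float, float]:
    """Return (makespan, energy) when task i runs on machine assignment[i]."""
    num_machines = len(ETC[0])
    finish = [0.0] * num_machines
    energy = 0.0
    for task, machine in enumerate(assignment):
        finish[machine] += ETC[task][machine]  # tasks on one machine run back to back
        energy += ETC[task][machine] * APC[task][machine]
    return max(finish), energy


all_on_fast = evaluate([0, 0, 0])
balanced = evaluate([0, 1, 0])
all_on_frugal = evaluate([1, 1, 1])

# Size of the decision space for the testbed quoted above: 36 machines, 1100 tasks.
num_assignments_digits = len(str(36 ** 1100))
```

A single evaluation is cheap, but the number of possible assignments for 1100 tasks on 36 machines has more than 1700 decimal digits, which is why enumeration is out of the question even with a closed-form model.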

For this reason, as an alternative to the exhaustive search solution, the NSGA-II algorithm [130] has been used, which is a popular adaptation of the genetic algorithm optimized to find the Pareto front for a multi-objective optimization problem. The algorithm modifies the fitness function of the genetic algorithm to work well for discovering the Pareto front. An important phase of a genetic algorithm is seeding the initial population. Employing the basic seeding method using the optimal energy solution, suboptimal minimum makespan solution and a random initial population, the algorithm needed hours of computations to discover a reasonable approximation of the Pareto front in [129]. The authors proposed a different seeding strategy for generating configurations with full allocations. This modification allowed the optimization algorithm to achieve significantly closer approximations of the Pareto front in just dozens of seconds.

It should be stressed that in the cases where defining a strict mathematical formula for the optimization objectives is possible, there is often no need to use a simulator and search for the optimal solutions either by exhaustive search or evolutionary algorithms. For example, the NSGA-II algorithm has been significantly outperformed by a linear programming solution in [87] for multi-objective optimization of energy and makespan of Bag of Tasks applications. However, using this technique requires defining linear objective functions for the considered objectives. As discussed in Section 2.3, in this work we focus on cases where exact formulas for the optimization objectives are unknown and, thus, we consider the evolutionary algorithms for potential utilization in the proposed optimization methodology.

3.2 Energy-aware Resource Management in Heterogeneous HPC Systems

Mapping processes to computing devices is an important part of the optimization methodology proposed in this thesis. In this section we provide background for our work in the field of resource management. In Section 3.2.1 we introduce the global static task mapping problem, which is an important part of the problem formulation proposed in Chapter 1. We discuss chosen scheduling optimization solutions focusing on network topology in Section 3.2.2. The remaining related work is divided into solutions with strong a priori knowledge about the optimized application in Section 3.2.3 and with limited knowledge in Section 3.2.4.

3.2.1 Global Static Task Mapping Problem

Resource management has been a crucial topic in distributed computing for many years. Numerous different problem formulations have been stated, also in other fields, such as control theory, operations research and production management. Comparing the approaches has become hard because of their vast number and essential differences resulting from particular setups and applications. In order to achieve categories of comparable approaches, a taxonomy of distributed scheduling approaches has been proposed in [131].

According to this taxonomy, the approach proposed in this thesis is global, because it considers where to execute a process and assumes that local scheduling is the task of the operating system of the computing device. This local scheduling is connected with assigning processor time to processes, as well as optimizing the utilization of device internals. These tasks are solved by the operating system and increasingly often by internal schedulers in the devices. This work does not explore the details of these tasks, focusing on the problem of global scheduling.

Global scheduling problems are divided in the taxonomy into static and dynamic scheduling problems. In the case of static scheduling, information regarding the total mix of possible processes in the system is available before the application execution. In this sense, the approach proposed in this thesis considers a static scheduling problem, because a static schedule in the meaning of assignment of processes to computing devices is fixed at the beginning of the execution.

Global static scheduling problems are divided in the taxonomy into optimal and suboptimal problems. As indicated in [131]: "In the case that all information regarding the state of the system as well as the resource needs of a process are known, an optimal assignment can be made based on some criterion function. Examples of optimization measures are minimizing total process completion time, maximizing utilization of resources in the system, or maximizing system throughput. In the event that these problems are computationally infeasible, suboptimal solution may be tried". In this sense, the approach proposed in this thesis is suboptimal, because not all information about the processes is known a priori. For example, there is no estimation of execution time or resource needs of each process. We argue that for many applications it is hard or impossible to prepare a feasible criterion function, which could be used for preparing an optimal assignment.

Global static suboptimal scheduling problems are divided in the taxonomy into heuristic and approximate problems. Heuristic algorithms make use of special parameters which are correlated to system performance in an indirect way, and such alternate parameters are much simpler to monitor or calculate. Such heuristic algorithms make the most realistic assumptions about a priori knowledge concerning process and system loading characteristics. The assumptions can be made in approaches to optimization of specific applications in specific systems. In this thesis we propose a general approach which cannot include such assumptions. Hence, it is an approximate approach, in which "instead of searching the entire solution space for an optimal solution, we are satisfied when we find a 'good' one", based on certain evaluation metrics.

According to [131], important factors that determine if the suboptimal-approximate approach is worthy of pursuit include availability of a function to evaluate a solution and the time required to evaluate it. Results from real execution of a parallel application can be such a function. In cases when the factor of evaluation time is prohibitively large, we propose using functions based on modeling and simulation, as described in Section 5.2.

To sum up, the approach proposed in this thesis gives a suboptimal-approximate solution to the global static scheduling problem, also known as task scheduling, task mapping or task allocation. The problem of task partitioning among heterogeneous multiprocessors has been proven NP-hard in [132].

In [133], the resource management process was divided into three stages:

  • scheduling – deciding when each job should run;

  • allocation – determining which nodes take part in the computation;

  • task mapping – matching the job to individual computational elements (nodes/cores).

The authors assumed that the first two stages are usually done at the system level and focused on improving the task mapping stage, which in that case meant mapping tasks to mpi ranks.

In this nomenclature, the approach proposed in this thesis focuses on the task mapping stage which includes allocation, because assigning no processes to a device means not allocating the device. In this sense, scheduling of operations in the proposed approach depends on the process implementations given in the application model. According to [100], the problem of mapping each task of an application onto the available heterogeneous resources in order to minimize the application runtime is known to be NP-hard.

3.2.2 Network-aware Scheduling

One of the important factors that influence execution of parallel applications in hpc systems is network topology. In [78], the authors point out that the traditional macro-dataflow model of application execution was inconvenient, because it assumed unlimited network resources, allowing simultaneous communications on a given link. They propose a communication-aware and one port model in order to take into account the influence of network topologies on the scheduling algorithms. For example, the Data Intensive and Network Aware (diana) scheduling technique proposed in [134] takes into account not only data and computation power, but also network characteristics such as bandwidth, latencies, packet loss, jitter and anomalies of network links.

Network-aware scheduling is still an active branch of scheduling studies. For example, the objective for scheduling optimization in [71] is defined as minimization of numbers of hops in shortest paths between devices. This problem is called topology-aware task mapping or just topology mapping.

Resource scheduling in data centers with two-tiered and three-tiered network architectures has been studied in [135]. The authors propose a topology-aware scheduling heuristic and demonstrate its performance using the GreenCloud [136] packet-level simulator.

In [77] the response time of a real-time stream computing environment is optimized by minimizing the latency of a critical path in a dag representing the application. The proposed Re-Stream solution has been verified in a simulation environment based on the Storm [137] platform.

A simulator for evaluating the fitness of the intermediate solutions in an optimization algorithm was used for example in [133]. The authors propose a local search algorithm which tries swapping pairs of tasks in order to minimize the application execution time by reducing the number of network hops. Fitness of the solutions is measured by a simulator [138] and a swap is preserved if it decreases the number of hops. The proposed solution, aimed for applications with stencil communication patterns, has been proven useful on the example of a shock physics model application executed on a Cray XE6 system.
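The pairwise-swap idea can be sketched as follows. In [133] the fitness is obtained from a simulator; in this simplified illustration it is replaced by a direct hop count on an assumed 2D mesh, and the function names, the mesh and the communication pattern are all hypothetical.

```python
from itertools import combinations
from typing import List, Tuple

Coord = Tuple[int, int]


def hops(a: Coord, b: Coord) -> int:
    """Manhattan distance between two nodes of an assumed 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def total_hops(mapping: List[int], nodes: List[Coord], edges: List[Tuple[int, int]]) -> int:
    """Sum of hop counts over all communicating task pairs."""
    return sum(hops(nodes[mapping[u]], nodes[mapping[v]]) for u, v in edges)


def swap_local_search(mapping, nodes, edges):
    """Repeatedly swap task pairs, keeping a swap only if it reduces total hops."""
    mapping = list(mapping)
    best = total_hops(mapping, nodes, edges)
    improved = True
    while improved:
        improved = False
        for i, j in combinations(range(len(mapping)), 2):
            mapping[i], mapping[j] = mapping[j], mapping[i]
            cost = total_hops(mapping, nodes, edges)
            if cost < best:
                best, improved = cost, True
            else:
                mapping[i], mapping[j] = mapping[j], mapping[i]  # undo the swap
    return mapping, best


# 2x2 mesh of nodes and four tasks communicating in a ring (a 1D periodic stencil).
nodes = [(0, 0), (0, 1), (1, 0), (1, 1)]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
initial = [0, 3, 1, 2]  # task i is placed on node initial[i]

initial_cost = total_hops(initial, nodes, edges)
final_mapping, final_cost = swap_local_search(initial, nodes, edges)
```

In this example the initial placement costs 6 hops and the search reaches 4 hops, which is the minimum for a four-task ring, since every communicating pair is at least one hop apart. In general such a local search may stop in a local minimum.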

In the model proposed in this thesis, network topology can be taken into account in the system model through proper implementation of the commtime function (see Section 5.2). Network topology also plays an important role in experiments with real application execution regarding network-aware optimizations described in Sections 6.1.1 and 6.1.2.

3.2.3 Heuristics Based on Strong a Priori Assumptions

There are numerous approaches to task scheduling which assume that the etc matrix is given [139, 140, 75, 79, 84, 85, 70]. The matrix contains the expected execution times of each task on each processor. Variations of this assumption are sometimes used, for example a computation cost matrix [141]. A similar approach is often used towards power consumption, for example the authors of [87] assume that the apc matrix is given.

In such problem frameworks, meta-heuristics are often used, including genetic algorithms [139, 140, 75], tabu search [140], simulated annealing [140], A* [140] and shuffled frog-leaping [84]. The objective is to obtain an optimal schedule, namely the assignment of tasks to processors as well as determining order of execution within each processor. A thorough review of traditional and energy-aware algorithms based on etc matrix has been provided in [70]. The authors also propose two new scheduling algorithms which introduce a task migration phase for minimizing the makespan and energy consumption of the application.

The task mapping problem has been also considered in the topology-aware context in [71]. The authors propose two graph-theoretic mapping algorithms: a generic one with inter-node and intra-node mapping and a recursive bipartitioning one for torus network topology, which takes into account compute node coordinates.

Focusing on the aspect of heterogeneous cpus, paper [142] proposes the Heterogeneity Aware Meta-scheduling Algorithm (HAMA), claimed to reduce energy consumption by between 23 and 50%. The grid meta-scheduler described in the paper collects information about the grid infrastructure and users, and periodically passes it to the HAMA algorithm. Based on parameters such as average cooling system efficiency, cpu power, frequency and computation time slots, the algorithm first selects the most energy efficient resources. What is more, if possible, it utilizes the Dynamic Voltage Scaling capabilities of the cpus.

3.2.4 Approaches With Limited a Priori Knowledge

The model proposed in this thesis is especially useful for modeling applications for which, at a given granularity level, it is hard or impossible to estimate the exact graph of individual tasks and communication dependencies between them, before actual execution of the application. Chosen approaches that also assume limited a priori knowledge about the application and its processes are described in this section. For example authors of [83] notice that task execution times in large heterogeneous computing systems may vary due to factors which are hard to incorporate into the model, like cache misses or data dependence of the execution times. They propose a stochastic measure for minimizing the probability of violating the makespan and energy constraints. This robust measure is used as the objective function for various heuristic algorithms, including tabu search and genetic algorithm with local search.

The application model in [82] is a Bag of Tasks, which could have different execution time for different inputs. Because of that, the authors interpret task execution times as random variables and consider stochastic task scheduling. They propose algorithms with an objective to improve the weighted probability of meeting both deadline and energy consumption budget constraints. The proposed algorithms perform significantly better than the traditional heuristics in an experimental setting with dvfs-enabled hcs, for both randomly generated BoT applications and real-world multimedia applications.

Authors of [143] propose a simulated annealing approach to optimizing task allocation in a grid environment with respect to execution time. Comparison to an ad-hoc greedy scheduler shows that in certain cases the simulated annealing approach makes it possible to avoid local minima in the optimization. During the optimization process, the solutions are evaluated using a hand-crafted performance model. The approach is verified on a simplistic testbed consisting of 15 machines, running a parallel numerical solver application. The authors emphasize that the usefulness of their approach depends vastly on the accuracy of the performance model, for which the simulation method proposed in this thesis might be a convenient replacement.
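A generic simulated annealing loop for task allocation can be sketched as below. This is not the algorithm of [143]: the makespan-based performance model, the single-task move, the geometric cooling schedule and all parameter values are assumptions made for illustration only.

```python
import math
import random
from typing import List


def makespan(allocation: List[int], cost: List[List[float]]) -> float:
    """Performance model: the load of the busiest machine."""
    load = [0.0] * len(cost[0])
    for task, machine in enumerate(allocation):
        load[machine] += cost[task][machine]
    return max(load)


def anneal(cost, steps=5000, t_start=10.0, t_end=0.01, seed=0):
    """Simulated annealing over task-to-machine allocations (illustrative only)."""
    rng = random.Random(seed)
    num_tasks, num_machines = len(cost), len(cost[0])
    current = [0] * num_tasks  # start with everything on machine 0
    current_cost = makespan(current, cost)
    best, best_cost = list(current), current_cost
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / (steps - 1))
        candidate = list(current)
        candidate[rng.randrange(num_tasks)] = rng.randrange(num_machines)
        candidate_cost = makespan(candidate, cost)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
    return best, best_cost


# Hypothetical execution times of 8 tasks on 3 machines of different speeds.
base = [5.0, 3.0, 8.0, 2.0, 7.0, 4.0, 6.0, 1.0]
cost = [[t, 1.5 * t, 2.0 * t] for t in base]

initial_cost = makespan([0] * len(cost), cost)
best_allocation, best_cost = anneal(cost)
```

Occasionally accepting worse allocations at high temperature is what lets the search leave local minima that would trap a purely greedy scheduler. The quality of the result still depends entirely on how well `makespan` reflects the real system.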

The uncertainty about the processes has been also considered in the field of cloud computing. The authors of [76] notice that the existing efficient schedulers require a priori information about the processes and ignore cluster dynamics like pipelining, task failures and speculative execution. They propose a new scheduling algorithm for minimizing average coflow completion time (cct) in data centers by prioritizing the processes (coflows) across a small number of priority queues. The coflows are separated into the queues based on their past activity in the cluster. The solution is proven efficient by experiments run on 100-machine EC2 clusters.

Lack of knowledge about the task processing times has been also studied in the context of game theoretic approach to distributed scheduling [144]. The considered problem is a scheduling game, where each player owns a job and chooses a machine to execute it. Even if there exists an equilibrium in this game, the global cost (makespan) might be significantly larger than in the optimal scenario. There exist policies that reduce the price of anarchy, but typically they have access to the announced execution times of all tasks in the system. Policies studied in the paper are non-clairvoyant, which means that they assume that the task processing times are private for the players and, hence, not available a priori.

3.3 Parameter Auto-tuning in Parallel Applications

A significant part of the parallel application optimization methodology contributed in this thesis can be described in terms of parameter auto-tuning. In this section we provide background for our work in this field. In Section 3.3.1, the problem solved within this thesis is classified as a problem of offline auto-tuning of system parameters. Then, chosen approaches to parallel application auto-tuning are described, divided into those involving exhaustive search of the optimization search space in Section 3.3.2 and those involving combinatorial search in Section 3.3.3.

3.3.1 Offline Auto-tuning of System Parameters Problem

According to the proceedings of a recent seminar in the field of automatic application tuning for hpc architectures [127], approaches to application auto-tuning can be divided into black-box and white-box. The search process in white-box algorithms can be guided, because there is some a priori understanding of the underlying problem. Notable examples of white-box auto-tuning approaches are ATLAS [145] for automatic tuning of linear algebra applications and Spiral [118] for automatic generation of linear transform implementations. In the ELASTIC [146] environment for large scale dynamic tuning of parallel mpi programs, the knowledge required to guide the auto-tuning process is integrated as plugins which implement an API for modeling performance and abstraction models of the application.

Many auto-tuning solutions focus on optimizing parallel programs by choosing between multiple alternatives of semantically equivalent but syntactically different versions of a program. Two ways of source code adaptation are distinguished in [145]. The first one is to supply various hand-tuned implementations and allow the optimization algorithm to choose between them. The second method is automatically generating the code by using manual transformations or compiler options.

For example, an auto-tuning framework introduced in [147] is able to parse Fortran 95 codes in order to extract Abstract Syntax Tree (ast) representations of stencil computations. The framework generates multiple versions of optimized stencil codes by multiple transformations of the ast code representation. The results of the transformations depend on a number of serial and parallel optimization parameters. In order to achieve a feasible subset of parameter spaces, architecture-aware strategy engines are used. Then, an auto-tuner performs exhaustive search on the limited parameter space. Additionally, the framework allows migrating existing Fortran codes to emerging parallel APIs such as CUDA. Focus on stencil computation has been also put in the PATUS framework [81], which allows generating code of stencil computation kernels from initial codes called specifications. The framework automatically distinguishes regions of the code which contain so-called operations responsible for the stencil computations and generates different versions of these regions.

In contrast to white-box approaches, in black-box approaches there is an assumption that the only knowledge about the optimized application can be obtained through evaluating an instance of parameter set, and not through analysis of its code. In this sense, the execution steps proposed in this thesis in Section 5.1 consist of both white-box and black-box steps. The first step, process optimization, is a white-box optimization step, where an analysis of processes in the application can be made in terms of the underlying operations and modifications of the operation sequences can be made. After this step has finished, the succeeding steps solve a black-box optimization problem, because no further modifications of the processes are allowed.

Similarly to scheduling approaches described in Section 3.2.2, there are also tuning approaches that stress the importance of network interconnect. Authors of [148] argue that optimization of application execution in next generation large-scale platforms, especially for energy efficient performance, should not only use cpu frequency scaling, but could also benefit from tuning other platform components, for example network bandwidth scaling. The paper exploits power measurement capabilities of Cray XT architecture and proposes a static tuning approach which demonstrates energy savings of up to 39%. It is noted that a dynamic approach is also an important area of investigation, though it is challenging due to reliability issues and overhead of frequency state transitions.

According to [104], execution parameters such as numbers of threads, their affinity, processing frequency and work-group/grid sizes of gpu applications are equally important tuning parameters, called system parameters. Even the mapping of threads onto physical cores (discussed in more detail in Section 3.2) can be considered as a part of the parallel application auto-tuning process. Two classes of auto-tuning approaches are distinguished in the paper: offline and online. In the offline version the program is tuned before running it in production mode. The online version has challenging aspects: because it runs during the application execution, it implies performance overhead and makes the execution more exposed to possibly poorly performing parameter configurations. The optimization approach proposed in this thesis is based on evaluating multiple execution configurations before the actual execution, thus in the sense of this classification it focuses on the problem of offline auto-tuning of system parameters, which we refer to as application execution parameters and process mappings. Using simulation for re-evaluating certain application execution parameters during the actual application execution can potentially be beneficial for specific types of application and is an interesting direction for future work.

3.3.2 Auto-tuning Approaches with Exhaustive Search

A plugin-based approach has been used in the European AutoTune project to extend the PERISCOPE [149] performance analysis tool by a number of tuning plugins, producing the Periscope Tuning Framework (ptf) [150]. The plugins may employ expert knowledge or machine learning to perform multi-aspect application tuning with regard to energy consumption, inter-process communication, load balancing, data locality, memory access and single core performance. The tuning process starts with preprocessing C/C++ or Fortran source code files using mpi or OpenMP in order to distinguish code regions and parameters that may influence their performance. For each code region, tuning scenarios defined by the plugins perform search strategies in the parameter search space in order to minimize a tuning objective, defined as a function which may take into account measurements like execution time and energy consumption.

The applicability of the ptf framework was presented in [150] on the examples of the following plugins:

  • maximizing throughput of high-level pipeline patterns written in C/C++ with OpenMP pragma-annotated while loops, executed on single-node heterogeneous manycore architectures using StarPU [80] for execution on cpus and gpus. The main tuning parameters were stage replication factors and buffer sizes;

  • minimizing execution time of HMPP codelets - computational units written in C or Fortran, annotated with directives which allow the special CAPS compiler to translate them to hardware-specific languages such as CUDA and OpenCL. Considered tuning parameters were connected with the codelet internals such as unrolling factors, grid sizes, loop permutations and also target-specific variables and callbacks available at runtime;

  • minimizing energy consumption of applications executed on shared memory processors with cpu frequency scaling. The tuned parameters were energy efficiency policies and used cpu frequencies;

  • minimizing execution time of mpi spmd programs by tuning mpi annotated code variants and environment parameters including numbers of executed tasks, task affinity, communication buffer sizes and message size limits;

  • reducing execution time of sequential programs by tuning the selection of compiler flags.

The PTF framework has been also used in [151] to minimize execution time of mpi programs by tuning the parameters of MPI-IO communication interface. The proposed PTF plugin aimed for automatically optimizing the values of selected MPI-IO hints and mpi parameters, which are normally optimized by programmers who have a deep understanding of the application behavior on the target system. The authors state that because of high dimensions, the space of tuning parameters still needs to be restricted using expert knowledge. Exhaustive search is used to find the optimal parameter values. Exploring more elaborate search algorithms as well as parallel application models is listed as future work.

The importance of application auto-tuning has been stressed in the context of code portability in [152], where an OpenCL implementation of convolutional layers for deep neural networks is proposed. Codes in OpenCL can be executed without changes on various hardware by compiling them using local compilers dedicated to certain computing architectures. However, usually due to the differences between architectures, in order to develop a highly efficient implementation, one needs to take into account specific coding practices and low-level details. The authors propose to implement the kernel in a tunable way, accepting size of the input images, filters and computing thread work-groups for each layer of the optimized neural network as inputs. The approach achieves full portability of the kernels without the need to develop multiple specific implementations, while maintaining good performance. For the auto-tuning problem, the optimization space is searched exhaustively, however automatic space pruning is done, so that only nearly 20% of the configurations are tested. Auto-tuning a single layer out of the five layers of the neural network takes about one hour, which is claimed not to be a significant overhead compared to the entire network training time that could take weeks of repeatedly running the same set of kernels.

Often the search space of an auto-tuning problem is high dimensional and prohibitively large for exhaustive search. One approach to auto-tuning in such situations is to explicitly prune the search space. For example, a search space reduction procedure has been proposed for auto-tuning of a parallel implementation of the 2D MPDATA EULAG algorithm, executed in a hybrid CPU-GPU architecture [22]. The algorithm consists of 16 stages linked with non-trivial data dependencies and the implementation consists of 16 kernels. The parameters that constitute the search space in the auto-tuning problem have been divided into two groups. The parameters in the first group create a local search space for each kernel individually, and include work-group sizes and sizes of vectors for vectorization. The second group consists of specific parameters related to the entire algorithm (this division of parameters is similar to the one proposed in this thesis, with the local parameters resembling execution parameters and algorithm-related parameters resembling application parameters). The size of the global search space defined by ranges of all applicable parameters is above 524 million combinations, which makes testing all the configurations unacceptably expensive. The authors provide a group of methods which radically reduce the search space by applying certain domain-specific constraints. This prunes the search space to over 379 thousand and 965 thousand combinations for the ATI Radeon and NVIDIA GPUs respectively. Then, the auto-tuning mechanism evaluates all configurations in the pruned search space to select the best configuration, corresponding to the shortest execution time.
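The prune-then-enumerate idea described above can be sketched as follows. This is a minimal illustration, not the procedure from [22]: the parameter names, their ranges, the feasibility constraint and the cost function are all hypothetical stand-ins for real kernel parameters and measured execution times.

```python
from itertools import product

# Hypothetical tuning parameters; names and ranges are illustrative only.
SPACE = {
    "work_group_size": [32, 64, 128, 256],
    "vector_width": [1, 2, 4, 8],
    "tile": [8, 16, 32],
}

def is_feasible(cfg):
    # Assumed domain-specific constraint: a work-group must hold a whole
    # number of vectorized tiles.
    return cfg["work_group_size"] % (cfg["tile"] * cfg["vector_width"]) == 0

def toy_cost(cfg):
    # Stand-in for a measured execution time (lower is better).
    return (abs(cfg["work_group_size"] - 128)
            + 10 * abs(cfg["vector_width"] - 4)
            + abs(cfg["tile"] - 16))

def exhaustive_search(space, feasible, cost):
    """Evaluate every configuration that survives pruning; return the best."""
    names = list(space)
    best_cfg, best_cost, evaluated = None, float("inf"), 0
    for values in product(*(space[n] for n in names)):
        cfg = dict(zip(names, values))
        if not feasible(cfg):      # pruned: never evaluated
            continue
        evaluated += 1
        c = cost(cfg)
        if c < best_cost:
            best_cfg, best_cost = cfg, c
    return best_cfg, best_cost, evaluated

best_cfg, best_cost, evaluated = exhaustive_search(SPACE, is_feasible, toy_cost)
```

In this toy space the constraint removes 10 of the 48 combinations before any evaluation takes place; in a real auto-tuner each call to the cost function would be a full application run, which is why pruning matters.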

3.3.3 Auto-tuning Approaches with Combinatorial Search

In many cases when search space pruning is infeasible, approaches alternative to exhaustive search are used, in which only chosen combinations of the parameters are evaluated. This section discusses chosen approaches that use such methods, which are called combinatorial search methods.

The Insieme Compiler and Runtime infrastructure (http://insieme-compiler.org) has been used in [30] as a test platform for tuning loop tiling in cache-sensitive parallel programs. A combination of compiler and runtime techniques allows tuning parameters of code regions, such as tile sizes, loop ordering and unrolling factors. An optimizer is proposed, which generates multiple application configurations and evaluates them by running the programs on the targeted platform in order to find optimal solutions. However, because the solutions are evaluated by real program executions and the parameter space is large, it is impossible to perform an exhaustive search evaluating all the parameter combinations. To address this problem, an RS-GDE3 search algorithm based on Differential Evolution and rough set theory is proposed.

Approximate optimal values of the application parameters are found by the Generalized Differential Evolution (GDE3) algorithm, which allows decreasing the search time by evaluating each of the configurations from a population in parallel. The rough sets method proposed in [153] is used to reduce the search space in every iteration of the search algorithm. The solution is evaluated on a case study of a nested loop matrix multiplication application on a target platform employing 10-core Xeon E7-4870 processors. The proposed search algorithm finds solutions similar to those of an exhaustive brute-force search, but uses from 90% to 99% fewer evaluations.
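To give an intuition for population-based search of this kind, the sketch below implements plain single-objective differential evolution (the classic DE/rand/1/bin scheme). It is deliberately simpler than RS-GDE3: there is no multi-objective handling and no rough-set search space reduction, and the cost function is a made-up stand-in for a measured execution time over two hypothetical parameters (a tile size and an unrolling factor).

```python
import random

def differential_evolution(cost, bounds, pop_size=20, generations=150,
                           f=0.5, cr=0.9, seed=0):
    """Minimize cost over a box using DE/rand/1/bin (illustrative only)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [cost(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Three distinct individuals, all different from the target i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one mutated coordinate
            trial = []
            for d, (lo, hi) in enumerate(bounds):
                if d == forced or rng.random() < cr:
                    v = pop[a][d] + f * (pop[b][d] - pop[c][d])
                else:
                    v = pop[i][d]
                trial.append(min(max(v, lo), hi))  # clip to the bounds
            s = cost(trial)
            if s <= scores[i]:  # greedy one-to-one replacement
                pop[i], scores[i] = trial, s
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]

def toy_time(x):
    # Hypothetical cost surface with its minimum at tile=48, unroll=6.
    tile, unroll = x
    return (tile - 48) ** 2 + (unroll - 6) ** 2

best, score = differential_evolution(toy_time, [(1, 256), (1, 16)])
```

The trial vectors of one generation are independent of each other, which is what makes the parallel evaluation of a whole population, mentioned above, possible.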

The Insieme infrastructure and the RS-GDE3 search algorithm have been used for multi-objective optimization of parallel applications also with regard to energy consumption [103]. The solution was tested on matrix multiplication and linear algebra applications, stencil codes and n-body simulation executed in a shared-memory system with an 8-core Intel Xeon E5-4650 Sandy Bridge EP processor. The proposed algorithm outperformed chosen general-purpose multi-objective optimization algorithms such as hierarchical and random search and NSGA-II [130].

The 8-core Intel Xeon E5-4650 processors were also used for a comparison of the RS-GDE3 search algorithm with mono-objective auto-tuners using local search, simulated annealing, genetic algorithm and NSGA-II based on an n-body simulation application [104]. Loop tile sizes, thread numbers and processor clock frequencies were the tunable parameters. These experiments confirmed superiority of RS-GDE3 in the multi-objective setup.

Application-specific parameters have been tuned in [29] to optimize performance of a GS2 physics application for studying low frequency turbulence in magnetized plasma. Unlike in our work, the considered problem is online tuning, focused on performance variability. The tuned parameters can be changed during the program execution and optimal parameter values can change during the runtime. The PRO (Parallel Rank Ordering) algorithm is proposed as an alternative to the traditional Simplex algorithm, which is claimed to have unpredictable performance in the case of tuning more than one parameter. The proposed algorithm belongs to a class of direct search algorithms known as GSS methods and is resilient to performance variability.

Execution time of large-scale dataset processing applications with Apache Hadoop is optimized by parameter tuning in [97]. Out of more than 130 configuration parameters of the Hadoop MapReduce system, the 20 that most affect the system performance have been chosen arbitrarily. These parameters specify how the data should be processed in each phase of the MapReduce job execution with regard to parallelism, memory capacity, job flow and data compression. The impact of these features on the application execution time is identified using the random forest feature importance method. The five most influential parameters are chosen for optimization in the experiments.

In the optimization process, the application configuration parameters are repeatedly modified by an optimizer and evaluated by a predictor, a model that predicts median, standard deviation and wave of the application execution time, as described in Section 2.3.4. The optimizer uses the predicted values as input to machine learning ensemble regression methods. The authors assume adequate efficiency of the predictor and solve the black-box optimization problem using the RHC method (combination of Random Sampling and Hill-Climbing). A high dimensional space is used as the feasible set of parameters. In the exploration phase, the parameter space is examined by random sampling to find areas with high probability of approximately optimal parameters. In the exploitation phase, the hill-climbing algorithm is used to search these areas more deeply.
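The two-phase structure of such a search (random sampling for exploration, hill climbing for exploitation) can be sketched as below. This is not the implementation from [97]: the parameter names are hypothetical rather than real Hadoop configuration keys, and the predictor is replaced by a toy function that stands in for a trained regression model.

```python
import random

# Hypothetical parameters with ordered candidate values (not Hadoop keys).
SPACE = {
    "map_tasks": [2, 4, 8, 16, 32],
    "sort_buffer_mb": [50, 100, 200, 400],
    "compress_output": [False, True],
}

def predicted_time(cfg):
    # Toy stand-in for a predictor of the application execution time.
    return (abs(cfg["map_tasks"] - 16) / 16
            + abs(cfg["sort_buffer_mb"] - 200) / 200
            + (0.0 if cfg["compress_output"] else 0.5))

def decode(space, idx):
    """Translate a dict of value indices into a concrete configuration."""
    return {name: space[name][i] for name, i in idx.items()}

def neighbours(space, idx):
    """Configurations that differ by one step in exactly one parameter."""
    for name in space:
        for step in (-1, 1):
            j = idx[name] + step
            if 0 <= j < len(space[name]):
                yield {**idx, name: j}

def rhc(space, cost, samples=10, seed=0):
    rng = random.Random(seed)
    score = lambda idx: cost(decode(space, idx))
    # Exploration phase: random sampling of the parameter space.
    candidates = [{n: rng.randrange(len(v)) for n, v in space.items()}
                  for _ in range(samples)]
    current = min(candidates, key=score)
    # Exploitation phase: hill climbing from the best sampled point.
    while True:
        best_nb = min(neighbours(space, current), key=score)
        if score(best_nb) >= score(current):
            return decode(space, current), score(current)
        current = best_nb

best_cfg, best_time = rhc(SPACE, predicted_time)
```

Because every candidate is scored by the predictor instead of a real job execution, the cost of the search is dominated by model inference, which is the point of combining an optimizer with a predictor.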

The solution is evaluated on original MapReduce jobs such as TeraSort and WordCount, text processing and hive-aggregation, executed on an 8-node cluster with Intel i7 cpus. The approximately optimal parameter settings found by the proposed algorithm enable up to 8.8-fold improvement of execution times as compared to the default values of the parameters. The authors claim that this automatic tuning approach is useful in real life setups because it is hard for the system administrators to set the multiple parameters by hand.

4.1 Investigated Hybrid Parallel Applications

In this thesis, big emphasis is put on the practical use of the proposed contributions. For this reason, we aimed to base our experiments on diverse applications that are useful in real life. In this section we describe the applications developed for the sake of this thesis, including MD5 hash breaking in Section 4.1.1, regular expression matching in Section 4.1.2, geostatistical interpolation of air pollution values in Section 4.1.3, large vector similarity measure computation in Section 4.1.4 and training deep neural networks for automatic speech recognition in Section 4.1.5. A description of each application is given, including the specification in what way the application is hybrid.

4.1.1 Multi-level Heterogeneous CPU/GPU MD5 Hash Breaking

The MD5 hash breaking application has been used as a verifying application for the framework for automatic parallelization of computations in heterogeneous HPC systems co-developed by the author of this thesis and described in [13]. The purpose of the application is to retrieve a lost password, given its MD5 hash. The brute-force attack method is used, which means performing the encryption procedure for all passwords from a given range and comparing their hashes to the given one. The application implements the task farming parallel programming paradigm, where the master orders the slaves to search specific subranges of the feasible password space.

In order to ensure fair comparison of execution time for various data partitionings, the application checks all passwords in a given range, whether or not the appropriate password is found. Thus, depending on the assumptions about the possible password length and character set, quite different problem sizes can be achieved, which makes the application useful for execution experiments in systems with various computing powers. For example, employing both the CPU and the GPU of one "des" workstation from the department laboratory described in Section 4.2.1, it took around 30 seconds to search all passwords up to six characters long, while for passwords up to eight characters long the execution time was much longer - around 4.5 hours. In the context of recovering passwords, which often consist of many characters, reducing the application execution time would be a crucial improvement of the application.
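The task farming scheme can be illustrated with a small sequential sketch. It is not the OpenCL/KernelHive implementation: the function names are hypothetical, the "slaves" are ordinary function calls, and the alphabet is kept tiny so that the example runs instantly. As in the application described above, every subrange is searched completely, even after a match is found.

```python
import hashlib

def index_to_password(index, alphabet, max_len):
    """Map an index to a password, enumerating shorter passwords first."""
    k = len(alphabet)
    for length in range(1, max_len + 1):
        count = k ** length
        if index < count:
            chars = []
            for _ in range(length):
                index, r = divmod(index, k)
                chars.append(alphabet[r])
            return "".join(reversed(chars))
        index -= count
    raise IndexError("index outside of the password space")

def search_subrange(target_hash, start, stop, alphabet, max_len):
    """Slave task: hash every candidate in [start, stop), no early exit."""
    found = []
    for i in range(start, stop):
        candidate = index_to_password(i, alphabet, max_len)
        if hashlib.md5(candidate.encode()).hexdigest() == target_hash:
            found.append(candidate)
    return found

def master(target_hash, alphabet, max_len, n_tasks):
    """Master: split the password space into subranges and collect results."""
    total = sum(len(alphabet) ** l for l in range(1, max_len + 1))
    step = -(-total // n_tasks)  # ceiling division
    found = []
    for start in range(0, total, step):
        stop = min(start + step, total)
        found += search_subrange(target_hash, start, stop, alphabet, max_len)
    return found

target = hashlib.md5(b"cab").hexdigest()
recovered = master(target, "abc", max_len=3, n_tasks=4)
```

The size of the search space grows geometrically with the maximal password length, which is consistent with the large gap between the six-character and eight-character execution times reported above.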

The application is hybrid in two meanings. First, using the functionality of KernelHive, the application can be executed on multiple clusters consisting of multiple nodes equipped with multiple computing devices, which in turn can be parallel on a lower level. This way, the application is hybrid in the multi-level sense. Secondly, due to the implementation in OpenCL and parallelization capabilities of KernelHive, the application can be executed on heterogeneous computing devices if they provide an OpenCL runtime. This makes the application hybrid in the sense of computing device heterogeneity.

4.1.2 Heterogeneous Regular Expression Matching with Configurable Data Intensity

Optimal configuration for efficient execution of a parallel application strongly depends on the application profile, namely whether it is computationally intensive, requires frequent inter-process communication or if it mostly depends on the input data and how efficiently it can be delivered to the computing device. The idea behind the regular expression matching application proposed by the author of this thesis in [11] was to develop one application that has a different ratio of computational intensity to data intensity depending on the given input data.

The goal of the application is to find all matches of a given regular expression in a given text file. The regular expressions consist of characters which have to appear in the text in the given order and a special character "*" which is a wildcard that matches one or more occurrences of any character. This way, a complex signature of the sought text can be defined. As the application searches the whole given text file regardless of line endings, the computational cost of the search strongly depends on the assumed maximal number of characters matched by the wildcard character, as well as the number of wildcard characters in the signature. Because of this, various application profiles can be achieved, ranging from extremely compute intensive (~431s for searching a 1MiB file) to extremely data intensive (~3s for searching a 512MiB file).
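The matching semantics described above can be reproduced in a few lines by translating a signature into a standard regular expression. This is only a sketch of the semantics, not the OpenCL kernel from [11]; the function names and the `max_wildcard_span` parameter name are assumptions made for illustration.

```python
import re

def compile_signature(signature, max_wildcard_span):
    """Translate a signature with '*' wildcards into a compiled regex."""
    parts = []
    for ch in signature:
        if ch == "*":
            # One or more arbitrary characters, bounded by the assumed limit.
            parts.append(".{1,%d}?" % max_wildcard_span)
        else:
            parts.append(re.escape(ch))
    # DOTALL lets the wildcard cross line endings, as in the application.
    return re.compile("".join(parts), re.DOTALL)

def find_matches(signature, text, max_wildcard_span):
    """Return (start, end) for every start position where a match begins."""
    pattern = compile_signature(signature, max_wildcard_span)
    matches = []
    for start in range(len(text)):
        m = pattern.match(text, start)
        if m:
            matches.append((start, m.end()))
    return matches

across_lines = find_matches("a*b", "a\nb ab", 2)
too_short = find_matches("a*b", "axxxb", 2)
long_enough = find_matches("a*b", "axxxb", 3)
```

Raising the wildcard limit or adding more wildcards to the signature multiplies the number of alignments that have to be checked per start position, which is how the application moves from a data intensive to a compute intensive profile.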

The application is hybrid both in the multi-level and heterogeneous sense, because it is implemented in OpenCL and integrated with the KernelHive framework. This allows for testing the influence of the computational/data intensity ratio on the application execution on various computing devices.

4.1.3 Multi-level Heterogeneous IDW Interpolation for Air Pollution Analysis

The geostatistical interpolation application investigated within this thesis has been implemented within the master thesis by the author of this dissertation [154] as a part of a module for the SmartCity system designed to support the local government of the city of Gdańsk, Poland. The goal of the module is to provide the user with visualizations of particulate matter air pollution. Preparing the visualizations requires estimating the air pollution level at non-observed locations based on real measurements from ten regional monitoring stations, taken in hourly schedule.

The interpolation method used, inverse distance weighting (IDW), derives the interpolated value for each point in the requested area from the real measurement values, weighted inversely to the distance between the interpolated point and the measured point. To perform the interpolation for one point in time, a basic Python implementation running on one core of the Intel Core i7-4770 processor needed around 36 minutes. Given the hourly measurement schedule, this performance would be enough to render visualizations in real time; however, a practical use case was to render visualizations for multiple sets of historical data. Rendering the visualizations for measurements from one year would require around seven months of computations.
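A minimal sketch of inverse distance weighting for a single point is given below, together with the arithmetic behind the "seven months" estimate. The exponent `power=2.0` is a common textbook default and an assumption here, as is the station data; the implementation in [154] may differ in both.

```python
import math

def idw(point, stations, power=2.0):
    """Interpolate a value at point from (x, y, value) measurements."""
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in stations:
        distance = math.hypot(point[0] - x, point[1] - y)
        if distance == 0.0:
            return value  # the point coincides with a monitoring station
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

# Two made-up stations; the midpoint receives the mean of both values.
stations = [(0.0, 0.0, 10.0), (2.0, 0.0, 30.0)]
midpoint_value = idw((1.0, 0.0), stations)

# One year of hourly measurements at about 36 minutes per interpolation.
total_minutes = 36 * 24 * 365
total_days = total_minutes / (60 * 24)
```

The second part shows that a year of hourly data corresponds to 219 days of sequential computation, that is, roughly seven months.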

For this reason, a massively parallel implementation has been proposed and tested on a single computing device in [154]. In [13] the author of this thesis contributed to reducing the execution time of the application by scaling it to a multi-level setup using the proposed KernelHive framework. Being implemented in OpenCL and integrated with the KernelHive framework, the application is hybrid both in the multi-level and heterogeneous sense. Additionally, a hybrid multi-level version using MPI + OpenCL has been contributed by the author of this thesis as a test application in [13].

4.1.4 Multi-level Large Vector Similarity Measure Computation

The application first proposed in [10] for verification of the proposed simulation method is large vector similarity measure computation for big data analysis in a parallel environment. The goal of the application is to compute a similarity matrix for a large set of points in a multidimensional space, assuming that the size of the processed data does not fully fit into the memory. The implementation in C/mpi uses the master-slave parallel programming paradigm, where the master partitions the input data into chunks and distributes them to slaves which compute Euclidean distances between the points in a given chunk.

The MPI-based implementation allows spawning slave processes across multiple nodes in a cluster, where each process utilizes a single CPU core. Thus, in the case of running multiple processes per node, the application is hybrid in the multi-level sense, because it is parallelized across many nodes in a cluster and many cores within each node. An important parameter of the application that influences its execution time is the number of points in each data chunk. Finding the optimal value of this parameter through comparing times of many real executions could be prohibitively time consuming. For example, computing similarity measures for 2000 points in a space of 200000 dimensions on all 128 virtual cores of 16 "des" workstations (see Section 4.2.1) takes around 8 hours.
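The role of the chunk size can be seen in the following sequential sketch of the master-slave decomposition. It is an assumption-laden simplification of the C/MPI application from [10]: chunks are ranges of row indices, each "slave" computes the distances from its rows to all later points, and all data is held in memory.

```python
import math

def partition(n_points, chunk_size):
    """Master: split row indices into chunks of at most chunk_size points."""
    return [range(start, min(start + chunk_size, n_points))
            for start in range(0, n_points, chunk_size)]

def process_chunk(points, rows):
    """Slave: Euclidean distances from each row in the chunk to later points."""
    return {(i, j): math.dist(points[i], points[j])
            for i in rows
            for j in range(i + 1, len(points))}

def similarity_matrix(points, chunk_size):
    """Collect the upper triangle of the distance matrix chunk by chunk."""
    result = {}
    for rows in partition(len(points), chunk_size):
        result.update(process_chunk(points, rows))
    return result

points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
distances = similarity_matrix(points, chunk_size=2)
```

The result does not depend on the chunk size, only the number and the size of the messages exchanged between master and slaves do, which is why this parameter affects execution time but not correctness.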

The main contribution of the author of this thesis in the paper was developing a model of the application and conducting the described experiments with simulation of the application. A simulation model was proposed in order to estimate relatively quickly the execution times of the application for different problem sizes, application parameters and utilized hardware. Execution times of the most significant computation and communication operations have been modeled as functions of the data chunk size. The model allows finding the optimal number of points in a data chunk. What is more, it can be used to predict execution times while using different, currently unavailable computing resources.

4.1.5 Multi-level Deep Neural Network Training for Automatic Speech Recognition

The deep neural network training application optimized and modeled in the case study described in Section 6.2 of this thesis is a part of the automatic speech recognition (asr) system developed at the VoiceLab.ai company based in Gdańsk, Poland. One of the crucial elements of the recipe based on the Kaldi toolkit [155] is the acoustic model implemented as an artificial neural network. The goal of the acoustic model is to classify each audio frame in a given recording to one of the possible speech units called phonemes. The input of the network for each frame is a set of its features, in this case 13 mel-frequency cepstral coefficients (MFCC).

The specific application considered in this thesis is parallel neural network training with natural gradient and parameter averaging [156] using chosen 100 hours from the internal VoiceLab.ai corpora consisting of over 4200 hours of Polish speech from about 5700 speakers. The model is a recurrent neural network constructed from 4 layers of long short-term memory (LSTM) cells. The training consists of iterations of backpropagation algorithm performed by multiple GPUs on separate copies of the model on different training data chunks and averaging the model weights at the end of each iteration. One training epoch consists of such a number of iterations that all training examples are used. In a usual training procedure, 15 training epochs are executed, which on two workstations, cuda5 and cuda6 from the VoiceLab.ai cluster (see Section 4.2.5) takes over 11 hours. Developing an efficient acoustic model requires testing many neural network architectures trained with multiple values of training parameters and thus, running many instances of the training. Execution time reduction would be a crucial improvement of the application.
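The iteration structure (independent training on separate model copies, followed by weight averaging) can be sketched on a toy problem. The sketch uses plain stochastic gradient descent on a two-parameter linear model; it does not implement the natural gradient method of [156], an LSTM network or GPU execution, and all names and data are made up for illustration.

```python
def local_sgd(weights, chunk, lr):
    """One worker: SGD over its data chunk on a private copy of the model."""
    w = list(weights)
    for x, y in chunk:
        error = sum(wi * xi for wi, xi in zip(w, x)) - y
        w = [wi - lr * error * xi for wi, xi in zip(w, x)]
    return w

def average(models):
    """Element-wise mean of the weights of all model copies."""
    return [sum(ws) / len(models) for ws in zip(*models)]

def train(iterations, weights, lr=0.1):
    """Each iteration: train the replicas separately, then average them."""
    for chunks in iterations:
        replicas = [local_sgd(weights, chunk, lr) for chunk in chunks]
        weights = average(replicas)
    return weights

# Noise-free samples of y = 2 * x + 1; the second input is a constant bias.
chunk_a = [((1.0, 1.0), 3.0), ((0.0, 1.0), 1.0)]
chunk_b = [((2.0, 1.0), 5.0), ((-1.0, 1.0), -1.0)]
weights = train([[chunk_a, chunk_b]] * 200, [0.0, 0.0])
```

In the real application the list comprehension over replicas corresponds to GPUs working in parallel, so the slowest device determines the duration of every iteration.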

The training can utilize gpus from multiple computing nodes in a cluster, which makes the application multi-level. It is also heterogeneous, because the backpropagation algorithm is executed on gpus, while data preprocessing and model averaging are executed on a cpu. What is more, the capability of the application to efficiently utilize different gpu models is often a practical requirement. Clusters used by companies are regularly upgraded with new, more powerful computing devices and proper load balancing is required to make the most of the whole cluster without inefficiencies resulting from lower speed of the older devices.

4.2 Investigated HPC Systems

Similarly to the applications described in Section 4.1, we aimed to base our experiments on many diverse high performance computing systems. In this section we describe chosen utilized systems, including a collection of laboratory workstations with gpus in Section 4.2.1, a collection of servers with computing accelerators in Section 4.2.2, a cluster with 864 cpu cores in Section 4.2.3, a pilot laboratory for massively parallel systems in Section 4.2.4 and a professional cluster of workstations with gpus in Section 4.2.5.

4.2.1 Des - Department Laboratory Workstations

"Des" workstations are machines available in the high performance computing and artificial intelligence laboratory located at the Department of Computer Architecture - home department of the author of this thesis. Although the main purpose of the laboratory is engaging students in classes concerning parallel algorithms, high performance computing systems and massively parallel processing, in certain reserved time windows it can also be used for experiments involving the heterogeneous hardware resources of the laboratory. In particular, the 18 laboratory machines called "des01-18", equipped with an Intel i7-2600K cpu with 8 logical cores and 8GB of RAM each can be utilized as a heterogeneous computing cluster, because all nodes have also NVIDIA GeForce GTS 450 gpus with 192 CUDA cores installed, except for one node with NVIDIA GeForce GTX 480 with 480 CUDA cores.

4.2.2 Apl - Department Computing Accelerator Servers

The computing resources available at the Department of Computer Architecture include also four servers called "apl09-12" with high performance CPUs and computing accelerators. Apl09 and apl10 are each equipped with an Intel Xeon W3540 CPU with 8 logical cores, 12 GB of RAM and two GPUs: NVIDIA Tesla C2050 and NVIDIA GeForce GTX 560 Ti, both with 448 CUDA cores. Apl09 is also equipped with an NVIDIA Quadro FX 3800 GPU with 192 CUDA cores. Apl11 is a server with two Intel Xeon E5-2680 v2 CPUs with 12 logical cores each, 64GB of RAM and two Tesla K20m GPUs with 2496 CUDA cores each. Apl12 is a server with two Intel Xeon E5-2680 v2 CPUs with 20 logical cores each and two Intel Xeon Phi 5100 accelerators with 240 logical cores each. Despite fast aging of equipment, especially the GPUs, regular upgrades of the apl high performance computing server infrastructure make it a good experiment environment, particularly in the context of efficient utilization of heterogeneous computing infrastructure by a single application.

4.2.3 K2 - Department High Performance Computing Cluster

The Department of Computer Architecture maintains also K2, a high performance computing cluster consisting of 3 racks of 36 nodes each. Each node is equipped with two Intel Xeon E5345 4-core cpus and 8GB of RAM, giving a total of 864 cpu cores. The nodes are connected with an InfiniBand interconnect supported by Mellanox Technologies MT25204 network cards. The cluster is particularly useful for applications that can scale to hundreds of cores, such as the idw interpolation application described in Section 4.1.3.

4.2.4 MICLAB - Pilot Laboratory for Massively Parallel Systems

MICLAB is a laboratory at the Institute of Computer and Information Sciences of the Technical University of Częstochowa, built within the project "Pilot laboratory for massively parallel systems". The aim of the project is to create a virtual laboratory, where the nationwide scientific community can investigate the usage possibilities and define application directions of contemporary massively parallel computing architectures in leading fields of science.

The computing infrastructure of the laboratory consists of 10 high performance computing nodes. Eight of them are equipped with two Intel Xeon E5-2699 v3 CPUs and 256 GB of RAM each. The processors have 36 logical cores each, but significant computing power lies also in the Intel Xeon Phi 7120P coprocessors with 244 logical cores each. Two such coprocessors are installed in the eight aforementioned nodes, and also in two other nodes, each equipped with two Intel Xeon E5-2695 v2 CPUs with 24 logical cores each and 128 GB of RAM. The two remaining nodes are also equipped with two Intel Xeon E5-2695 v2 CPUs and 128 GB of RAM, but in their case the installed computing accelerators are two NVIDIA Tesla K80 GPUs with 4992 CUDA cores each.

4.2.5 Cuda567 - Professional Workstations With GPUs

The professional high performance computing infrastructure at the VoiceLab.ai company is dedicated to deep neural network training applications, such as the one described in Section 4.1.5. The computing power for deep learning is based mostly on gpus, which are commonly used accelerators for this purpose. Cuda567 is a subset of the infrastructure, consisting of three nodes, each with 4 NVIDIA GeForce GTX Titan X gpus with 3072 CUDA cores. The first node, cuda5 is equipped with two Intel Xeon E5-2620 v3 cpus with 12 logical cores each and 128GB of RAM. The two other nodes, cuda6 and cuda7 are also equipped with 128GB of RAM, but stronger cpus: two Intel Xeon E5-4650 v2 cpus with 20 logical cores each.

5.1 Execution Steps

The following steps related to Claim 1 of this thesis are proposed to optimize the execution of a hybrid parallel application:

  1. preliminary process optimization - if possible, modification of the implementations of parallel processes, in such a way that the result of the application remains valid, but the sequence of operations is changed in order to reduce the process execution time;

  2. application execution optimization:

    (a) process mapping - finding the process mapping function;

    (b) parameter tuning - finding the vector of application execution parameters;

  3. actual execution.

The first step has been included in the proposed execution steps in order to stress the importance of profiling and performance analysis of the application. In many practical situations, the most significant reduction of application execution time can be achieved through relatively straightforward modifications of the parallel algorithm that result for example in overlapping of certain operations or more efficient sequence of operations in terms of memory performance (e.g. loop tiling [157] or avoiding false sharing [158]). An analysis should be performed to determine the most time consuming parts of the application and identify the corresponding utilized hardware resources. If two consecutive operations require different hardware elements (for example a computing device and a network device), often an overlapping technique allows to perform these operations simultaneously. This step is particularly useful in cases of hybrid parallel applications where different hardware elements can be specialized for certain types of tasks, for example a cpu for handling I/O operations and a computing accelerator for massively parallel computations.

Step 2a is connected with the global static task mapping problem described in Section 3.2. Step 2b is connected with the offline auto-tuning of system parameters problem described in Section 3.3.

Steps 1, 2a or 2b may be omitted. For certain applications introducing preliminary process optimization may be infeasible. Similarly, there might be only one feasible process mapping or value of certain application execution parameters. The appropriate choice of performed execution steps may differ throughout applications and systems. If at all, the preliminary process optimization step should be performed as the first one, because the optimal process mappings and application execution parameters depend on the exact process implementations. They might also depend on each other, hence Steps 2a and 2b could be performed repeatedly in turns or performed simultaneously.
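One way to read "performed repeatedly in turns" is as an alternating search: fix the parameters and pick the best mapping, then fix the mapping and pick the best parameters, until nothing improves. The sketch below illustrates this reading on a toy problem; the cost table, the candidate mappings and the single tuned parameter are all hypothetical, and the thesis does not prescribe this particular procedure.

```python
def alternate(cost, mappings, params, max_rounds=10):
    """Alternate between Step 2a (mapping) and Step 2b (parameter tuning)."""
    mapping, param = mappings[0], params[0]
    best = cost(mapping, param)
    for _ in range(max_rounds):
        # Step 2a: best process mapping for the current parameters.
        mapping = min(mappings, key=lambda m: cost(m, param))
        # Step 2b: best parameters for the current process mapping.
        param = min(params, key=lambda p: cost(mapping, p))
        new = cost(mapping, param)
        if new >= best:  # no further improvement
            break
        best = new
    return mapping, param, best

def toy_cost(mapping, batch_size):
    # Hypothetical execution time: each device type prefers its own batch size.
    base = {"cpu": 10.0, "gpu": 4.0}[mapping]
    preferred = {"cpu": 64, "gpu": 256}[mapping]
    return base + abs(batch_size - preferred) / 64

mappings = ["cpu", "gpu"]
batch_sizes = [32, 64, 128, 256]
result = alternate(toy_cost, mappings, batch_sizes)

# Simultaneous variant: search the joint space of mappings and parameters.
joint = min(((m, p) for m in mappings for p in batch_sizes),
            key=lambda mp: toy_cost(*mp))
```

In this example both variants arrive at the same configuration, but in general the alternating search may stop in a local optimum, whereas the simultaneous search has to cover a space whose size is the product of both individual spaces.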

Although the last of the proposed steps, actual execution, may seem obvious and straightforward, performing it might require significant technical effort, especially in multi-level and heterogeneous systems. Dozens of software frameworks and programming interfaces are used for execution of parallel applications, depending on the target system, application characteristics and field, used programming language etc. In a multi-level heterogeneous system, a software solution is required that allows execution of the application on various types of computing devices available in the system, as well as communication through a hierarchical network infrastructure. Chosen software solutions for executing parallel applications in multi-level heterogeneous hpc systems have been described in Section 2.2.3. Many of them are mixing different APIs in one solution, which requires know-how and specialized programming effort.

In order to perform all proposed steps using one, easy to use software environment, we propose using KernelHive, a framework for parallelization of computations in multi-level heterogeneous HPC systems first introduced in the master thesis [159] by the author of this dissertation, available as free software (https://github.com/roscisz/KernelHive). The system allows parallelization of applications among clusters and workstations with CPUs and GPUs. KernelHive applications are developed using a graphical tool hive-gui by constructing a dataflow graph and implementing computational kernels assigned to the graph nodes. Custom graph nodes can be developed by implementing the IDataProcessor kernel interface, and a library of sample implementations and templates is provided. Automatic parallelization is possible through an unrollable node mechanism illustrated by Figure 5.1, where apart from an IDataProcessor kernel, two other kernels are defined for the node: IDataPartitioner, responsible for dividing the problem into a given number of subproblems, and IDataMerger, responsible for merging the results of multiple tasks solving these subproblems. This way, any application that implements the IDataPartitioner, IDataProcessor and IDataMerger kernels can be automatically parallelized to a given number of computing devices.


Figure 5.1: Illustration of the unrollable node mechanism in KernelHive [13]
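The division of responsibilities between the three kernels of an unrollable node can be mimicked in plain Python. The functions below only mirror the roles of IDataPartitioner, IDataProcessor and IDataMerger; they are hypothetical analogues, not the KernelHive API, whose kernels are written in OpenCL and executed on separate devices.

```python
def partition(data, n_parts):
    """Partitioner role: split the input into n_parts contiguous pieces."""
    size, extra = divmod(len(data), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        stop = start + size + (1 if i < extra else 0)
        parts.append(data[start:stop])
        start = stop
    return parts

def process(part):
    """Processor role: solve one subproblem (here: a sum of squares)."""
    return sum(x * x for x in part)

def merge(results):
    """Merger role: combine the partial results into the final result."""
    return sum(results)

def unroll(data, n_tasks, partitioner, processor, merger):
    """Expand one node into n_tasks processing tasks plus a merge step."""
    parts = partitioner(data, n_tasks)
    partial = [processor(p) for p in parts]  # would run on separate devices
    return merger(partial)

totals = [unroll(list(range(10)), n, partition, process, merge)
          for n in (1, 2, 3, 4)]
```

The result is the same for every number of tasks, which is the property that lets a framework choose the degree of parallelism without involving the application programmer.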

Task mapping and allocation, data transfer and automatic parallelization through the unrollable nodes are performed under the hood, allowing programmers to benefit from parallel execution while focusing only on the application rather than the complicated parallelization internals. Details about the available infrastructure along with application progress can be monitored in a graphical tool [18]. An overview of the architecture of the KernelHive system is presented in Figure 5.2.


Figure 5.2: Architecture of the KernelHive framework [13]

The components of the framework are arranged in a hierarchical structure corresponding to the computing system used. An example of such a hierarchy can be seen in Figure 6.2. The central component is the Engine, which interacts with the computing nodes via a Simple Object Access Protocol (SOAP) interface through instances of the Cluster component. These instances belong to the cluster and node management layer and are implemented as Java system daemons installed on the access nodes of the particular clusters available in the system. The Cluster instances interact via a Transmission Control Protocol (TCP) interface with the underlying Unit component instances, running as C++ system daemons on each available computing node. Using this hierarchical architecture, the framework is able to discover the available computing devices in all connected nodes. Data about the currently available computing devices, their hierarchy and state is gathered in an object-oriented data structure in the central Engine subsystem.

The proposed hierarchical framework architecture allows taking various characteristics of the system into account during optimization of the execution and, moreover, setting and auto-tuning various application execution parameters concerning different levels of the computing system. Technically, the optimization is done using a mechanism of interchangeable Optimizers focused on different goals, such as optimization of execution time, power consumption or application reliability. The user of the framework can develop and plug in a new Optimizer corresponding to their needs, but choosing from the available implementations or mixing them is also possible.

The essential processing in the KernelHive framework is performed by tasks implemented as OpenCL kernels. The process responsible for running consecutive tasks is the Unit, a subsystem of KernelHive running as a system daemon on a given node. Tasks in KernelHive are orchestrated by the Engine subsystem, the central module of KernelHive, which supports submitting multiple applications for execution. The applications are represented as directed acyclic graphs (DAGs) whose nodes represent computing tasks and whose edges represent data flow. The Engine keeps track of the current state of each application, in particular which tasks have already been completed and which tasks have all their input data gathered and are therefore ready for execution. An interchangeable Optimizer interface allows plugging in different scheduling implementations that periodically analyze the set of jobs ready for execution and decide which ones should be executed next and on which available computing devices, according to a given optimization strategy.
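
The interplay between readiness tracking and a pluggable scheduling strategy can be illustrated by the following minimal Python sketch. It is not the KernelHive implementation (whose Engine is a separate subsystem with its own interfaces); the class names Engine, Optimizer and FifoOptimizer, their methods, and the task and device names are hypothetical and chosen only for this example.

```python
from typing import Dict, List, Protocol, Set, Tuple


class Optimizer(Protocol):
    """Hypothetical scheduling interface: maps ready tasks to free devices."""

    def assign(self, ready: List[str], free_devices: List[str]) -> List[Tuple[str, str]]:
        ...


class FifoOptimizer:
    """Trivial strategy: pair ready tasks (in name order) with free devices."""

    def assign(self, ready: List[str], free_devices: List[str]) -> List[Tuple[str, str]]:
        return list(zip(sorted(ready), free_devices))


class Engine:
    """Tracks the state of one application DAG and delegates scheduling."""

    def __init__(self, deps: Dict[str, Set[str]], optimizer: Optimizer) -> None:
        self.deps = deps                # task -> set of predecessor tasks
        self.done: Set[str] = set()     # tasks that have completed
        self.running: Set[str] = set()  # tasks currently assigned to a device
        self.optimizer = optimizer

    def ready_tasks(self) -> List[str]:
        """A task is ready when all its predecessors are done and it is not started."""
        return [
            task
            for task, predecessors in self.deps.items()
            if task not in self.done
            and task not in self.running
            and predecessors <= self.done
        ]

    def schedule(self, free_devices: List[str]) -> List[Tuple[str, str]]:
        """Ask the plugged-in optimizer which ready tasks to start and where."""
        assignment = self.optimizer.assign(self.ready_tasks(), free_devices)
        self.running.update(task for task, _ in assignment)
        return assignment

    def complete(self, task: str) -> None:
        """Record that a running task has finished."""
        self.running.discard(task)
        self.done.add(task)


if __name__ == "__main__":
    dag = {
        "partition": set(),
        "proc0": {"partition"},
        "proc1": {"partition"},
        "merge": {"proc0", "proc1"},
    }
    engine = Engine(dag, FifoOptimizer())
    print(engine.schedule(["gpu0", "gpu1"]))
    engine.complete("partition")
    print(engine.schedule(["gpu0", "gpu1"]))
```

Replacing FifoOptimizer with another class that implements assign changes the scheduling strategy without touching the state tracking, which is the property the interchangeable interface is meant to provide.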

An improved and tested version of KernelHive has been described in [13] along with a specific execution methodology consisting of the following steps:

  • selection of computing devices;

  • determination of best grid configurations for particular compute devices;

  • determination of the preferred data partitioning and granularity;

  • actual execution.

The execution steps proposed in this thesis are a more general version of this methodology, extended with Step 1, preliminary process optimization. Step 2a, process mapping, is a broader term for selection of computing devices, while Step 2b, parameter tuning, includes determination of both grid configurations (one of the possible execution parameters) and data partitioning (one of the possible application parameters). The actual execution step remains unchanged.

5.2 Modeling and Simulation for Fast Evaluation of Execution Configurations

As noted in Section 5.1, steps 2a (process mapping) and 2b (parameter tuning) could be performed simultaneously. In fact, these two steps represent solving the optimization problem defined in Equation 1.1 in Section 1.2 when the process implementations impl are already optimized and will not be changed any more. Selected solutions to similar optimization problems are discussed in Chapter 3. Different methods can be suitable for this task depending on the size of the search space and the cost of evaluating each solution. If both the search space and the cost of solution evaluation are small, exhaustive search can be used, which guarantees finding a global optimum, because it consists of systematic evaluation of every possible alternative. In all other cases we propose using a simulation method for fast solution evaluation.

In the cases of a small search space but a high cost of solution evaluation, or a low cost of solution evaluation but a big search space without the possibility of space pruning, combinatorial search methods could be used, such as local search, simulated annealing or evolutionary algorithms, which do not guarantee finding either a local or a global optimum. The availability of a sufficiently accurate simulation method would still allow performing exhaustive search. In extreme cases of prohibitively big search spaces and high costs of solution evaluation, such a simulation method could also be useful as a fast evaluation method for combinatorial search algorithms. In this section we propose a simulation method for these purposes, related to Claim 2 of this thesis.
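
To make the role of a fast evaluator concrete, the following Python sketch runs an exhaustive search in which every configuration is evaluated by a simulation function instead of a real execution. All names and numbers are assumptions made for illustration: toy_simulate is a hypothetical stand-in for a simulator, its analytical formulas are invented, and the selection rule (shortest time under a power limit) is only one possible way of combining the two criteria, not a restatement of Equation 1.1.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Config = Dict[str, int]


def toy_simulate(config: Config) -> Tuple[float, float]:
    """Hypothetical simulator stand-in returning (time in s, power in W).

    The formulas are invented for this sketch: computation scales with the
    number of devices, finer partitioning reduces load imbalance but adds
    a fixed overhead per chunk, and each device adds to the power draw.
    """
    devices, chunks = config["devices"], config["chunks"]
    time_s = 120.0 / devices * (1.0 + 0.2 / chunks) + 0.5 * chunks
    power_w = 50.0 + 100.0 * devices
    return time_s, power_w


def exhaustive_search(
    space: Dict[str, List[int]],
    evaluate: Callable[[Config], Tuple[float, float]],
    power_limit_w: float,
) -> Tuple[Optional[Config], float]:
    """Evaluate every point of the space; keep the fastest one within the limit."""
    best_config: Optional[Config] = None
    best_time = float("inf")
    names = list(space)
    for values in product(*(space[name] for name in names)):
        config = dict(zip(names, values))
        time_s, power_w = evaluate(config)
        if power_w <= power_limit_w and time_s < best_time:
            best_config, best_time = config, time_s
    return best_config, best_time


if __name__ == "__main__":
    search_space = {"devices": [1, 2, 4, 8], "chunks": [1, 2, 4, 8]}
    print(exhaustive_search(search_space, toy_simulate, power_limit_w=500.0))
```

Since the evaluator is passed in as a function, the same loop could call a real execution, a simulator, or an analytical model; only the cost per evaluation changes, which is exactly what determines whether exhaustive search remains feasible.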

Searching for a suitable simulation tool, in [21] we reviewed selected existing parallel application simulators and provided motivations for developing a new discrete-event simulator of parallel application execution on large-scale distributed systems. MERPSYS, the simulation environment proposed in [14], allows accurately predicting the execution time and power consumption of parallel applications and analyzing the power/time trade-off by performing the following steps:

  1. preparing the application model by defining:

    • the process implementations, using the Editor graphical tool for writing code in a Java-based meta-language which provides an API for modeling various types of computation and communication operations. This requires identifying the crucial operations and thus deciding on the granularity of the model by analyzing the code of an existing application or by providing the operations from scratch. The granularity level should allow defining the modeling functions described in point 4;

    • process requirements by inserting their values into a form;

  2. preparing the system model by using the Editor graphical tool for building the hardware graph from computing devices and network links available in a database;

  3. defining hardware capabilities by inserting values for each process into a form available after selecting a certain device in the Editor;

  4. defining certain modeling functions using a Web application for filling in JavaScript snippets that have access to computing device characteristics and operation parameters. The functions may be based on analytical performance and power consumption models. If possible, we suggest tuning these functions using results of real application executions. The following modeling functions are required by the proposed simulator:

    • execution time of a computation operation comp using a computing device device;

    • execution time of a communication operation comm using a network link networklink;

    • idle power consumption of a computing device device;

    • peak power consumption of a computing device device.

    Additionally, a hardware parameter is required, which denotes the number of cores of a computing device device.

  5. simulating the application execution and analyzing the resulting values of execution time and average power consumption through:

    • providing a scheduling mechanism which defines the process mapping function mapping by choosing or writing an implementation of a Scheduler programming interface;

    • using the Editor graphical tool for choosing specific values of application execution parameters executionparameters, enqueuing a single simulation instance and analyzing its results;

    • running one or more instances of the Simulator program which would execute in parallel all simulation instances enqueued in the simulation queue;

    • using the Web interface for enqueuing an optimizer suite - an automatically populated set of simulation instances based on the previously executed single simulation instance, with a range of varying values of certain application execution parameters in executionparameters and, thus, defining the space of application execution parameters feasibleexecutionparameters;

    • using a ParetoVisualizer tool for viewing a chart of results for all simulation instances in a suite, with execution time and power consumption as axes, the set of Pareto-optimal solutions indicated, and the values of the varying application execution parameters accessible by hovering over a data point.
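
The selection of Pareto-optimal solutions from the results of a simulation suite can be sketched as follows. This is a generic dominance filter written in Python for illustration and is not taken from the ParetoVisualizer tool; the Result type, the labels, and the numeric values are hypothetical.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    """Outcome of one simulation instance (hypothetical structure)."""

    label: str
    time_s: float
    power_w: float


def dominates(a: Result, b: Result) -> bool:
    """a dominates b if it is no worse in both criteria and better in at least one."""
    no_worse = a.time_s <= b.time_s and a.power_w <= b.power_w
    strictly_better = a.time_s < b.time_s or a.power_w < b.power_w
    return no_worse and strictly_better


def pareto_front(results: List[Result]) -> List[Result]:
    """Keep the results not dominated by any other, ordered by execution time."""
    front = [r for r in results if not any(dominates(other, r) for other in results)]
    return sorted(front, key=lambda r: r.time_s)


if __name__ == "__main__":
    suite = [
        Result("A", 10.0, 400.0),
        Result("B", 20.0, 250.0),
        Result("C", 25.0, 300.0),  # slower and more power-hungry than B
        Result("D", 40.0, 150.0),
        Result("E", 40.0, 200.0),  # same time as D but higher power
    ]
    for result in pareto_front(suite):
        print(result)
```

The pairwise comparison is quadratic in the number of simulation instances, which is acceptable for suites of the size that can be inspected visually; each point remaining on the front represents a different power/time trade-off.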

The proposed simulation environment performs a discrete-event simulation that runs the application model codes of all defined processes and increases appropriate execution time and energy consumption counters for all computation and communication operations. It should be noted that, in the case of communication operations, the simulator ensures proper synchronization between the processes, so that for each process, possible waiting for another process is included in the execution time of the operation. The execution time of a process is modeled as the sum of the execution times of all computation and communication operations: