The emergence of heterogeneous systems is one of the most important milestones in parallel computing in recent years. A heterogeneous system is composed of general-purpose CPUs and special-purpose hardware accelerators, such as GPUs, Xeon Phi and FPGAs. This concept covers a wide range of systems, from powerful computing nodes capable of delivering teraflops to chips that integrate a GPU and a CPU. This architecture not only significantly increases the computing power of the nodes, but also improves their energy efficiency.
However, this architecture also presents a series of challenges, among which the complexity of its programming stands out. In this sense, the Open Computing Language (OpenCL) has been developed as an API that extends the C/C++ programming languages for the programming of heterogeneous systems. But OpenCL leaves the management of a large number of aspects in the hands of the programmer, which greatly complicates programming and turns it into an error-prone process, significantly reducing programmer productivity.
OpenCL is a language with a low abstraction level that forces the programmer to know the system architecture in detail. When using OpenCL, the programmer is responsible for explicitly determining the kind and architecture of the different devices in the heterogeneous system. The programmer manages the communication between the host and each of the devices, as well as the distributed memory address space, copying the input data and collecting the results generated in each device. The programmer is also responsible for selecting the devices on which each kernel is going to be enqueued, as well as for partitioning the data among them.
To overcome these problems, this paper presents EngineCL, a new OpenCL-based runtime API that significantly improves the usability of heterogeneous systems without any loss of performance. It accomplishes complex operations transparently for the programmer, such as discovery of platforms and devices, data management, load balancing and robustness, through a set of efficient techniques. EngineCL follows Architectural Principles with known Design Patterns to strengthen the flexibility in the face of changes. Following the Host-Device programming model, the runtime manages a single data-parallel kernel among all the devices in the heterogeneous system.
EngineCL has been validated both in terms of usability and performance, using two architectures with different devices, such as CPUs, integrated and discrete GPUs and Xeon Phi. Regarding usability, eight metrics have been used, achieving excellent results in all of them. In terms of performance, the average overhead is about 1% at most, and on some devices slight improvements over the baseline are even achieved.
There are projects aiming at high-level parallel programming in C++, but most of them provide a C++ API similar to the Standard Template Library (STL) to ease parallel programming, like Boost.Compute, HPX, Thrust, SYCL and the C++ Extensions for Parallelism. While Thrust is tied to CUDA devices, HPX and the C++ Technical Specification are not OpenCL-centred, but projects like HPX.Compute and SYCLParallelSTL provide backends for OpenCL via SYCL. Projects like GrPPI, SkelCL and SkePU provide composable primitives and skeletons to build parallel applications. GrPPI gives interesting reusable patterns for stream and data-parallel processing with many backends, but not OpenCL. SkePU and SkelCL provide data management, but the programmer is required to use the libraries' own data containers. Also, there are C-programmed libraries with similar objectives, but they provide low-level APIs where the programmer needs to specify many parameters and the density of the code is considerable. While Maat uses OpenCL to achieve code portability, Multi-Controllers is CUDA and OpenMP-centred, but allows kernel specialisation. In contrast, EngineCL targets a higher-level API with an application domain as execution unit, significantly increasing usability. It provides different API layers, allows kernel specialisation and direct usage of C++ containers, manages the data and work distribution transparently between devices, and has negligible overheads compared with the previous projects.
The main contributions of this paper are the following:
- It presents EngineCL, a runtime that notably simplifies the programming of data-parallel applications on a heterogeneous system.
- EngineCL ensures performance portability, fully exploiting heterogeneous machines.
- It provides an exhaustive experimental validation of both the usability and the performance of the runtime, which supports the conclusion that it behaves excellently in both respects.
The rest of this paper is organised as follows. Section 2 describes the design and implementation of EngineCL. Section 3 presents two examples of how to use the API. The methodology used for the validation is explained in Section 4, while the experimental results are shown in Section 5. Finally, in Section 6, the most important conclusions and future work are presented.
2 Design and Implementation
EngineCL has been designed with many principles in mind, all around three pillars: OpenCL, Usability and Performance.
It is tightly coupled to OpenCL and how it works. The modules of the system and their relations have been defined according to the most efficient and stable patterns. Every design decision has been benchmarked and profiled to achieve the best solution in each of its parts, with priority given to the modules related to data management, synchronisation and API abstraction.
While OpenCL allows code portability on different types of devices, the programmer is responsible for managing many concepts related to the architecture, such as platforms, devices, contexts, buffers, queues, kernels, kernel arguments, data transfers, kernel executions and error control sections. Figure 1 depicts a generic OpenCL program, conceptually and in density of code, compared with the EngineCL version. As the number of devices, operations and data management processes increases, the code grows quickly with OpenCL, decreasing the productivity and increasing the maintainability effort. EngineCL solves these issues by providing a runtime with a higher-level API that manages all the OpenCL resources of the underlying system independently.
EngineCL redefines the concept of program to facilitate its usage and the understanding of a kernel execution. Because a program is associated with the application domain, it has data inputs and outputs, a kernel and an output pattern. The data is materialised as C++ containers (like vector), memory regions (C pointers) and kernel arguments (POD-like types, pointers or custom types). The kernel directly accepts an OpenCL kernel string, and the output pattern is the relation between the global work size and the size of the output buffer written by the kernel. The default value is 1, because every work-item (thread) writes to a single position in the output buffers (e.g. the third work-item writes to the third index of every output buffer). The runtime is designed to support massive data-parallel kernels, but thanks to the program abstraction it will be able to orchestrate multi-kernel executions (task-parallelism), prefetching of data inputs, optimal data transfer distribution and iterative kernels, and to track kernel dependencies and act accordingly. Therefore, the architecture of the runtime is not constrained to a single program.
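To make the output pattern concrete, the following minimal sketch computes how many output elements correspond to a given number of work-items. The helpers output_elements and output_range are hypothetical names introduced only for illustration and are not part of the EngineCL API; the sketch also assumes the pattern can be treated as a plain ratio of output elements per work-item.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper (not EngineCL API): number of output elements written
// by `work_items` work-items when each work-item accounts for `out_pattern`
// output elements (1.0 is the default described in the text).
std::size_t output_elements(std::size_t work_items, double out_pattern) {
    assert(out_pattern > 0.0);
    // Round to the nearest integer to absorb floating-point error in ratios
    // such as 1.0 / 255.
    return static_cast<std::size_t>(std::llround(work_items * out_pattern));
}

struct OutputRange {
    std::size_t offset;  // first output index covered
    std::size_t size;    // number of output elements covered
};

// Hypothetical helper: output sub-range that corresponds to the chunk of
// work-items [wi_offset, wi_offset + wi_size). A runtime that splits the
// work between devices needs this mapping to know which part of the output
// buffer each device produces.
OutputRange output_range(std::size_t wi_offset, std::size_t wi_size,
                         double out_pattern) {
    return {output_elements(wi_offset, out_pattern),
            output_elements(wi_size, out_pattern)};
}
```

With the default pattern of 1, the output size equals the global work size. With the values of the Binomial listing in Section 3 (lws = 255, gws = 255 * 4194304 and a pattern of 1.0 / lws), output_elements yields 4194304 elements, which matches the size of the out vector declared there.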
The runtime follows Architectural Principles with well-known Design Patterns to strengthen the flexibility in the face of changes. As can be seen in Figure 2, the Tier-1 API is provided by means of a Facade Pattern, which eases the use and readability of the Tier-2 modules and reduces the signature of the higher-level API to the most common usage patterns. The Buffer is implemented as a Proxy Pattern to provide extra management features and a common interface for different types of containers, independently of their nature (C pointers, C++ containers) and their locality (host or device memory). Currently, it supports host-initialised C pointers and C++ vector containers, and other types can be easily integrated with this pattern. Finally, the Strategy Pattern has been used in the pluggable scheduling system, where each scheduler is encapsulated as a strategy that can be easily interchanged within the family of algorithms. Because of its common interface, new schedulers can be provided to the runtime.
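The pluggable scheduling described above can be pictured with a minimal Strategy Pattern sketch. All names used here (Scheduler, EvenScheduler, FirstDeviceScheduler, Runtime, split, plan) are hypothetical and are not taken from the EngineCL sources; only the idea of interchangeable schedulers behind a common interface comes from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

struct Chunk {
    std::size_t offset;  // first work-item of the chunk
    std::size_t size;    // number of work-items in the chunk
};

// Common interface: every scheduler is one interchangeable strategy.
class Scheduler {
public:
    virtual ~Scheduler() = default;
    virtual std::vector<Chunk> split(std::size_t work_items,
                                     std::size_t devices) const = 0;
};

// Strategy 1: equal shares; the last device absorbs the remainder.
class EvenScheduler : public Scheduler {
public:
    std::vector<Chunk> split(std::size_t work_items,
                             std::size_t devices) const override {
        assert(devices > 0);
        std::vector<Chunk> chunks;
        const std::size_t base = work_items / devices;
        std::size_t offset = 0;
        for (std::size_t d = 0; d < devices; ++d) {
            const std::size_t size =
                (d + 1 == devices) ? work_items - offset : base;
            chunks.push_back({offset, size});
            offset += size;
        }
        return chunks;
    }
};

// Strategy 2: all the work goes to the first device.
class FirstDeviceScheduler : public Scheduler {
public:
    std::vector<Chunk> split(std::size_t work_items,
                             std::size_t devices) const override {
        assert(devices > 0);
        std::vector<Chunk> chunks(devices, Chunk{work_items, 0});
        chunks[0] = {0, work_items};
        return chunks;
    }
};

// The runtime only depends on the interface, so strategies can be swapped.
class Runtime {
public:
    void scheduler(std::unique_ptr<Scheduler> s) { scheduler_ = std::move(s); }
    std::vector<Chunk> plan(std::size_t work_items, std::size_t devices) const {
        assert(scheduler_ != nullptr);
        return scheduler_->split(work_items, devices);
    }
private:
    std::unique_ptr<Scheduler> scheduler_;
};
```

Swapping the strategy changes the work distribution without touching the code that consumes the plan, which is the property the text attributes to the scheduling system.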
EngineCL has been developed in C++, mostly using modern C++11 features to reduce the overhead and code size introduced by providing a higher abstraction level. Many modern features like rvalue references, initializer lists and variadic templates have been used to provide a better and simpler API, while efficient management operations are performed inside the runtime. When profiling reveals a trade-off between the internal maintainability of the runtime and a performance penalty, the implementation with the minimal performance overhead has been chosen.
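As an illustration of how variadic templates can shorten an API of this kind, the sketch below offers positional, aggregate and all-at-once argument setting on a toy class. ArgTable and its members are hypothetical stand-ins, not the EngineCL Program class; the class only records the size in bytes of each argument, and it takes arguments by const reference for brevity instead of using rvalue references.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a kernel-argument table.
class ArgTable {
public:
    // Positional form: set the argument at an explicit index.
    template <typename T>
    void arg(std::size_t index, const T& value) {
        if (index >= sizes_.size()) sizes_.resize(index + 1, 0);
        sizes_[index] = sizeof(value);
        next_ = index + 1;
    }

    // Aggregate form: set the argument that follows the last one set.
    template <typename T>
    void arg(const T& value) {
        arg(next_, value);
    }

    // Variadic form: set every argument in a single call, in order.
    template <typename... Ts>
    void args(const Ts&... values) {
        sizes_.clear();
        next_ = 0;
        // C++11 pack expansion inside a braced initializer list, which
        // guarantees left-to-right evaluation of the calls to arg().
        int expand[] = {0, (arg(values), 0)...};
        (void)expand;
    }

    std::size_t count() const { return sizes_.size(); }

    std::size_t size_of(std::size_t index) const {
        assert(index < sizes_.size());
        return sizes_[index];
    }

private:
    std::vector<std::size_t> sizes_;
    std::size_t next_ = 0;
};
```

A single args call replaces one arg call per parameter, so the call site shrinks without any run-time dispatch: the pack is expanded at compile time.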
The runtime is layered in three tiers, of which Tier-1 and Tier-2 are accessible to the programmer. The lower the tier, the more functionality and advanced features can be manipulated. Most programs can be implemented in EngineCL with just Tier-1, by using the EngineCL and Program classes. Tier-2 should be accessed if the programmer wants to select a specific Device and provide a specialised kernel, use the Configurator to obtain statistics and optimise the internal behaviour of the runtime, or set options for the Scheduler. Tier-3 comprises the hidden inner parts of the runtime, which allow a flexible system regarding memory management, pluggable schedulers, work distribution, high concurrency and OpenCL encapsulation.
The implementation follows feature-driven development to allow incremental features based on requested needs when integrating new vendors, devices, types of devices and benchmarks. Implementation techniques are profiled with a variety of OpenCL drivers from the major vendors and versions, and also on devices of different nature, such as integrated and discrete GPUs, CPUs and accelerators. EngineCL has a multi-threaded architecture that combines the techniques that measured best regarding OpenCL management of queues, devices and buffers. Some of the decisions involve atomic queues, parallelised operations, custom buffer implementations, reuse of costly OpenCL functions, and efficient asynchronous enqueueing of operations based on callbacks and event chaining. These mechanisms are used internally by the runtime and hidden from the programmer to achieve efficient executions and transparent management of devices and data.
In short, EngineCL has been designed following an API-driven and feature-driven development, with high external usability (API design) and internal adaptability to new runtime features as its main objectives, provided that performance is not penalised. This is accomplished through a layered architecture and a set of well-profiled and encapsulated core modules.
3 API Utilisation
Listing 3 depicts two benchmark examples using the EngineCL runtime, Binomial (left) and NBody (right). Both programs start by reading their kernels and defining variables, containers (C++ vectors) and OpenCL values like local work size and global work size (lws, gws). Then, each program fills these variables according to the benchmark (init_setup). The rest of the program is where EngineCL is instantiated, used and released.
Regarding the Binomial example, the engine uses the very first CPU in the system by means of a DeviceMask; from the OpenCL driver's point of view this is usually a single chip. Then, the global work size (total number of work-items) and local work size (work-items per work-group) are given to the engine using explicit methods. The application domain starts by instantiating the program. The input and output containers are set with the in and out methods. With these statements the programmer notifies the runtime that the computation will need the values from the inputs before executing the kernel, and that the outputs will be written after the execution. The out_pattern is set because the implementation of the Binomial OpenCL kernel uses a writing pattern of 1/255 (1.0f / lws in the listing, with lws = 255). Therefore, 255 work-items cooperate to compute each output value. Then, the kernel is configured by setting its source code string, name and arguments. The above variables and containers can be used directly as kernel arguments. Also, arguments can be assigned in aggregate and positional forms, allowing easy and flexible assignments. When local memory is needed, the enumerated value LocalAlloc indicates that the assigned value represents the bytes of local memory to be reserved, reducing the complexity of the API. Finally, the runtime consumes the program and all the computation is performed. When the run method finishes, the output values are in the containers. As shown in the comments, errors can be checked and processed easily.
On the other hand, the NBody program shows a more advanced example where EngineCL really excels. In this example, three different kernels are shown: one is a common NBody kernel, another is a specific implementation for GPUs, and the third is a binary kernel built for the Xeon Phi. The Device class from Tier-2 allows more features, like platform and device selection by index (platform, device) and specialisation of kernels and building options. Three specific devices of the heterogeneous system are instantiated, two of them with special kernels (source and binary), simply by passing them the file contents. After defining the work-items in a single method, the runtime is configured to use the Static scheduler with different work distributions for the CPU, Phi and GPU. If the proportions are not set, the scheduler will choose different distribution patterns like even or device-type distribution (e.g. GPU greater than CPU), depending on the runtime configuration (not shown in this example). Finally, the program is instantiated and defined. In this case the out pattern is not needed because every work-item computes one output value (the default pattern of 1), and the seven arguments are set in a single method, increasing the productivity even more.
Binomial:

```cpp
auto kernel = file_read("binomial.cl");
auto samples = 16777216;
auto steps = 254;
auto steps1 = steps + 1;
auto lws = steps1;
auto samplesBy4 = samples / 4;
auto gws = lws * samplesBy4;
vector<cl_float4> in(samplesBy4);
vector<cl_float4> out(samplesBy4);

binomial_init_setup(samplesBy4, in, out);

ecl::EngineCL engine;
engine.use(ecl::DeviceMask::CPU); // 1 Chip

ecl::Program program;
program.in(in);
program.out(out);

program.out_pattern(1.0f / lws);

program.kernel(kernel, "binomial_opts");
program.arg(0, steps); // positional by index
program.arg(in);       // aggregate
program.arg(out);
program.arg(steps1 * sizeof(cl_float4), ecl::Arg::LocalAlloc);
program.arg(4, steps * sizeof(cl_float4), ecl::Arg::LocalAlloc);

// Optional:
// if (engine.has_errors())
//   for (auto& err : engine.get_errors())
//     show or process errors
```

NBody:

```cpp
auto kernel = file_read("nbody.cl");
auto gpu_kernel = file_read("nbody.gpu.cl");
auto phi_kernel_bin = file_read_binary("nbody.phi.cl.bin");
auto bodies = 512000;
auto del_t = 0.005f;
auto esp_sqr = 500.0f;
auto lws = 64;
auto gws = bodies;
vector<cl_float4> in_pos(bodies);
vector<cl_float4> in_vel(bodies);
vector<cl_float4> out_pos(bodies);
vector<cl_float4> out_vel(bodies);

nbody_init_setup(bodies, del_t, esp_sqr, in_pos, in_vel, out_pos, out_vel);

ecl::EngineCL engine;
engine.use(ecl::Device(0, 0),
           ecl::Device(0, 1, phi_kernel_bin),
           ecl::Device(1, 0, gpu_kernel));

auto props = {0.08, 0.3};
engine.scheduler(ecl::Scheduler::Static(props));

ecl::Program program;
program.in(in_pos);
program.in(in_vel);
program.out(out_pos);
program.out(out_vel);

program.kernel(kernel, "nbody");
program.args(in_pos, in_vel, bodies, del_t, esp_sqr, out_pos, out_vel);
```
As shown, EngineCL manages both programs with an easy and similar API, yet behaves in completely different ways: Binomial is executed entirely on the CPU, while NBody is computed using the CPU, Xeon Phi and GPU with different kernel specialisations and workloads. All the platform and device discovery, data management, compilation and specialisation, synchronisation and computation are performed transparently for the programmer in a few lines.
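The static work distribution used in the NBody example can be sketched as a split of the global work size by proportions. This is only an illustration under stated assumptions: static_split is a hypothetical function, not the EngineCL scheduler; it assumes that each listed proportion belongs to one device in order, that the last device receives whatever remains, and that every chunk must be a multiple of the local work size.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Chunk {
    std::size_t offset;  // first work-item assigned to the device
    std::size_t size;    // number of work-items assigned to the device
};

// Hypothetical static split: props.size() + 1 devices share gws work-items.
// Work is assigned in whole work-groups so that every chunk size is a
// multiple of lws.
std::vector<Chunk> static_split(std::size_t gws, std::size_t lws,
                                const std::vector<double>& props) {
    assert(lws > 0 && gws % lws == 0);
    const std::size_t groups = gws / lws;
    std::vector<Chunk> chunks;
    std::size_t used = 0;  // work-groups already assigned
    for (double p : props) {
        assert(p >= 0.0 && p <= 1.0);
        std::size_t g = static_cast<std::size_t>(std::llround(p * groups));
        if (g > groups - used) g = groups - used;  // never exceed the total
        chunks.push_back({used * lws, g * lws});
        used += g;
    }
    // Assumption: the last device takes the remaining work-groups.
    chunks.push_back({used * lws, (groups - used) * lws});
    return chunks;
}
```

With the NBody values of the listing (gws = 512000, lws = 64 and proportions 0.08 and 0.3), this sketch assigns 40960, 153600 and 317440 work-items to the first, second and third device respectively.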
EngineCL has been validated both in terms of usability and performance. Five benchmarks have been used to show a variety of scenarios regarding the ease of use and the overheads compared with a native version in OpenCL C++. Table 1 shows the properties of every benchmark. Gaussian, Binomial, Mandelbrot and NBody are part of the AMD APP SDK, while Ray is an open source implementation of a raytracer. These five benchmarks are selected because they provide enough variety in terms of OpenCL development issues: many parameter types, local and global memory usage, custom structs and types, number of buffers and arguments, different local work sizes and output patterns.
These benchmarks compare the usage of a single device in both cases. The advantage of EngineCL over OpenCL in performance and usability grows with the number of devices, thanks to its scheduling system, work distribution and API usability, but that analysis is beyond the scope of this paper.
|Property|Gaussian|Ray|Binomial|Mandelbrot|NBody|
|---|---|---|---|---|---|
|Local Work Size|128|128|255|256|64|
|Number of kernel args|6|11|5|8|7|
|Use local memory|no|yes|yes|no|no|
|Use custom types|no|yes|no|no|no|
The validation of usability is performed with eight metrics based on a set of studies, each one applied to every benchmark. These metrics determine the usability of a system and the programmer productivity, because the more complex the API is, the harder it is to use and maintain the program.
McCabe's cyclomatic complexity (CC) measures the number of linearly independent paths. It is the only metric that is better the closer it gets to one, whereas for the rest a greater value implies a greater complexity. The number of C++ tokens (TOK) and lines of code (LOC, via cloc) determine the amount of code. The Operation Argument Complexity (OAC) gives a summation of the complexity of all the parameter types of a method, while Interface Size (IS) measures the complexity of a method based on a combination of the types and number of parameters. The OAC and IS of every implementation are the sums of the OAC and IS of the methods it uses, respectively. Maintainability worsens as more parameters and more complex data types are manipulated. In addition, INST and MET measure the number of structs/classes instantiated and methods used. Finally, the error control sections (ERRC) metric measures the number of sections involved in error checking. A ratio of $\frac{OpenCL}{EngineCL}$ is calculated to show the impact on usability per benchmark and metric.
Regarding the performance evaluation, the experiments are carried out on two different machines. The first machine, labelled Batel, is composed of two Intel Xeon E5-2620 CPUs, an NVIDIA Kepler K20m GPU and an Intel Xeon Phi KNC 7120P. Thanks to the QPI connection, the two CPUs are treated as a single device, and the OpenCL driver exposes them as such. The second system, labelled APU, consists of one AMD A10-7850K APU with an integrated Radeon R7 GPU.
Every benchmark has four custom problem sizes per device, each one with completion times between 5 and 25 seconds, depending on the device limits regarding memory and global work size. The problem sizes changed for each device are the image size for Gaussian, Ray and Mandelbrot, the number of options for Binomial and the number of bodies for NBody. For every benchmark and problem size, 20 iterations are executed contiguously without a wait period. An initial execution is discarded for every set of iterations to avoid warm-up penalties in some OpenCL drivers and devices.
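The measurement procedure above can be sketched as a small timing harness. The functions measure and mean are hypothetical names, the benchmark run is abstracted as a callable, and aggregating the iterations with an arithmetic mean is an assumption, since the text does not state how the 20 timings are combined.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Runs `run` once as a discarded warm-up and then `iterations` times back to
// back (no wait period), returning the duration of each iteration in seconds.
std::vector<double> measure(const std::function<void()>& run,
                            std::size_t iterations) {
    using clock = std::chrono::steady_clock;
    run();  // warm-up execution, not recorded
    std::vector<double> seconds;
    seconds.reserve(iterations);
    for (std::size_t i = 0; i < iterations; ++i) {
        const clock::time_point start = clock::now();
        run();
        const clock::time_point stop = clock::now();
        seconds.push_back(std::chrono::duration<double>(stop - start).count());
    }
    return seconds;
}

// Arithmetic mean of the recorded times (assumed aggregation).
double mean(const std::vector<double>& values) {
    assert(!values.empty());
    double sum = 0.0;
    for (double v : values) sum += v;
    return sum / static_cast<double>(values.size());
}
```

Calling measure with 20 iterations executes the callable 21 times and records only the last 20, which mirrors the discarded initial execution described in the text.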
To evaluate the performance of EngineCL, the time overhead, expressed as a percentage, is used as the metric. This overhead is computed as the difference between the response times of one kernel with EngineCL ($T_{ECL}$) and with the native version ($T_{OCL}$), divided by the time required by the native version, as follows: $Overhead = \frac{T_{ECL} - T_{OCL}}{T_{OCL}} \cdot 100$.
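The overhead metric described above translates directly into code. The function name overhead_percent and its parameter names are chosen here for illustration only.

```cpp
#include <cassert>

// Time overhead of the EngineCL version relative to the native OpenCL
// version, in percent. Negative values mean the EngineCL run was faster.
double overhead_percent(double t_engine, double t_native) {
    assert(t_native > 0.0);
    return (t_engine - t_native) / t_native * 100.0;
}
```

For instance, response times of 10.1 s with the runtime and 10.0 s natively give an overhead of about 1%, while 9.0 s against 10.0 s gives about -10%.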
This section shows the experiments performed to evaluate the usability introduced by EngineCL when a single device is used. Table 2 presents the values obtained for every benchmark (rows) in each of the eight metrics (columns), as described in Section 4. The average ratio per metric is also presented.
For every program, the maintainability and testing effort is reduced drastically, as can be seen in metrics like ERRC and CC, the latter reaching the ideal cyclomatic complexity. Error-checking sections are on average 21 times fewer with EngineCL, which reduces the visual complexity of alternate paths for error control that are independent of the application domain (e.g. checking that an OpenCL buffer was created correctly is not related to the problem to solve).
The density of the code and the complexity of the operations involved are reduced by between 7.3 and 8.5 times compared with OpenCL, as shown by the number of tokens, the complexity of the types and the interface sizes. In programs like Ray the ratio for OAC is greater than for TOK, because the number of parameters grows in both implementations, but managing complex types is harder in OpenCL.
The number of classes instantiated and the number of methods used are around 5 and 2 times lower, respectively, than in the OpenCL implementation. These ratios are comparatively modest mainly because the Device has been deliberately instantiated and one argument per line is set (program.arg), instead of using DeviceMask to avoid direct instantiations and the more compact specification of all the arguments in a single line (program.args).
In summary, EngineCL has excellent results in maintainability, implying less development effort. Thanks to its API usability, programmers are able to focus on the application domain, and their productivity is boosted by hiding complex decisions, operations and checks related to OpenCL.
This section presents results of experiments performed to evaluate the overhead introduced by EngineCL when a single kernel is executed in a single device, as is described in Section 4. Figure 3 shows the overhead results in Batel. Each row presents the results of a different device, CPU, Xeon Phi and GPU, while each column corresponds to a benchmark. Four results are shown per benchmark, each one with a different problem size. The ordinate indicates the overhead measured. It should be noted that negative overhead values indicate that running with EngineCL is more efficient (uses less time) than running natively with OpenCL.
Analysing each device separately, it can be observed that the worst results are obtained on the CPU, with an average overhead of 1.08% and a maximum of 2.69% in Ray with the smallest problem size. This is reasonable since EngineCL also runs on the CPU, so it interferes with the execution of the benchmarks, taking computing capacity away from them. Regarding the discrete devices, the Xeon Phi presents the best results, with a negative average overhead of -0.3%, which indicates that, on average, the EngineCL version is more efficient than the native one. Finally, the results achieved with the GPU are also excellent, with an average overhead of 0.3% and a maximum value of 1.26%. The differences between GPU and Xeon Phi are explained by the different implementations of the OpenCL driver and how each is affected by the multi-threaded and optimised architecture of EngineCL.
Figure 4 presents the same values as Figure 3 but evaluated on the APU system. In this case the results are even better. For the CPU, the average overhead is only 0.12% while the worst case is 0.98%, practically negligible. With respect to the integrated GPU, most experiments show small gains rather than losses, resulting in a negative mean overhead of -0.12%.
In summary, we can conclude that EngineCL not only executes kernels on different devices almost without any loss of performance, but in many cases also obtains better performance than the corresponding OpenCL version. Furthermore, the results are very stable across different devices (discrete or integrated), as well as across different benchmarks and problem sizes. This excellent performance, together with its proven usability, makes EngineCL a very powerful tool for exploiting all kinds of heterogeneous systems.
6 Conclusions and Future Work
Given the great relevance of heterogeneous systems in all sectors of computing, it is necessary to provide the community with tools that facilitate their programming, while maintaining the same performance. For this purpose, EngineCL is presented, a powerful OpenCL-based tool that greatly simplifies the programming of applications for heterogeneous systems. This runtime frees the programmer from tasks that require a specific knowledge of the underlying architecture, and that are very error prone, with a great impact on their productivity.
The API provided to the programmer is very simple, thus improving the usability of heterogeneous systems. This statement is corroborated by the exhaustive validation presented, with a large quantity and variety of Software Engineering metrics, achieving excellent results in all of them. In addition, thanks to the careful design and implementation of EngineCL, it obtains slight improvements with respect to the native OpenCL version in many of the experiments carried out. In the rest of the cases, the overhead due to the management performed by EngineCL is negligible, always below 3% in all the cases studied and with an average overhead between 0 and 1%, achieving excellent performance portability.
In the future, it is intended to extend the API to support iterative and multi-kernel executions. Also, load balancing algorithms will be provided and studied as part of the scheduling system to support a suitable co-execution on multiple devices simultaneously.
This work has been supported by the Spanish Ministry of Education (FPU16/03299 grant), the Spanish Science and Technology Commission (TIN2016-76635-C2-2-R), the European Union’s Horizon 2020 research and innovation programme and HiPEAC Network of Excellence (Mont-Blanc project under grant 671697).
- Bandi, R.K. et al.: Predicting maintenance performance using object-oriented design complexity metrics. IEEE Transactions on Software Engineering (2003)
- Copik, M., Kaiser, H.: Using SYCL as an Implementation Framework for HPX.Compute. Proceedings of the 5th Int. Workshop on OpenCL, IWOCL (2017)
- De Souza, C.R. et al.: Automatic evaluation of API usability using complexity metrics and visualizations. 31st Int. Conf. Software Engineering, ICSE (2009)
- del Rio Astorga, D., Dolz, M.F., Fernández, J., García, J.D.: A generic parallel pattern interface for stream and data processing. Concurrency and Computation: Practice and Experience, CCPE (2017)
- Enmyren, J., Kessler, C.W.: SkePU: A multi-backend skeleton programming library for multi-GPU systems. Proc. 4th Int. Workshop on High-Level Parallel Programming and Applications (2010)
- Gaster, B.R. et al.: Heterogeneous Computing with OpenCL - Revised OpenCL 1.2 Edition, Morgan Kaufmann (2013)
- The Khronos Group: SYCL: C++ Single-source Heterogeneous Programming for OpenCL. SYCL 1.2.1 Specification, accessed on Feb 2018
- Heller, T. et al.: HPX – An open source C++ Standard Library for Parallelism and Concurrency. Workshop on Open Source Supercomputing (2017)
- Hoberock, J., Bell, N.: Thrust: A Parallel Template Library for C++ (2009)
- ISO/IEC: Technical Specification for C++ Extensions for Parallelism (2015)
- Moreton-Fernandez, A. et al.: Multi-Device Controllers: A Library To Simplify The Parallel Heterogeneous Programming. Int. J. of Parallel Programming (2017)
- Pérez, B. et al.: Simplifying programming and load balancing of data parallel applications on heterogeneous systems. Proc. of the 9th Workshop on General Purpose Processing using GPU (2016)
- Rama, G.M., Kak, A.: Some structural measures of API usability. Software - Practice and Experience (2013)
- Ruyman Reyes, A.V., Harries, A.: SyclParallelSTL: Implementing ParallelSTL using SYCL. Int. Workshop on OpenCL (2015), accessed on Feb 2018
- Scheller, T., Kühn, E.: Automated measurement of API usability: The API Concepts Framework. Information and Software Technology (2015)
- Steuwer, M., Kegel, P., Gorlatch, S.: SkelCL - A portable skeleton library for high-level GPU programming. IEEE IPDPSW (2011)
- Szuppe, J.: Boost.Compute: A parallel computing library for C++ based on OpenCL. Int. Workshop on OpenCL (2016), accessed on Feb 2018