libhclooc: Software Library Facilitating Out-of-core Implementations of Accelerator Kernels on Hybrid Computing Platforms

Daniel Hanlon, et al., 08/15/2018

Hardware accelerators such as Graphics Processing Units (GPUs), Intel Xeon Phi co-processors (PHIs), and Field-Programmable Gate Arrays (FPGAs) are now ubiquitous in extreme-scale high performance computing (HPC), cloud, and Big data platforms to facilitate execution of workloads that demand high energy efficiency. They present unique interfaces and programming models and therefore pose several limitations, which must be addressed to facilitate execution of large workloads. There is no library providing a unifying interface that allows programmers to write reusable out-of-core implementations of their data-parallel kernels that can run efficiently on different mainstream accelerators such as GPUs, PHIs, and FPGAs. We address this shortage in this paper. We present a library called libhclooc, which provides a unifying interface facilitating out-of-core implementations of data-parallel kernels on the three different mainstream accelerators (GPUs, Intel Xeon Phis, FPGAs). We implement out-of-core matrix-matrix multiplication (MMOOC) using the libhclooc API and demonstrate its superior performance over vendor implementations. We show that it suffers from a maximum overhead of 10%, 4%, and 8% (due to abstraction) compared to the state-of-the-art optimised implementations for the Nvidia K40c GPU, Nvidia P100 PCIe GPU, and Intel Xeon Phi 3120P respectively. We also show that using the libhclooc API reduces the number of lines of code (LOC) by 75%.

I Introduction

Extreme-scale high performance computing (HPC), cloud, and Big data platforms today feature hybrid nodes containing multicore CPU processors and one or more accelerators such as Graphics Processing Units (GPUs), Intel Xeon Phi co-processors (PHIs), and Field-Programmable Gate Arrays (FPGAs) to facilitate execution of workloads that demand high energy efficiency (high performance and low energy consumption).

Executing large problem sizes on these hybrid nodes poses several challenges arising from the tight integration of the accelerators with multicore CPUs via PCI-E communication links and from their disparate streaming interfaces. The challenges are summarized below:

  • Limited main memory of accelerators. The size of main memory in accelerators is small compared to that of the host multicore CPU connected to them. For example, consider the Top500 list of supercomputers [1]. The Tianhe-2 supercomputer, ranked second, is composed of Intel Ivybridge multicore CPUs, which support 768 GB per socket, and Intel Xeon Phi 31S1P accelerators, which provide only 8 GB of main memory. TSUBAME3.0, ranked thirteenth, comprises Intel Broadwell CPUs, which can support 1.54 TB of main memory, and NVIDIA Tesla P100 SXM2 accelerators, which support only 16 GB of main memory. Therefore, to execute large data-parallel workloads using these accelerators, out-of-core (or out-of-card) implementations are necessary.

  • Communication and computation overlap. Accelerators such as GPUs provide advanced hardware support to facilitate overlap of data transfers between host and device and computations on the device. For example, modern Nvidia GPUs (K40, K80, P100, etc) provide three engines: two copy engines, one for host-to-device transfers and another for device-to-host transfers, and a kernel engine. Therefore, libraries aiming to provide efficient out-of-core implementations must take into account the differences in hardware support for effective communication-computation overlap to optimize their software pipelines for out-of-core implementations.

  • Disparate streaming interfaces. Different vendors provide different streaming interfaces for data transfers between the host CPU and accelerators and for kernel executions on the accelerators. Nvidia provides CUDA streams and events [2] for its GPUs; Intel provides offload streams [3] for Intel Xeon Phis, which will soon be replaced by hStreams [4]; OpenCL command queues [5] are typically used for FPGAs. The interfaces exhibit significant differences in their APIs, programming models, and memory management. For example, Intel offload streams offer a combination of API functions and compiler pragmas. This wide disparity in the interfaces means that programmers need to write different implementations of their out-of-core accelerator kernels with little code reuse. This can drastically impact programmer productivity.

There is a severe shortage of software libraries that address these challenges comprehensively. The CUBLAS-XT library [6] provides a set of BLAS routines that support out-of-core operation. MAGMA [7] provides out-of-core dense LU, Cholesky, and QR factorizations. Victream [8] is a directed acyclic graph (DAG) computing framework for out-of-core computations on multiple GPUs.

There is no library providing a unifying interface that allows programmers to write reusable out-of-core implementations of their data-parallel kernels that can run efficiently on different mainstream accelerators such as GPUs, PHIs, and FPGAs. We address this shortage in this paper.

We present a library called libhclooc, which provides a unifying interface facilitating out-of-core implementations for data-parallel kernels on the three different mainstream accelerators (GPUs, Intel Xeon Phis, FPGAs). Its fundamental building blocks are CUDA streams and events that allow concurrent utilization of the copy and execution engines provided in NVidia GPUs [2], Intel offload streams [3] for Intel Xeon Phis, and OpenCL command queues [5] for FPGAs.

We implement out-of-core matrix-matrix multiplication (MMOOC) using the libhclooc API and demonstrate that it outperforms vendor implementations (CUBLAS-XT [6]) and suffers from a maximum overhead of 10%, 4%, and 8% (due to abstraction) compared to the state-of-the-art accelerator-specific optimized implementations for Nvidia K40c GPU, Nvidia P100 PCIe GPU, and Intel Xeon Phi 3120P respectively. We show that using libhclooc API reduces the number of lines of code (LOC) by 75% thereby drastically improving programmer productivity.

To summarize, our main contributions in this paper are:

  • Software library libhclooc that presents a unifying interface for CUDA streams, Intel offload streams, and OpenCL command queues thereby allowing programmers to write efficient reusable out-of-core implementations of their data-parallel kernels for three mainstream accelerators (GPUs, Intel Xeon Phis, FPGAs).

  • Implementation of an out-of-core matrix-matrix multiplication using libhclooc that executes on three different accelerators with an abstraction overhead in the library of 10%. The implementation reduces the number of lines of code (LOC) by 75%.

  • Implementation of an out-of-core matrix-matrix multiplication using libhclooc that outperforms the CUBLAS-XT implementation on Nvidia P100 PCIe GPU by 4x.

The paper is organized as follows. Section II presents related work. Section III contains the overview of interfaces of libhclooc. Section IV presents design and implementation details of libhclooc. Section V describes MMOOC using libhclooc API. Section VI contains the experimental results. Section VII concludes the paper.

II Related Work

We organize our literature survey into three categories. The first category contains research works proposing out-of-core techniques and implementations for accelerator kernels. The second category reviews libraries specifically for out-of-core implementations of accelerator kernels. The final category reviews research proposals for streams supporting execution of scientific kernels using multicore CPUs and accelerators.

II-A Out-of-core Techniques and Implementations for Accelerator Kernels

Gu et al. [9] present an out-of-core implementation of FFT kernel for a single GPU where they overlap kernel computations on the GPU and communications over the PCI-E bus. Mu et al. [10] propose an out-of-core algorithm for LU decomposition.

Ziming et al. [11],[12] propose an out-of-core implementation for matrix multiplication routine (DGEMM) for NVidia GPU. Wu et al. [13] presented an out-of-core dense matrix multiplication implementation for CPU-GPU platforms similar to [11],[12].

Sabne et al. [14] present a computation splitting technique that automatically adjusts the number of pipeline stages to improve the performance of out-of-core implementations on multiple GPUs attached to the same host CPU.

Shirahata et al. [15] present out-of-core techniques for large-scale graph processing applications for heterogeneous GPU-based clusters.

Kabir et al. [16] and Haidar et al. [17] propose out-of-core implementations for large dense singular value decompositions (SVD) for CPU architectures.

Yamazaki et al. [18] present out-of-core algorithms to factorize a symmetric indefinite matrix for CPU and GPU architectures.

Hamid et al. [19] present out-of-core implementations for matrix multiplication for three accelerators (GPU, PHI, FPGA). However, the implementations suffer from heavy code duplication and use disparate interfaces such as CUDA streams [2], PHI streams [3], and OpenCL command queues [5]. The lack of reusable components in the design of these software implementations motivates the design and implementation of our library in this work.

II-B Libraries for Out-of-core Implementations of Accelerator Kernels

The CUBLAS-XT library [6] provides a set of BLAS routines that utilize multiple GPUs connected to the same motherboard. It uses CUDA streams and events [2] to efficiently manage data transfers across the PCI-Express bus and kernel invocations on the GPUs. The routines in the library also support out-of-core operation, where the sizes of the matrices are limited only by the host system memory size.

MAGMA [7] provides out-of-core dense LU, Cholesky, and QR factorizations.

Victream [8] is a directed acyclic graph (DAG) computing framework for out-of-core computations on multiple GPUs. At the heart of Victream is a scheduler that employs locality-aware scheduling and data prefetching for performance optimization.

II-C Stream Libraries Supporting Accelerator Kernel Computations

CUDA streams [2] facilitate efficient overlapping of data transfers (from host to device or device to host) and kernel computations on the device. All device operations (kernel execution, data transfers from host to device or device to host) take place in a stream (“null stream” or default stream if not specified). Since all operations in non-default streams are non-blocking with respect to the host code, to synchronize the host code with operations in a stream, CUDA events are used.

hStreams [4] provides a streaming, task queue abstraction for heterogeneous platforms similar to CUDA streams [2] and OpenCL [5].

Our library, libhclooc, presents a uniform interface and implementation for the different types of streaming interfaces: CUDA streams [2], PHI offload streams [3], and OpenCL command queues [5]. We intend to integrate hStreams in our library in our future work.

From the survey, we can conclude that there is a severe lack of libraries providing a uniform interface for efficiently solving large problem sizes on hybrid platforms containing two or more state-of-the-art accelerators. This is a serious shortcoming for effective use of high performance accelerators in the fields of HPC and Big Data. We address this shortage in this work.

III libhclooc: Overview

Fig. 1: Software organization of libhclooc.

In this section, an overview of the libhclooc library is presented.

The software organization of libhclooc is shown in Figure 1. The class hclStreamFactory contains a factory method to create an instance of the hclStream class for an input device. The classes hclCUDAStream, hclPhiStream, and hclOCLQueue are wrappers around CUDA streams and events, Intel offload streams, and OpenCL command queues, providing a simplified programming model for these interfaces.

The class hclRuntimeFactory contains a factory method to create an instance of the hclRuntime class for an input device. The hclRuntime class provides a uniform interface for memory allocation and data transfers between host and device. The device-specific classes hclCUDARuntime, hclPhiRuntime, and hclOCLRuntime interact with the device-specific classes for streams and events.

The component hclMatrixPartitioner provides an API for partitioning a matrix into blocks that fit in the device memory.
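
The paper does not list the partitioner's internals, so the following is a minimal hypothetical sketch of what such a partitioner can do, assuming the parameter order used in Figure 2 (M, N, K, device memory size, output slice counts h and w) and the fitting constraint described in Section V (two blocks of C in the same column, together with the corresponding slices of A and B, must fit in device memory):

#include <cstddef>

// Hypothetical sketch, not the library's code: choose the number of horizontal
// slices h of A and vertical slices w of B so that the per-step working set
// (two slices of A, one slice of B, and two blocks of C) fits in dMemSize bytes.
bool partitionSketch(std::size_t M, std::size_t N, std::size_t K,
                     std::size_t dMemSize, std::size_t *h, std::size_t *w) {
  for (std::size_t hh = 1; hh <= M; hh++) {
    for (std::size_t ww = 1; ww <= N; ww++) {
      std::size_t elems = 2 * (M / hh) * K            // two horizontal slices of A
                        + K * (N / ww)                // one vertical slice of B
                        + 2 * (M / hh) * (N / ww);    // two blocks of C
      if (elems * sizeof(double) <= dMemSize) {
        *h = hh; *w = ww;                             // smallest partitioning that fits
        return true;
      }
    }
  }
  return false;                                       // cannot be partitioned to fit
}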

We now present the simplified programming model: the runtime interface, memory management, data transfers, and the asynchronous aspects of the library.

III-A Programming Model

The programming model presented by the uniform interface is very explicit. The host and device are separate in the sense that each has its own memory, and the host acts as the controller of the device, providing it with explicit commands. The host must explicitly issue a command to the device to allocate memory, to transfer data to and from the device, to free memory, and to execute actions on the data. This model allows programmers to reason more easily about the behavior and state of the accelerator.

The API available to the host is provided in the class hclRuntime. An instance of this class is created via the hclRuntimeFactory factory method as either an Intel, CUDA, or OpenCL runtime. The runtime is not device specific, only device-type specific, i.e., one runtime object can be used for many devices of the same type.

Further abstractions are provided for devices with hclDevice, for streams with hclStream, and events with hclEvent. These classes are primarily data containers, only storing information related to their abstraction, and are used as parameters for the functions within hclRuntime.

III-B hclRuntime Methods

The available functions in hclRuntime are given below with a brief description of their functionality. Their parameters are not shown for brevity; a usage sketch follows the list.

  • hclMalloc: Allocate memory on a device.

  • hclFree: Free memory on a device.

  • hclGetMemSize: Get the amount of available memory on a device.

  • hclMemCpy: Copy memory synchronously to or from device.

  • hclMemcpyAsync: Copy memory asynchronously to or from device.

  • hclDeviceSynchronize: Block program execution until all preceding tasks on the device are complete.

  • hclStreamSynchronize: Block stream execution until all tasks in the stream have completed.

  • hclWaitEvent: Block stream execution until an event is marked as complete.
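
The following minimal host-side sketch illustrates how these methods fit together; the parameter lists are assumptions based on the descriptions above and on the listing in Figure 2, not documented signatures:

#include <cstddef>
#include <vector>

// Hedged usage sketch; method names follow the list above, argument orders are assumed.
void usageSketch() {
  hclDevice *dev = hclDeviceFactory::create("GPU", 0);   // device tuple {"GPU", 0}
  hclRuntime *rt = hclRuntimeFactory::create(dev);       // CUDA-backed runtime instance

  std::size_t n = 4096;                                  // number of doubles to move
  std::vector<double> hostA(n, 1.0);                     // host buffer
  double *devA = NULL;                                   // device pointer, set by hclMalloc

  if (rt->hclGetMemSize(dev) > n * sizeof(double)) {     // enough room for the in-core path?
    rt->hclMalloc(dev, &devA, n);                        // allocate n doubles on the device
    rt->hclMemCpy(dev, devA, hostA.data(), n, H2D);      // synchronous host-to-device copy
    // ... launch an in-core kernel on dev ...
    rt->hclMemCpy(dev, hostA.data(), devA, n, D2H);      // copy the result back
    rt->hclFree(dev, &devA);                             // release device (and host shadow) memory
  }
}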

III-C Memory Management

The memory management functionality of libhclooc is very straightforward, consisting of hclMalloc and hclFree. This is very similar to CUDA, but with some restrictions.

hclMalloc can only allocate memory of type double. This is due to a restriction of Intel offloads: void pointers cannot be used to allocate memory; the pointer needs a concrete type. More implementations of hclMalloc for different pointer types can be created; double was used because it was what was initially needed.

Further, hclMalloc takes a pointer to a pointer, **ptr, as the parameter for the memory allocation. The inner pointer is used as the device memory pointer after hclMalloc is called. The parameter must always be a pointer to a pointer because of how Intel offloads handle heap allocation: a section of memory must be allocated on the host to be mapped to the corresponding section of memory on the device, and this host section is created for each hclMalloc on Intel devices and freed when hclFree is called. Similarly, hclFree takes a pointer to a pointer, and can also be extended later.

hclGetMemSize is a helper function to get the total amount of free memory on a device. This function helps the programmer decide when to invoke the in-core or the out-of-core kernel.

III-D Data Transfers

Data transfers between host and device again follow a model similar to CUDA. hclMemcpy is used to copy data synchronously, while hclMemcpyAsync is used to copy data asynchronously.

Both functions take an enum, Direction, as a parameter, which specifies whether to copy data from host to device (H2D) or from device to host (D2H).

III-E Asynchronous Functions, Streams, and Events

The interface provides some asynchronous functions to facilitate the Stream Engine component of the library. They are identified by the Async suffix. Asynchronous methods are required to be associated with a stream, and optionally an event. The stream is essentially a queue of commands to be executed on a device. When an asynchronous method is called, it is added to the stream. Events provide a way to check if a specific function has completed.

Device specific streams are encapsulated in a hclStream class. A hclStream instance is created using the hclStreamFactory factory method by providing the device object as input. A stream is directly associated with a device.

Device specific events are encapsulated in a hclEvent class. An uninitialised hclEvent variable can be passed as a final argument to asynchronous functions, which will create and initialise the event. The hclWaitEvent function can then be used to wait for the given event to complete. By utilising multiple streams and events, kernel executions can be overlapped on the device with data transfers between the host and the device.

The hclStreamSynchronize method of the libhclooc runtime object will hold program execution until all queued functions in a stream are finished. Similarly, the hclDeviceSynchronize method will hold execution until all functions across all streams on a device are completed.
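
A minimal sketch of the two-stream pattern follows; the argument shapes follow the listing in Figure 2 (device, direction, stream, event), and the data arguments are elided there and here:

// Hedged sketch (not library-documented code) of overlapping a transfer with a kernel.
void overlapSketch(hclRuntime *rt, hclDevice *dev) {
  hclStream **s = hclStreamFactory::create(dev, 2);   // two streams, used round-robin
  hclEvent *copied;                                    // uninitialised; set by the async call

  rt->hclMemcpyAsync(dev, H2D, s[0], &copied);         // stage the next block on stream 0
  rt->hclWaitEvent(dev, copied, s[1]);                 // stream 1 waits until the copy finishes
  // ... enqueue the kernel that consumes the block on s[1] ...

  rt->hclStreamSynchronize(dev, s[0]);                 // drain each stream at the end
  rt->hclStreamSynchronize(dev, s[1]);
}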

IV libhclooc: Design and Implementation

This section covers the design and implementation of libhclooc. The goal is to design a uniform interface for key parts of three disparate accelerator interfaces: CUDA, Intel offloads, and OpenCL command queues.

IV-A Library Design

The library was designed to be maintainable and extensible, with the interface being familiar to programmers with prior experience writing code for accelerators. It was designed to feel similar to the CUDA C interface, but in the interest of maintainability it was implemented using C++.

IV-A1 Static vs Dynamic Polymorphism

One of the main reasons for using C++ was the availability of polymorphism and inheritance. As the library unifies three disparate interfaces, one interface is declared with multiple possible implementations underneath it.

C++ offers two forms of polymorphism, static and dynamic. Dynamic polymorphism is resolved at runtime while static is resolved at compile time. Dynamic polymorphism was used due to the flexibility of being able to change implementation without requiring recompilation.

The uniform interface is declared in hclRuntime as a pure virtual base class and is therefore not instantiable. It is implemented in the derived classes hclCUDARuntime, hclPhiRuntime, and hclOCLRuntime for CUDA, Intel offloads, and OpenCL command queues respectively. An instance of one of these classes can be created by the hclRuntimeFactory factory method by simply specifying a tuple of a name and an id. This simple inheritance structure is easy to understand, provides runtime flexibility, and makes it straightforward to extend the interface if needed.

Using dynamic polymorphism has a performance impact since determining which implementation to use requires a lookup in the virtual function table. Static polymorphism would not have this impact since the implementation is determined at compile time, but the implementation would be more complicated. A more complicated implementation could make the code harder to maintain, which was judged not worth the slight performance improvement. The major bottleneck is the computation performed on the large problem sizes, which can take many orders of magnitude longer than the time needed to resolve the function implementation. Further, if a specific program does not require the flexibility of changing implementations at runtime, the virtual table lookup can be bypassed by specifying the fully qualified class name for each method call.
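
A condensed sketch of this design follows; the class bodies are simplified and the method signatures are assumptions, not the library's full interface:

#include <cstddef>
#include <string>

class hclRuntime {                                     // pure virtual base: the uniform interface
 public:
  virtual ~hclRuntime() {}
  virtual int hclMalloc(double **ptr, std::size_t length) = 0;
  virtual int hclFree(double **ptr) = 0;
};

class hclCUDARuntime : public hclRuntime {             // backed by CUDA streams and events
 public:
  int hclMalloc(double **ptr, std::size_t length) { /* cudaMalloc(...) */ return 0; }
  int hclFree(double **ptr) { /* cudaFree(...) */ return 0; }
};

class hclRuntimeFactory {
 public:
  // Resolve the implementation at runtime from the device tuple {name, id}.
  static hclRuntime *create(const std::string &name, int id) {
    if (name == "GPU") return new hclCUDARuntime();
    // "PHI" -> hclPhiRuntime, "FPGA" -> hclOCLRuntime (omitted here)
    return 0;
  }
};

// When the backend is fixed at compile time, the virtual table lookup can be
// bypassed with a fully qualified call, e.g. rt.hclCUDARuntime::hclMalloc(&p, n);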

IV-B Implementation

The device-specific interfaces, CUDA, Intel offloads, and OpenCL command queues, are disparate, with different models and behavior. A deep technical understanding of the behavior of the underlying interfaces was required to develop a single unified interface. Producing the model presented earlier in this paper for these interfaces involved some implementation compromises, especially in memory management.

IV-B1 Memory Allocation

In CUDA, the host and device memory are clearly separate, with explicit functions to allocate memory on the device and copy memory to an explicit location on the device. Intel attempts to be more transparent: a memory allocation existing on the host is used as a reference to the related memory on the device. It is possible to implement the appearance of separate memory that must be explicitly managed with Intel offloads, but it requires careful consideration.

Implementing the model of separate host and device memory involved hiding the fact that Intel offloads require an allocation on the host of equal size to the equivalent memory allocation on the card. When hclMalloc is called and the Intel offload implementation is used, a block of host memory of the same size as the required memory block on the device is created. In practice, this means there can be a double memory allocation on the host for any block of memory the programmer wants to copy to the device. However, the block is only allocated in virtual memory; no data is written to it, so it is not physically allocated. The allocation purely acts as a block of address space mapped to memory on the device. The doubling of memory allocation could lead to memory issues since, in theory, virtual memory fills up twice as fast, but since the maximum virtual address space is exceedingly large (theoretically 16 exabytes on a 64-bit system) it should not be an issue.
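
A plausible sketch of this technique using Intel's offload pragmas (LEO) is shown below; the library's actual implementation may differ in detail, and hclPhiMallocSketch is a name used here for illustration only:

#include <cstddef>
#include <cstdlib>

// Hedged sketch: reserve matching host address space and allocate the device
// buffer without copying any data.
int hclPhiMallocSketch(double **ptr, std::size_t n, int micId) {
  // Host shadow allocation; nothing is written to it, so pages are not
  // physically committed. It only anchors the device-side buffer.
  double *p = (double *)std::malloc(n * sizeof(double));
  if (p == NULL) return -1;
  // Allocate the corresponding buffer on the coprocessor without a data copy.
  #pragma offload_transfer target(mic : micId) \
          nocopy(p : length(n) alloc_if(1) free_if(0))
  *ptr = p;
  return 0;
}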

IV-B2 Intel Offload Pragma Limitations

Of the two interfaces, Intel offload pragmas are by far the more limited. Pragmas cannot be used to allocate memory of a generic type on the device; there must be a concrete type. Instead of the programmer specifying the size of the allocation in bytes, a specific pointer type must be provided along with the length of the allocation; the size of the allocation equals the length multiplied by the byte size of the pointed-to type. A void pointer used anywhere in an offload pragma throws a compile-time error. This places restrictions on how generic functions can be. The most common type of variable in scientific kernels is double, which was therefore used where required instead of generic pointers. Functions could be made more generic using templates, but function templates are not compatible with abstract classes. Although CUDA can allocate an arbitrary number of bytes to a void pointer, the unified interface had to conform to the limitations of Intel offloads.

There are also some limitations relating to events and streams. In CUDA, events are added to a stream; an event is not directly linked with a command and is marked as complete when stream execution reaches it. The Intel offload equivalent of a CUDA event, called a signal, must be directly associated with a specific offload. Intel signals do not need to be linked to a stream, whereas CUDA events do. The result is that, in the unified interface, events must be directly associated with specific asynchronous functions, which in turn require a stream. This is not necessarily a big limitation for the library's intended use case; in fact, directly associating events with functions can be easier to reason about.
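
The contrast can be illustrated with the following hedged, simplified sketch (not libhclooc code); the CUDA calls and the offload signal/wait pragmas are standard, while the surrounding functions are invented for illustration:

#include <cstddef>

#ifdef SKETCH_CUDA
#include <cuda_runtime.h>
// CUDA: an event is recorded into a stream and completes when the stream
// reaches it; it is not tied to any particular command.
void cudaStyle(double *dst, const double *src, std::size_t bytes,
               cudaStream_t copyStream, cudaStream_t computeStream) {
  cudaEvent_t evt;
  cudaEventCreate(&evt);
  cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, copyStream);
  cudaEventRecord(evt, copyStream);            // completes when the stream reaches it
  cudaStreamWaitEvent(computeStream, evt, 0);  // another stream can wait on it
}
#else
// Intel offload: a "signal" is a tag tied to one specific offload, not to a stream.
void offloadStyle(double *src, std::size_t n) {
  char done;
  #pragma offload_transfer target(mic : 0) in(src : length(n)) signal(&done)
  #pragma offload_wait target(mic : 0) wait(&done)
}
#endif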

V Out-of-core Matrix-Matrix Multiplication (MMOOC) using libhclooc API

 1  hclDevice *d = hclDeviceFactory::create(name,id);
 2  hclRuntime *r = hclRuntimeFactory::create(d);
 3  hclStream **s = hclStreamFactory::create(d,2);
 4  hclMatrixPartitioner(M,N,K,dMemSize,&h,&w,…);
 5  hclEvent *eA[h * w], *rA[h * w],
 6           *rB[w], *eC[h * w], *rC[h * w];
 7  for (j = 0; j < w; j++) {
 8    for (i = 0; i < h; i++) {
 9      idx=i+j*h; idx1=idx%2;
10      idx2=(idx+1)%2; idx3=idx+1; idx4=idx-1;
11      i_=idx3%h; j_=idx3/h;
12      if (idx == 0) {
13        r->hclMemcpyAsync(d,H2D,s[idx1],&rB[j]);
14        r->hclMemcpyAsync(d,H2D,s[idx1],&rA[idx]);
15        if ((*step) == 0)
16          r->hclMemcpyAsync(d,H2D,s[idx1],&rC[idx]);
17      }
18      if (idx < (h * w - 1)) {
19        r->hclWaitEvent(d,rB[j],s[idx1]);
20        r->hclWaitEvent(d,rA[idx],s[idx1]);
21        r->hclWaitEvent(d,rC[idx],s[idx1]);
22        r->hclDgemmAsync(d,…,s[idx1],&eA[idx]);
23        if (idx > 0)
24          r->hclWaitEvent(d,eA[idx4],s[idx2]);
25        r->hclMemcpyAsync(d,H2D,s[idx2],&rA[idx3]);
26        if ((*step) == 0) {
27          if (idx > 0)
28            r->hclWaitEvent(d,eC[idx4],s[idx2]);
29          r->hclMemcpyAsync(d,H2D,s[idx2],&rC[idx3]);
30        } else {
31          if (idx > 0) {
32            r->hclWaitEvent(d,eC[idx4],s[idx2]);
33            r->hclMemcpyAsync(d,H2D,s[idx2],&rC[idx3]);
34          }
35        }
36        if (i == (h - 1)) {
37          r->hclWaitEvent(d,eA[idx],s[idx2]);
38          r->hclWaitEvent(d,eA[idx4],s[idx2]);
39          r->hclMemcpyAsync(d,H2D,s[idx2],&rB[j_]);
40        }
41        if (((*step) == nsteps) || (idx < (h*w-2)))
42          r->hclMemcpyAsync(d,D2H,s[idx1],&eC[idx]);
43      } else {
44        r->hclWaitEvent(d,rB[j],s[idx1]);
45        r->hclWaitEvent(d,rA[idx],s[idx1]);
46        r->hclWaitEvent(d,rC[idx],s[idx1]);
47        r->hclDgemmAsync(d,…,s[idx1],&eA[idx]);
48        if ((*step) == nsteps) {
49          r->hclMemcpyAsync(d,D2H,s[idx1],&rC[idx3]);
50          r->hclWaitEvent(d,rC[idx3],s[idx1]);
51        }
52      }
53    }
54  }
55  r->hclStreamSynchronize(d, s[0]);
56  r->hclStreamSynchronize(d, s[1]);
57  (*step)++;
Fig. 2: Out-of-core matrix-matrix multiplication of two matrices A and B of dimensions M × K and K × N respectively using the libhclooc API.

Figure 2 shows the out-of-core implementation of matrix-matrix multiplication using the libhclooc API. The implementation computes C = α × A × B + β × C, where A, B, and C are matrices of dimensions M × K, K × N, and M × N respectively, and α and β are constant floating-point numbers.

The inputs to the implementation are the matrices A, B, and C, the scalars α and β, the tuple representing the device d, the memory size of the accelerator dMemSize, and the number of invocations of the implementation given by nsteps. The parameter nsteps represents the fact that this out-of-core implementation could be called as a subroutine, for example in parallel matrix-matrix multiplication algorithms such as the Scalable Universal Matrix Multiplication Algorithm (SUMMA) [20] and Hierarchical SUMMA (HSUMMA) [21], which contain a number of main steps equal to nsteps.

The implementation is executed on a device represented by a tuple {name, id}. A GPU with ID 0 is represented by the tuple {“GPU”,0}, a Xeon Phi device with ID 0 by the tuple {“PHI”,0}, and an FPGA device with ID 0 by the tuple {“FPGA”,0}.

The handle to the libhclooc runtime is created in Line 2 using the device object as input. Two streams are then created in Line 3 using the stream factory method. Since creating a new stream has some overhead, we create and reuse just two streams in round-robin order so that while one stream is performing computation, the other is transferring data across the PCI-E link.

The function hclMatrixPartitioner splits matrix A into h equal horizontal slices, matrix B into w equal vertical slices, and matrix C into h × w equal rectangular blocks, ensuring that the data required for updating any two blocks of C in the same column is small enough to fit in the accelerator's memory given by the size dMemSize.

To synchronize the computations on the device and the transfers of slices of A and B and blocks of C, five sets of events are created (Lines 5-6).

The implementation consists of h × w main steps. In each main step, one block of C is computed. Lines 12-17 contain the data transfers of a slice of A, a slice of B, and a block of C using stream 0 from host to device (represented by the macro H2D). Then, the stream waits on the three events rB, rA, and rC (Lines 19-21), which are signaled upon completion of the corresponding data transfers. When the events are signaled, the in-core DGEMM kernel (Line 22) is invoked on the sub-matrices of A, B, and C.

Fig. 3: Decomposition of matrix A into 4 horizontal slices, matrix B into 2 vertical slices, and matrix C into 8 (4 × 2) blocks.
Fig. 4: Pipeline structure for the out-of-core matrix-matrix implementation of the sample matrices shown in Figure 3. Concurrent data transfers in the two directions are represented by S() and R() calls, overlapping with kernel executions (represented as DGEMM). Events are used for synchronization of the data transfers.

To illustrate the software pipeline employed in the out-of-core implementation, we consider a simple example containing three matrices A, B, and C with a total of 80 elements. We assume the main memory size of the accelerator to be 44 elements. Since the total workload size of the matrices (80 elements) exceeds 44, the matrices are partitioned as shown in Figure 3.

The software pipeline is composed of five stages described below:

  • S(Ai): sending the i-th horizontal slice of matrix A from host to device.

  • S(Bj): sending the j-th vertical slice of matrix B from host to device.

  • S(Cij): sending a rectangular block of matrix C from host to device.

  • DGEMM: vendor-supplied optimized in-core DGEMM invocation computing the matrix product on the blocks resident on the device.

  • R(Cij): sending the updated block of C back from device to host.

To make sure data stored in device buffers is not overwritten until the kernel executions that operate on that data have completed, events are created for each sub-matrix of A and C. As shown in Figure 4, recording an event marks the operation associated with a block as complete, and waiting on the event makes the process wait until the event associated with that block has been recorded.

This software pipeline uses two streams and therefore performs very efficiently on Nvidia GPUs, which provide separate copy engines for host-to-device and device-to-host transfers and a kernel engine for kernel invocations.

One can see that there is a lot of synchronization code using events to overlap computations and communications correctly and effectively. However, this synchronization pattern is common and can be reused for out-of-core implementations of other data-parallel kernels. In our future work, we will consider the design and development of a pattern language (a domain specific language (DSL)) that programmers can use to specify the software pipeline at a higher level of abstraction. The compiler for this language will then generate all the boilerplate synchronization code, thus removing this burden from the programmers.

VI Experimental Results

Intel Haswell E5-2670V3
  No. of cores per socket: 12
  Socket(s): 2
  CPU MHz: 1200.402
  L1d cache, L1i cache: 32 KB, 32 KB
  L2 cache, L3 cache: 256 KB, 30720 KB
  Total main memory: 64 GB DDR4
  Memory bandwidth: 68 GB/sec
NVIDIA K40c
  No. of processor cores: 2880
  Total board memory: 12 GB GDDR5
  L2 cache size: 1536 KB
  Memory bandwidth: 288 GB/sec
Intel Xeon Phi 3120P
  No. of processor cores: 57
  Total main memory: 6 GB GDDR5
  Memory bandwidth: 240 GB/sec
TABLE I: HCLServer1: Specifications of the Intel Haswell multicore CPU, Nvidia K40c, and Intel Xeon Phi 3120P.

We perform our experiments on two research servers, HCLServer1 and HCLServer2. HCLServer1 contains an Intel Haswell multicore CPU, an Nvidia K40c GPU, and an Intel Xeon Phi 3120P, whose specifications are given in Table I. The FPGA in this server is the AlphaData ADM-PCIE-7V3 accelerator card. This device features a Xilinx Virtex 7 690T FPGA with 16 GB of DDR3 DRAM at 1333 MT/s. The FPGA operates at a maximum TDP of 25 W, in sharp contrast to the other accelerators: the GPU and Xeon Phi have TDPs of 235 W and 300 W respectively. The OS on this server is CentOS 7.2.1511.

Intel Xeon Gold 6152
  Socket(s): 1
  Cores per socket: 22
  L1d cache, L1i cache: 32 KB, 32 KB
  L2 cache, L3 cache: 256 KB, 30976 KB
  Main memory: 96 GB
NVIDIA P100 PCIe
  No. of processor cores: 3584
  Total board memory: 12 GB CoWoS HBM2
  Memory bandwidth: 549 GB/sec
TABLE II: HCLServer2: Specifications of the Intel Skylake multicore CPU and Nvidia P100 PCIe.

HCLServer2 contains an Intel Skylake multicore CPU and an Nvidia P100 PCIe GPU, whose specifications are given in Table II. The OS on this server is Ubuntu 16.04 LTS.

To obtain an experimental data point, the application is executed repeatedly until the sample mean lies in the 95% confidence interval and a precision of 0.025 (2.5%) has been achieved. For this purpose, Student’s t-test is used assuming that the individual observations are independent and their population follows the normal distribution. We verify the validity of these assumptions using Pearson’s chi-squared test. When we mention a single number such as floating-point performance (in TFLOPs), it is assumed that we are referring to the sample mean determined using the Student’s t-test.
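
A hedged sketch of this adaptive measurement loop is shown below (it is not the authors' benchmarking harness); for brevity it uses the normal quantile 1.96 in place of the exact Student's t critical value, a close approximation once the sample is reasonably large:

#include <cmath>
#include <cstddef>
#include <vector>

// Repeat a timed run until the 95% confidence interval of the sample mean is
// within the requested precision (2.5% by default), then return the mean.
double measureMean(double (*runOnce)(), double precision = 0.025,
                   std::size_t minReps = 5, std::size_t maxReps = 1000) {
  std::vector<double> x;
  double mean = 0.0;
  for (std::size_t i = 0; i < maxReps; ++i) {
    x.push_back(runOnce());                          // one timed execution
    double n = static_cast<double>(x.size());
    mean = 0.0;
    for (double v : x) mean += v;
    mean /= n;
    if (x.size() < minReps) continue;
    double var = 0.0;
    for (double v : x) var += (v - mean) * (v - mean);
    var /= (n - 1.0);                                // sample variance
    double halfWidth = 1.96 * std::sqrt(var / n);    // ~95% confidence half-interval
    if (halfWidth <= precision * mean) break;        // CI within 2.5% of the mean
  }
  return mean;                                       // the reported sample mean
}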

We perform two sets of experiments. In the first set, we compare the performance of our MMOOC implementation written using the libhclooc API (Figure 2) with the state-of-the-art accelerator-specific implementations [19] (ZZGemmOOC for Nvidia CUDA [22] and XeonPhiOOC for Intel Xeon Phi [23]) on HCLServer1. We do not report any results for libhclooc executing MMOOC on the FPGA since the performance of the basic in-card OpenCL matrix-matrix multiplication implementation is very poor.

In the second set, we compare the performance of our MMOOC implementation with the state-of-the-art accelerator-specific implementation (ZZGemmOOC for Nvidia CUDA [22]) and CUBLAS-XT [6] using the latest generation Nvidia P100 PCIe GPU on HCLServer2.

The performance (execution speed) of matrix multiplication for two dense matrices of size N × N is calculated as (2 × N³) / t, where t is the execution time in seconds, including the data transfers between host and device. The range of problem sizes (N) tested covers both in-core and out-of-core executions. For Nvidia GPUs, libhclooc switches to out-of-core operation when N exceeds 22528. For the Intel Xeon Phi, libhclooc switches to out-of-core operation when N exceeds 16384.
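
A tiny helper makes the metric concrete; the example values in the trailing comment are hypothetical and used for illustration only:

// Performance in TFLOPs: 2*N^3 floating-point operations over t seconds.
double tflops(double N, double t) {
  return (2.0 * N * N * N) / t / 1e12;
}
// e.g. tflops(22528.0, 20.6) is roughly 1.11, i.e. about 1.11 TFLOPs for a
// hypothetical 20.6-second run at N = 22528.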

Figure 5(a) shows the comparison of performances of MMOOC executed using the Nvidia K40c GPU on HCLServer1. The peak double-precision floating point performance of the Nvidia K40c is 1.43 TFLOPs. The ZZGemmOOC implementation reached 1.16 TFLOPs at its peak. The peak double-precision floating point performance of libhclooc was 1.11 TFLOPs. The results therefore show libhclooc performing at 96% of the peak performance of the ZZGemmOOC implementation. There is also a 0% loss in performance when the program transitions to out-of-core execution. Both the ZZGemmOOC and libhclooc implementations outperform Nvidia's CUBLAS-XT implementation (by more than 2.3x).

Figure 5(b) shows the comparison of performances of MMOOC on the Intel Xeon Phi 3120P on HCLServer1. The peak double-precision floating point performance of libhclooc is 667 GFLOPs compared to 549 GFLOPs for the XeonPhiOOC implementation. One can see that libhclooc outperforms the XeonPhiOOC implementation.

Fig. 5: a). Comparison of performances of CUBLAS-XT, ZZGemmOOC, and libhclooc for MMOOC on Nvidia K40c on HCLServer1. b). Comparison of performances of XeonPhiOOC and libhclooc for MMOOC on Intel Xeon Phi 3120P on HCLServer1. c). Comparison of performances of CUBLAS-XT, ZZGemmOOC, and libhclooc for MMOOC on Nvidia P100 PCIe on HCLServer2. Green line shows the transition point from in-core to out-of-core execution.

It can be assumed that the majority of this loss in performance is due to the abstraction overhead. At least part of this overhead is due to the use of dynamic polymorphism within the library.

The erratic results on the Intel Xeon Phi could be attributed to a few things. When creating an offload stream for a Xeon Phi, the number of threads the stream will use must be set. In the libhclooc implementation with two streams, the threads are split in half, ideally with no overlap between the threads assigned to each stream. The thread assignment behavior is set using environment variables, but the optimal settings are difficult to discern: what is optimal for smaller matrix sizes causes the program to crash or suffer significant performance issues for larger matrix sizes. Our testing revealed that the two-stream implementation is simply not optimal for Intel Xeon Phis. A more optimal setup may involve a single stream, which would allow each operation to utilize all of the threads, running parallel calculations at full capacity.

XeonPhiOOC [23] now uses an implementation with a single stream instead of two. Its peak double-precision floating point performance is 725 GFLOPs; libhclooc therefore achieves 92% of this peak. When we reimplemented MMOOC using one stream, these performance drops were not observed. The results therefore suggest that the disparate capabilities for communication-computation overlap provided by accelerators must be taken into account when designing and developing out-of-core libraries.

The memory on the Intel Xeon Phi also exhibited some unexplained behavior. After one iteration of the out-of-core kernel, the reported available memory would not return to full capacity and sometimes decreased drastically after further iterations. The next time the program was run, the memory was back to full availability. This suggests that the available memory reported by the tool (Intel MPSS [24]) is incorrect. The experiments were run with an override that ignores the reported decrease in memory, as otherwise each iteration would use different-sized partitions due to the differing reported memory sizes.

Figure 5(c) shows the comparison of performances of MMOOC on the Nvidia P100 GPU on HCLServer2. The peak double-precision floating point performance of the Nvidia P100 is 4.7 TFLOPs. The accelerator-specific optimized ZZGemmOOC implementation reached 3.90 TFLOPs at its peak (83% of peak). The peak double-precision floating point performance of libhclooc is 3.5 TFLOPs. The results therefore show libhclooc performing at 90% of the peak performance of the ZZGemmOOC implementation. There is also a 0% loss in performance when the program transitions to out-of-core execution, which is shown by the green line in the graph. Both the ZZGemmOOC and libhclooc implementations outperform Nvidia's CUBLAS-XT implementation (by more than 4x).

Implementing MMOOC using the libhclooc API required 75% fewer lines of code (LOC). The libhclooc interface is far more concise and straightforward, especially when compared to using Intel offload pragmas, fulfilling the objective of an easier-to-use interface.

VII Conclusions and Future Work

Hardware accelerators are increasingly utilised in extreme-scale high performance computing (HPC), cloud, and Big data platforms to facilitate execution of workloads that demand high energy efficiency (high performance and low energy consumption). These accelerators have unique interfaces and programming models, and have limitations that must be addressed to facilitate execution of large workloads.

This paper presents an implementation of libhclooc, which provides a uniform interface for CUDA streams and events, Intel offloads, and OpenCL command queues, along with a reference implementation of efficient out-of-core matrix-matrix multiplication.

The uniform interface presents a more straightforward programming model. It is more concise, less complex, and easier to understand. Its functions are very explicit, allowing easier reasoning about the state of the accelerator. The interface allows programmers to write code once and execute it on three different types of accelerators, greatly reducing the amount of code required: 75% less code is written compared to the three state-of-the-art accelerator-specific implementations.

In terms of performance, the current implementation has an impact on peak performance of 10% for GPUs and 8% for Intel Xeon Phis when compared to the state-of-the-art accelerator-specific implementations. The out-of-core matrix multiplication presented shows a 0% performance loss when the workload exceeds the memory capacity of the device; the performance impact comes from the overhead of wrapping multiple interfaces. Our immediate goal is to find ways to reduce this abstraction overhead.

Intel has developed a new library for heterogeneous computing named hStreams [4]. We plan to add support for this new library in libhclooc to replace Intel offloads entirely.

We plan to add full support for level-3 BLAS kernels in our future work. Furthermore, we plan to provide out-of-core factorizations (LU, QR, Cholesky) that use the out-of-core matrix-matrix multiplication (DGEMM) as a fundamental building block.

We will also look at developing extensions of libhclooc for facilitating programming out-of-core implementations of accelerator kernels for multi-GPU platforms.

The software implementation of libhclooc presented in this paper is located at [25].

Acknowledgment

This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number 14/IA/2474.

References

  • [1] Top500, “Top 500. the list - november 2017,” 2017. [Online]. Available: https://www.top500.org/lists/2017/11/
  • [2] NVIDIA. (2016) CUDA C Programming Guide. [Online]. Available: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
  • [3] Intel. (2017) Programming for Intel MIC architecture. [Online]. Available: https://software.intel.com/en-us/node/684368
  • [4] C. J. Newburn, G. Bansal, M. Wood, L. Crivelli, J. Planas, A. Duran, P. Souza, L. Borges, P. Luszczek, S. Tomov, J. Dongarra, H. Anzt, M. Gates, A. Haidar, Y. Jia, K. Kabir, I. Yamazaki, and J. Labarta, “Heterogeneous streaming,” in 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2016, pp. 611–620.
  • [5] Khronos OpenCL Registry. (2017) OpenCL Command Queues. [Online]. Available: https://www.khronos.org/registry/OpenCL/specs/opencl-2.2.pdf
  • [6] CUBLAS-XT, “CUBLAS-XT: Multi-GPU version of CUBLAS library supporting out-of-core routines,” 2018. [Online]. Available: https://developer.nvidia.com/cublas
  • [7] S. Tomov, J. Dongarra, and M. Baboulin, “Towards dense linear algebra for hybrid GPU accelerated manycore systems,” Parallel Computing, vol. 36, no. 5-6, pp. 232–240, Jun. 2010.
  • [8] J. Suzuki, Y. Hayashi, M. Kan, S. Miyakawa, T. Takenaka, T. Araki, and M. Kitsuregawa, “Victream: Computing framework for out-of-core processing on multiple gpus,” in Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies, ser. BDCAT ’17.   ACM, 2017.
  • [9] L. Gu, J. Siegel, and X. Li, “Using GPUs to compute large out-of-card FFTs,” in Proceedings of the International Conference on Supercomputing, ser. ICS ’11.   ACM, 2011, pp. 255–264.
  • [10] X. Mu, H.-X. Zhou, K. Chen, and W. Hong, “Higher order method of moments with a parallel out-of-core LU solver on GPU/CPU platform,” IEEE Transactions on Antennas and Propagation, vol. 62, no. 11, pp. 5634–5646, 2014.
  • [11] Z. Zhong, V. Rychkov, and A. Lastovetsky, “Data partitioning on heterogeneous multicore and Multi-GPU systems using functional performance models of Data-Parallel applications,” in 2012 IEEE International Conference on Cluster Computing (Cluster 2012), 24-28 September 2012, pp. 191–199.
  • [12] Z. Zhong, “Optimization of Data-Parallel scientific applications on highly heterogeneous modern HPC platforms,” Ph.D. dissertation, University College Dublin, 2014.
  • [13] J. Wu and J. Jaja, “Achieving native GPU performance for out-of-card large dense matrix multiplication,” Parallel Processing Letters, vol. 26, no. 02, p. 1650007, 2016.
  • [14] A. Sabne, P. Sakdhnagool, and R. Eigenmann, “Scaling large-data computations on multi-gpu accelerators,” in Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, ser. ICS ’13.   ACM, 2013.
  • [15] K. Shirahata, H. Sato, and S. Matsuoka, “Out-of-core gpu memory management for mapreduce-based large-scale graph processing,” in 2014 IEEE International Conference on Cluster Computing (CLUSTER), 2014.
  • [16] K. Kabir, A. Haidar, S. Tomov, A. Bouteiller, and J. Dongarra, “A framework for out of memory svd algorithms,” in High Performance Computing, J. M. Kunkel, R. Yokota, P. Balaji, and D. Keyes, Eds.   Springer International Publishing, 2017.
  • [17] A. Haidar, K. Kabir, D. Fayad, S. Tomov, and J. Dongarra, “Out of memory svd solver for big data,” in 2017 IEEE High Performance Extreme Computing Conference (HPEC), 2017, pp. 1–7.
  • [18] I. Yamazaki, S. Tomov, and J. Dongarra, “Non-gpu-resident symmetric indefinite factorization,” Concurr. Comput. : Pract. Exper., vol. 29, no. 5, Mar. 2017.
  • [19] H. Khaleghzadeh, Z. Zhong, R. Reddy, and A. Lastovetsky, “Out-of-core implementation for accelerator kernels on heterogeneous clouds,” The Journal of Supercomputing, vol. 74, no. 2, Feb 2018.
  • [20] R. A. van de Geijn and J. Watts, “SUMMA: scalable universal matrix multiplication algorithm,” Concurrency: Practice and Experience, vol. 9, no. 4, pp. 255–274.
  • [21] J. N. Quintin, K. Hasanov, and A. Lastovetsky, “Hierarchical parallel matrix multiplication on large-scale distributed memory platforms,” in 2013 42nd International Conference on Parallel Processing, Oct 2013.
  • [22] H. Khaleghzadeh, Z. Zhong, R. Reddy, and A. Lastovetsky., “ZZGemmOOC: Multi-GPU out-of-core routines for dense matrix multiplication,” 2017. [Online]. Available: https://git.ucd.ie/hcl/zzgemmooc.git
  • [23] H. Khaleghzadeh, R. Reddy, Z. Zhong, and A. Lastovetsky., “XeonPhiOOC: Out-of-core package for out-of-core DGEMM on Xeon Phi,” 2017. [Online]. Available: https://git.ucd.ie/manumachu/xeonphiooc.git
  • [24] Intel Corporation, “Intel manycore platform software stack (Intel MPSS) user’s guide,” 2018. [Online]. Available: http://registrationcenter.intel.com/irc_nas/8495/mpss_users_guide.pdf
  • [25] D. Hanlon and R. Reddy, “libhclooc: Software library facilitating out-of-core implementations of accelerator kernels on hybrid computing platforms,” 2018. [Online]. Available: git@git.ucd.ie:manumachu/libhclooc.git