Accelerating BLAS on Custom Architecture through Algorithm-Architecture Co-design

10/20/2016 ∙ by Farhad Merchant, et al. ∙ Nanyang Technological University

Basic Linear Algebra Subprograms (BLAS) play a key role in high performance and scientific computing applications. Experimentally, yesteryear multicores and General Purpose Graphics Processing Units (GPGPUs) are capable of achieving only up to 15% and 57% of the theoretical peak performance, respectively, for compute bound operations like Double/Single Precision General Matrix Multiplication (XGEMM). For bandwidth bound operations like Single/Double precision Matrix-vector Multiplication (XGEMV) the performance is merely 5% to 7% of the theoretical peak. Achieving performance in BLAS requires moving away from conventional wisdom and evolving towards a customized accelerator tailored for BLAS through algorithm-architecture co-design. In this paper, we present acceleration of Level-1 (vector operations), Level-2 (matrix-vector operations), and Level-3 (matrix-matrix operations) BLAS through algorithm-architecture co-design on a Coarse-grained Reconfigurable Architecture (CGRA). We choose the REDEFINE CGRA as a platform for our experiments since REDEFINE can be adapted to a domain of interest through tailor-made Custom Function Units (CFUs). For efficient sequential realization of BLAS, we present the design of a Processing Element (PE) and perform micro-architectural enhancements in the PE to achieve up to 74% of the theoretical peak performance of the PE in DGEMM and 40% in double precision inner product (DDOT). We attach this PE to the REDEFINE CGRA as a CFU and show the scalability of our solution. Finally, we show a performance improvement of 3-140x in the PE over commercially available Intel micro-architectures, ClearSpeed CSX700, FPGA, and Nvidia GPGPUs.


1 Introduction

Several engineering and scientific applications require the solution of dense linear systems of equations and linear least squares problems, where matrix factorizations like LU, QR, and Cholesky play a pivotal role. Traditionally, the routines for these factorizations that are part of the Linear Algebra Package (LAPACK) are written as a series of Basic Linear Algebra Subprograms (BLAS) calls [1][2]. A pictorial representation of the double precision QR factorization routines DGEQR2 and DGEQRF, which are part of LAPACK, is shown in figure 1.

Fig. 1: DGEQR2 and DGEQRF Routines

In the pictorial representation it can be observed that DGEQR2 is dominated by matrix-vector operations (DGEMV in BLAS) and DGEQRF is dominated by DGEQR2 and matrix-matrix operations (DGEMM in BLAS). Our experiments for DGEQR2 on an Intel Core i7, together with observations using Intel VTune™, suggest that, for the matrix sizes considered, DGEMV executes for 99% of the total time of DGEQR2 while the double precision inner product (DDOT) executes for hardly 1% of the total time. Similarly, DGEQRF is dominated by DGEMM, which runs for 99% of the total time of DGEQRF while DGEQR2 runs for 1% of the time. Similar observations can be made for routines like XGETRF (double/single precision LU factorization) and XPBTRF (double/single precision Cholesky factorization). Considering the importance of BLAS in LAPACK, accelerating BLAS is arguably one of the most interesting research problems.

For acceleration of BLAS, a library based approach is typically adopted. Based on the reference BLAS and LAPACK available on Netlib, several libraries have been developed: Intel Math Kernel Library (MKL), IBM's Engineering and Scientific Subroutine Library (ESSL), AMD Core Math Library (ACML), Nvidia's CUDA Linear Algebra (CULA), where CULA Dense targets Dense Linear Algebra (DLA) and CULA Sparse targets Sparse Linear Algebra (SLA), and cuBLAS, which is yet another CUDA Basic Linear Algebra Subprograms implementation. There are also several open source packages for multicore and General Purpose Graphics Processing Unit (GPGPU) based realizations, like Parallel Linear Algebra Software for Multicore Architectures (PLASMA) and Matrix Algebra on GPU and Multicore Architectures (MAGMA), that use BLAS as a basic building block. All of these packages are developed to realize DLA computations on multicores and GPGPUs in the most efficient way. PLASMA and MAGMA incorporate tiled algorithms that are capable of exploiting the memory hierarchy efficiently [3][4]. Despite all the efforts directed towards acceleration of DLA computations, the performance attained by yesteryear platforms is as low as 15-17% of the theoretical peak performance on multicores and 55-57% of the theoretical peak performance on GPGPUs, at 65W and 240W power consumption respectively. Considering the inability of GPGPU and multicore architectures to exploit the parallelism available in BLAS, we recommend algorithm-architecture co-design for BLAS as a solution for efficient realization of DLA. The performance of several recent realizations is discussed in detail in section 2.

Recently, Coarse-grained Reconfigurable Architectures (CGRAs) have gained popularity due to their power-performance and flexibility [5][6][7]. The performance advantage of CGRAs is attained by supporting a selected number of data-paths out of all possible data-paths, and hence they occupy a middle ground between Application Specific Integrated Circuits (ASICs) and Field Programmable Gate Arrays (FPGAs) [8][9][10][11]. CGRAs like REDEFINE have the special feature that they can be customized for application domains where several data-paths belonging to a particular domain of interest are realized as a reconfigurable ASIC [12]. In REDEFINE, several Tiles are connected through a Network-on-Chip (NoC), and Custom Function Units (CFUs) tailored for a particular domain decide the performance of the overall system for that application domain [12][13]. REDEFINE is shown in figure 11(k) along with several Tiles in a simulation environment.

Major contributions in this paper are as follows:

  • Firstly, we present an evaluation of legacy BLAS on off-the-shelf Intel/AMD processors and Nvidia GPGPUs, with an analysis based on Cycles-per-Instruction (CPI) and Gflops/watt (Giga flops per watt). Through detailed experiments it is shown that, even with the best efforts, the performance achieved for BLAS on Intel/AMD processors and Nvidia GPGPUs is between 0.02 and 0.25 Gflops/watt

  • We present Directed Acyclic Graph (DAG) based analysis of representative routines of Level-1, Level-2, and Level-3 BLAS and discuss the available parallelism and possible data locality in these routines. We also identify macro operations and realize them on a Reconfigurable Data-path (RDP). Based on our analysis, we arrive at the design of a Processing Element (PE)

  • Several architectural enhancements are performed in the PE presented in [14] for improving the throughput of BLAS by exploiting parallelism and data locality. These enhancements result in an efficient realization of sequential BLAS in the PE. In this exposition, we have extended the scope of the experiments to additional matrix sizes to bring more clarity on the saturation in the attained performance after each enhancement

  • It is shown that through algorithm-architecture co-design we are able to break the saturation point with each enhancement and improve the overall performance of BLAS in the PE. With each enhancement, we show that we are able to push the saturation point towards the theoretical peak performance of the PE at very high energy efficiency

  • We attach the PE to the Routers in REDEFINE for parallel realization of BLAS and show algorithmic and architecture scalability

The organization of the paper is as follows: In section 2, some of the recent multicore, GPGPU, and custom realizations of BLAS are discussed. In section 3, we present legacy BLAS realizations on multicore and GPGPU platforms along with CPI and energy efficiency analyses of these realizations. In section 4, DAG based analysis of Level-1, Level-2, and Level-3 BLAS is presented and we derive preliminary specifications of a PE. Architectural enhancements in the PE for improvement in BLAS throughput, and a parallel realization of BLAS on REDEFINE where we attach the PE as a CFU, are presented in section 5, and the work is summarized in section 6.

2 Related Work

Over the years there have been several efficient realizations of BLAS due to its applicability in high performance scientific applications. In this section, we survey several multicore, GPU based, and custom realizations of BLAS. We consider FPGA based realizations as custom realizations.

2.1 Software Packages for Multicore Platforms

The first software library for linear algebra computations, LINPACK, was developed in the 1970s and early 1980s [15]. Subsequently, LINPACK, which used Level-1 BLAS as a basic building block, was superseded by LAPACK, which uses Level-3 BLAS as a basic building block [2]. In recent years, with the arrival of multicore architectures, there have been several advancements in the parallel realization of LAPACK. One such effort is PLASMA, which performs computations on multicore architectures with the help of the dynamic scheduler Queuing and Runtime for Kernels (QUARK). PLASMA creates a pipeline model for parallel execution by dynamic scheduling of BLAS kernels on the multicore platform [16][3]. The Formal Linear Algebra Method Environment (FLAME) focuses on issues related to the programming of linear algebra; the focus of the FLAME project is to automatically generate efficient linear algebra codes for the underlying platform [17][18]. Under the umbrella of the FLAME project, the BLAS-like Library Instantiation Software (BLIS) focuses on rapid scheduling of BLAS-like kernels on multicore architectures. Automatically Tuned Linear Algebra Software (ATLAS) is a mature open source package that generates BLAS for the underlying platform [19][20]. ATLAS relies on legacy BLAS for generation of efficient code, where several parameters are tweaked to suit the underlying platform. OpenBLAS is another open source package that focuses on efficient realization of DLA computations [21][22]. OpenBLAS relies on GotoBLAS for its performance, where GotoBLAS is a set of assembly programs written for DLA computations. A major shortcoming of packages like LAPACK, PLASMA, BLIS, ATLAS, and OpenBLAS is the lack of support from the underlying platform, resulting in 15-20% of the theoretical peak performance.

2.2 GPU Based Realizations

GPUs, originally designed for graphics processing, are also highly suitable for general purpose computing. Several packages have been developed to perform BLAS functionality efficiently on GPUs. The most prominent of them is the MAGMA software package [4]. MAGMA relies on MAGMA BLAS for its performance, where the performance of MAGMA DGEMM is observed to be 57% of the peak performance of a Tesla C2050 GPU with a theoretical peak of 512 Gflops for double precision. KAUST BLAS (KBLAS) is one of the most recent and ongoing research projects at KAUST. KBLAS internally relies on cuBLAS, developed by Nvidia, for its performance on GPUs. BLASX, presented in [23], focuses on optimization of Level-3 BLAS in a multi-GPU environment. BLASX minimizes global communication through a two level hierarchical tile cache structure and achieves 92.68% of the in-core cuBLAS DGEMM performance. BLASX also contains better load balancing techniques compared to MAGMA and cuBLAS. Despite the elegant scheduling technique and efficient exploitation of the memory hierarchy in BLASX, the performance achieved by BLASX is limited by cuBLAS DGEMM. In [24], the cache memory requirement for achieving superlinear speed-up in XGEMM is studied in detail. The study presented in [24] does not mention whether the data type is single precision or double precision. All the GPU based realizations of BLAS fail to achieve high performance due to lack of support for GEMM primitives.

2.3 Custom Realizations

Customized accelerators are a class of architectures that are tuned for low energy and area while sustaining high throughput for a domain of interest [25][26]. The Cell Broadband Engine (CBE) from International Business Machines (IBM) is a high performance architecture designed around a PowerPC core [27]. Due to the energy efficiency of CBE, it is viewed as an ideal platform for scientific computing [28]. ClearSpeed's CSX architecture is the backbone of the ClearSpeed CSX600 and CSX700 processors. These processors have very high energy efficiency and operate at 12 Watts with a theoretical peak of 96 Gflops [29][30][31]. A major shortcoming of ClearSpeed's CSX and CBE architectures is their low Gflops/watt and Gflops/mm² [31].

There have been several attempts at viewing Field Programmable Gate Arrays (FPGAs) as high performance computing engines [32][33][34]. Mostly, FPGAs are envisioned as high performance co-processors to a programmable host processor for compute intensive applications [35][36]. A major disadvantage of FPGAs is higher power consumption than an Application Specific Integrated Circuit (ASIC) counterpart of the same logic. FPGAs are also limited by the available on-chip logic, resulting in scalability issues for high performance computing applications.

To overcome shortcomings of the existing architectures in exploiting parallelisms available in DLA computations, we take a route of algorithm-architecture co-design where we ensure high performance along with energy, area efficiency, and scalability.

3 BLAS Realization on Off-the-shelf Processors

GEMM (Level-3 BLAS) and GEMV (Level-2 BLAS) are the most prominent routines in many engineering and scientific computations. These routines also have pedagogical importance due to their simple nature and are often used to evaluate emerging architectures. In this section, we first discuss the GEMM algorithm and then examine some of the recent realizations of GEMM. Based on the analysis of the profiling of the BLAS routines, we identify the pitfalls in extracting performance from GEMM on contemporary multicore and GPU platforms. We then decide to design our own customized platform that is capable of extracting performance in BLAS by efficiently exploiting the parallelism in BLAS.

3.1 GEMM and GEMV Algorithms

1:Allocate memories for input and output matrices and initialize input matrices
2:for i = 1 to n do
3:     for j = 1 to n do
4:         for k = 1 to n do
5:              C(i,j) = A(i,k)B(k,j) + C(i,j)
6:         end for
7:     end for
8:end for
Algorithm 1 GEMM - General Matrix Multiplication
1:Allocate memories for input matrix, input vector and output vector. Initialize input matrix and input vector
2:for i = 1 to n do
3:     for j = 1 to n do
4:         y(i) = A(i,j)x(j) + y(i)
5:     end for
6:end for
Algorithm 2 GEMV - Matrix Vector Multiplication
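For concreteness, a minimal C rendering of algorithms 1 and 2, assuming square matrices in row-major storage (the function names and layout are our own illustrative choices):

/* Algorithm 1: C = A*B + C for n-by-n row-major matrices. */
void gemm(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];   /* C(i,j) = A(i,k)B(k,j) + C(i,j) */
}

/* Algorithm 2: y = A*x + y for an n-by-n matrix and n-element vectors. */
void gemv(int n, const double *A, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            y[i] += A[i*n + j] * x[j];                   /* y(i) = A(i,j)x(j) + y(i) */
}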

Pseudo codes for GEMM and GEMV are described in algorithms 1 and 2 respectively. The GEMM algorithm has three loops and hence belongs to Level-3 BLAS, while GEMV belongs to Level-2 BLAS. For multiplying two matrices of size n×n, GEMM takes n³ multiplications and n³ additions, while GEMV takes n² multiplications and n² additions. Typically, GEMM and GEMV exhibit Instruction Level Parallelism (ILP) and Data Level Parallelism (DLP). GEMM also exhibits high data locality and is capable of sustaining a high computation-to-communication ratio. Altogether, if exploited efficiently and accelerated, GEMM and GEMV become ideal candidates to be used as basic building blocks for many high performance scientific applications. Since GEMM has three nested loops, these loops can be permuted to change the access pattern of the input matrices, as shown in table I.

Loop Order | Inner Loop | Middle Loop | Inner Loop Data Access
ijk | dot product | vector × matrix | A by row, B by column
jik | dot product | matrix × vector | A by row, B by column
ikj | saxpy | row gaxpy | B by row, C by row
jki | saxpy | column gaxpy | A by column, C by column
kij | saxpy | row outer product | B by row, C by row
kji | saxpy | column outer product | A by column, C by column
TABLE I: General Matrix Multiplication (GEMM): Loop Orderings and Access Patterns

In table I, saxpy stands for "scalar a times vector x plus vector y" and gaxpy stands for generalized saxpy [37][38]. Further details on GEMM can be found in [37] and [38].
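As an illustration of how the loop order changes the access pattern, the ikj ordering below turns the inner loop into a saxpy that streams B and C by rows; this is a sketch under the same row-major assumptions as above:

/* ikj ordering: the inner loop is a saxpy, accessing B and C by row. */
void gemm_ikj(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double a = A[i*n + k];            /* scalar reused across the inner loop */
            for (int j = 0; j < n; j++)
                C[i*n + j] += a * B[k*n + j]; /* rows of B and C traversed contiguously */
        }
}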

3.2 Performance Evaluation of GEMM and GEMV

For contemporary architectures, a highly efficient GEMM is realized as a subroutine in BLAS. There exist several vendor specific realizations of DGEMM. For our experiments, we take the DGEMM available in BLAS from The Netlib, and for evaluation on GPU we use MAGMA_DGEMM. We compile DGEMM for different Intel and AMD machines with different compiler options and evaluate the performance of DGEMM on these architectures. We evaluate MAGMA_DGEMM on a Tesla C2050.
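As a reference point for how such an evaluation is typically driven, the sketch below times one square DGEMM call through the Netlib CBLAS interface; the matrix size, initialization, and timing method are our own illustrative choices and not the paper's actual benchmark driver:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

int main(void)
{
    int n = 1024;   /* illustrative size, not taken from the paper */
    double *A = malloc(n * n * sizeof(double));
    double *B = malloc(n * n * sizeof(double));
    double *C = calloc(n * n, sizeof(double));
    for (int i = 0; i < n * n; i++) {
        A[i] = rand() / (double)RAND_MAX;
        B[i] = rand() / (double)RAND_MAX;
    }

    clock_t t0 = clock();
    /* C = 1.0*A*B + 0.0*C, row-major, no transposition */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("DGEMM: %.2f Gflops\n", 2.0 * n * n * n / secs / 1e9);
    free(A); free(B); free(C);
    return 0;
}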

(a) CPI in DGEMM on Intel Haswell and AMD Bulldozer Micro-architectures
(b) Gflops in DGEMM on Intel Haswell and AMD Bulldozer Micro-architectures
(c) CPI in DGEMM on Intel Haswell Micro-architecture with icc
(d) Gflops in DGEMM on Intel Haswell Micro-architecture with icc
(e) CPI in DGEMM on Intel Haswell Micro-architecture with icc and additional compiler switches
(f) Gflops in DGEMM on Intel Haswell Micro-architecture with icc and additional compiler switches
(g) Gflops in MAGMA_DGEMM and MAGMA_DGEMV on Nvidia Tesla C2050
(h) Gflops in DGEMV, DGEMM, MAGMA_DGEMM and MAGMA_DGEMV
(i) Gflops/watt in DGEMV, DGEMM, MAGMA_DGEMM and MAGMA_DGEMV on Nvidia Tesla C2050
Fig. 2: Performance of DGEMV and DGEMM on Different Micro-architectures

Figure 2(a) depicts the CPI of DGEMM when executed on the Intel Haswell and AMD Bulldozer micro-architectures. For the experimental results shown in figure 2(a), we have used the BLAS and CBLAS (CBLAS consists of C wrappers written around BLAS, where BLAS is written in Fortran) available from The Netlib, compiled with publicly available compilers. It can be observed in figure 2(a) that the CPI of DGEMM saturates for both Intel's Haswell and AMD's Bulldozer. The matrices that fit in the cache achieve a CPI that is lower than that of the matrices that do not fit in the cache; for larger matrices the attained CPI is slightly higher due to cache misses (just to re-emphasize: in the case of CPI, lower is better). It can be observed that even with DGEMM, which is a highly optimized routine of BLAS, the CPI achieved is nowhere close to the lower bound of the CPI of these architectures. A similar trend is observed when we consider Gflops as a performance metric, as shown in figure 2(b).

It can be observed in figure 2(b) that, for the matrices that fit in the cache memory, the attained Gflops is higher. For the larger matrices that do not fit in the cache memory, Gflops decreases due to cache misses. The attained performance is only 10-11% of the peak performance of these architectures.

One way to improve performance is to use vendor specific compilers, since they perform architecture aware optimizations. In order to further push the performance of DGEMM on the Intel Haswell micro-architecture, we use the Intel C Compiler (icc) for compiling the DGEMM routine in BLAS. The resulting improvement in CPI and Gflops is shown in figures 2(c) and 2(d) respectively.

It can be observed in figures 2(c) and 2(d) that the performance of DGEMM is still far from the lower bound of the CPI and the peak Gflops.

In the next set of experiments, we add further architecture-specific compiler switches while compiling with icc. The performance improvement due to these switches is shown in figures 2(e) and 2(f).

It can be observed from figures 2(e) and 2(f) that the additional compiler switches improve performance, and finally we are able to achieve 15-17% of the peak IPC (or CPI) and peak Gflops for DGEMM. The percentage of peak performance achieved on the Tesla C2050 is 4-5% for DGEMV and 55-57% for DGEMM, as depicted in figure 2(g), while the percentage of peak performance achieved on the Intel and Nvidia machines for DGEMV and DGEMM ranges from 5% to 57%, as shown in figure 2(h). Considering Gflops/watt as a performance parameter, DGEMV and DGEMM in BLAS achieve 0.14 Gflops/watt and 0.25 Gflops/watt respectively, while MAGMA_DGEMV and MAGMA_DGEMM achieve 0.03 Gflops/watt and 0.225 Gflops/watt respectively, as shown in figure 2(i). One more observation we make from figure 2(e) and VTune™ is that the use of the FMA-related compiler switch reduces the number of instructions by half. This is because of the use of FMA instructions in the generated assembly code. The reduction in the number of instructions leads to an increase in the CPI measured by VTune™. Although there is an increase in CPI, the performance in terms of Gflops is observed to improve, as shown in figure 2(f). Hence, the CPI measured by VTune™ cannot be considered a correct measure of performance. We therefore define Cycles-per-Flop (CPF) to be used instead of CPI, and Flops-per-Cycle (FPC) to be used instead of IPC, as follows:

$$\text{CPF} = \frac{\text{Total number of clock cycles consumed}}{\text{Total number of floating point operations performed}} \qquad (1)$$

We define FPC as follows:

$$\text{FPC} = \frac{\text{Total number of floating point operations performed}}{\text{Total number of clock cycles consumed}} = \frac{1}{\text{CPF}} \qquad (2)$$

CPF and FPC help us evaluate the performance of architectures and algorithms more effectively. This is because the granularity of the compute resources considered in CPF and FPC is at the level of a floating point operation and not at the level of a Fused Multiply-Add (FMA).
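As a small worked illustration of the two metrics (the cycle count and matrix size below are our own example numbers), the helper computes CPF and FPC for an n×n DGEMM, which performs 2n³ floating point operations:

#include <stdio.h>

/* CPF and FPC for an n-by-n DGEMM, which performs 2*n^3 floating point operations. */
void report_cpf_fpc(long long cycles, long long n)
{
    double flops = 2.0 * n * n * n;
    printf("CPF = %.3f, FPC = %.3f\n", cycles / flops, flops / cycles);
}

int main(void)
{
    report_cpf_fpc(3000000, 100);  /* e.g., 3e6 cycles for a 100x100 DGEMM gives CPF = 1.5 */
    return 0;
}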

Across the experiments, we observe that significant effort is needed to improve the performance of DGEMV and DGEMM on contemporary architectures, and yet the attained performance is not satisfactory. We address this challenge of extracting performance from DLA computations through algorithm-architecture co-design in the subsequent sections of this paper.

4 Analysis of BLAS and CFU

In this section, we present graph based analysis of several Level-1, Level-2, and Level-3 BLAS routines. We choose representative routines in all three levels of BLAS (due to the availability of a double precision floating point unit, we consider only routines with the prefix "d"; the leading "d" in a routine name such as DDOT denotes double precision).

4.1 Vector Operations (Level-1 BLAS)

Level-1 BLAS routines typically perform O(n) operations on vectors of size n, and the data movement required is also O(n). We analyze three representative vector routines here. Figure 3 represents the inner product of two vectors (DDOT), given by equation 3.

$$x^{T}y = \sum_{i=1}^{n} x_{i}\, y_{i} \qquad (3)$$

where x, y ∈ ℝⁿ. The DAG for DDOT is shown in figure 3.

Fig. 3: DAGs of the three Level-1 BLAS routines considered

The DDOT routine has application in matrix-vector and matrix-matrix multiplication.

A second Level-1 routine is given by equation 4, and its DAG is shown in figure 3.

(4)

A third routine is given by equation 5, and its DAG is also shown in figure 3.

(5)

It can be observed that the DAGs of DDOT and the Euclidean norm (DNRM2) are similar except for the presence of a square root in the DNRM2 routine. DNRM2 can be realized with the same multiplier and adder resources as DDOT. It can also be observed in figure 3 that the first level in the DAGs is multiplication, and all these multiplications can potentially be executed in parallel. The next levels of the DAGs of DDOT and DNRM2 are additions. Additions in each level of the DAGs can be performed simultaneously if the inputs are available.
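To make the DAG structure concrete, the sketch below computes a 4-element inner product exactly as the DAG suggests: one level of independent multiplications followed by a two-level adder tree (a sketch of the dataflow only, not of any specific hardware):

/* 4-element inner product arranged as the DAG: one multiply level, two add levels. */
double ddot4(const double x[4], const double y[4])
{
    double p0 = x[0] * y[0], p1 = x[1] * y[1];   /* level 1: independent multiplications */
    double p2 = x[2] * y[2], p3 = x[3] * y[3];
    double s0 = p0 + p1, s1 = p2 + p3;           /* level 2: independent additions */
    return s0 + s1;                              /* level 3: final addition */
}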

4.2 Matrix-vector Operations (Level-2 BLAS)

Matrix-vector operations typically perform O(n²) operations on an input matrix of size n×n and vectors of size n, and the data movement required is also O(n²). In our analysis we consider the double precision matrix-vector multiplication (DGEMV) routine of BLAS. DGEMV is given in equation 6.

$$y = \alpha A x + \beta y \qquad (6)$$

where A is a matrix of size n×n and x and y are vectors of size n. The DGEMV routine requires O(n²) multiplications and additions to compute the final vector y. In our DAG analysis we consider the matrix-vector multiplication Ax, since it is the compute intensive part of the routine and it is essential to exploit the parallelism in the matrix-vector multiplication to accelerate the DGEMV routine. DAGs for matrix-vector multiplication are shown in figure 4.

Fig. 4: DAG of DGEMV

It can be observed from the DAGs of matrix-vector multiplication that all the multiplications can be executed in parallel, and that matrix-vector multiplication can be realized as a series of inner product (DDOT) calls.
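A direct rendering of that observation in C: each output element is one inner product of a row of A with x (row-major layout and the helper name are our own assumptions):

/* y = A*x realized as one inner product per row of A (row-major A). */
static double ddot(int n, const double *x, const double *y)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

void gemv_by_ddot(int n, const double *A, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = ddot(n, &A[i*n], x);   /* row i of A against x */
}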

4.3 Matrix-matrix Operations (Level-3 BLAS)

Here we first review some matrix multiplication algorithms and present a graph based analysis of them. We briefly discuss the parallelism in the different matrix multiplication algorithms. Based on this analysis, we choose a matrix multiplication algorithm. In the following subsections, we present the design of a PE that can efficiently exploit the ILP in the chosen matrix multiplication algorithm.

4.3.1 Matrix Multiplication Algorithms

Over the years there have been several matrix multiplication algorithms proposed in the literature. In this subsection, we review and analyze Strassen's Matrix Multiplication (SMM), Winograd's Matrix Multiplication (WMM), and General Matrix Multiplication (GEMM). We consider A and B as the input matrices and C as the output matrix, where A, B, and C are partitioned into equal sized block matrices and C = AB.

4.3.2 Strassen’s Matrix Multiplication

The SMM algorithm is described in table II for a 2×2 block matrix.

Level 1 Level 2 Level 3 Level 4
TABLE II: Computations in the different Levels of DAGs of SMM at first step of Recursion in Block Matrix Multiplication

Typically, SMM has two steps: 1) a decompose step, and 2) a merge step. In the decompose step, the matrices are divided into 2×2 block matrices and M1 to M7 are computed. In the merge step, C11 to C22 are computed. Directed Acyclic Graphs (DAGs) for SMM are shown in figure 5. It can be observed in the DAGs of SMM that the computation of C11 to C22 depends on the computation of M1 to M7.

Fig. 5: DAGs of SMM, WMM, and GEMM for 2×2 Block Matrices

These dependencies in the DAGs of SMM result in a higher execution time for SMM. It can also be observed from the DAGs of SMM in figure 5 that M1 to M7 can potentially be executed in parallel, and C11 to C22 can also be executed in parallel. One way this parallelism can be exploited is through the processor pipeline and pipelined arithmetic units. The asymptotic complexity of SMM is O(n^{log₂7}) ≈ O(n^{2.807}).
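For reference, the decompose and merge computations of SMM on a 2×2 block partitioning, as given in the standard literature [37], are:

$$
\begin{aligned}
M_1 &= (A_{11}+A_{22})(B_{11}+B_{22}), & M_2 &= (A_{21}+A_{22})B_{11}, & M_3 &= A_{11}(B_{12}-B_{22}),\\
M_4 &= A_{22}(B_{21}-B_{11}), & M_5 &= (A_{11}+A_{12})B_{22}, & M_6 &= (A_{21}-A_{11})(B_{11}+B_{12}),\\
M_7 &= (A_{12}-A_{22})(B_{21}+B_{22}), & & & &
\end{aligned}
$$
$$
C_{11} = M_1+M_4-M_5+M_7,\quad C_{12} = M_3+M_5,\quad C_{21} = M_2+M_4,\quad C_{22} = M_1-M_2+M_3+M_6.
$$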

4.3.3 Winograd’s Matrix Multiplication

The WMM algorithm operates on the same principle as SMM, as shown in table III:

Level 1 Level 2 Level 3 Level 4 Level 5 Level 6
TABLE III: Computations in the different Levels of DAGs of WMM at first step of Recursion in Block Matrix Multiplication

It can be observed from the WMM algorithm that it takes 7 block matrix multiplications and 15 block matrix additions, unlike SMM where the number of block matrix multiplications is the same but the number of block matrix additions is 18. DAGs for WMM are shown in figure 5.

WMM has the same asymptotic complexity as SMM. In practical scenarios, the execution time of WMM is observed to be slightly less than that of SMM due to the fewer additions.

4.3.4 General Matrix Multiplication

GEMM for the multiplication of 2×2 block matrices A and B can be described by the following expressions:

$$C_{11} = A_{11}B_{11} + A_{12}B_{21},\quad C_{12} = A_{11}B_{12} + A_{12}B_{22},\quad C_{21} = A_{21}B_{11} + A_{22}B_{21},\quad C_{22} = A_{21}B_{12} + A_{22}B_{22}$$

The DAGs for this computation are shown in figure 5. It can be observed from the DAGs of GEMM that it takes 8 block multiplications and 4 block additions. The asymptotic complexity of GEMM is O(n³).

SMM and WMM have lower asymptotic complexities compared to GEMM. A major disadvantage of SMM and WMM is that they are most suitable for square matrices whose size is a power of two. For matrix sizes where this condition is not met, a complex matrix partitioning scheme is required. Hence, we adopt GEMM over SMM and WMM for the following reasons:

  • A complex partitioning scheme required for the matrices in SMM and WMM results in an intricate scheduling scheme for the blocks of the input matrices. A way to alleviate these complications is to pad the matrices. The padding introduces a few more computations, and it does not reduce the complexity of the implementation since a naive padding scheme is not efficient

  • GEMM has higher pedagogical importance than SMM and WMM. GEMM is highly preferred over SMM and WMM for evaluating emerging architectures due to its simple structure and ease of implementation

4.3.5 Anatomy of General Matrix Multiplication

To discuss the available parallelism in GEMM, we take a 4×4 matrix multiplication as an example.

DAGs for algorithm 1 are shown in figure 6 for the computation of the elements of a 4×4 product. It can be observed in figure 6 that all the multiplications in a block of the matrix can be computed in parallel. The dependencies are due to the accumulation of the multiplied elements of the input matrices.

Fig. 6: DAGs in GEMM for a 4×4 Matrix

Potentially, in the multiplication of matrices of size n×n, all the multiplications can be computed in parallel. In the case of a 4×4 matrix, all 16 elements of the result can be computed in parallel, as shown in figure 6. The accumulation process while computing the elements of the resultant matrix enforces dependencies that result in pipeline stalls. These pipeline stalls can be avoided by computing multiple elements in a parallel-pipelined manner.

1:Allocate memories for input and output matrices
2:for i = 1 to n/4 do
3:     for j = 1 to n/4 do
4:         for k = 1 to n/4 do
5:              C = BLOCK4ADD(BLOCK4MUL(A,B),C)
6:         end for
7:     end for
8:end for
Algorithm 3 Block General Matrix Multiplication

Algorithm 3 depicts DGEMM with 4×4 block matrix multiplication (assuming that the matrix dimensions are multiples of 4). In algorithm 3, BLOCK4MUL is a multiplication of matrices of size 4×4 and BLOCK4ADD is an addition of matrices of size 4×4. A pitfall of the unrolling scheme is the exigency of locally available registers: a fully unrolled block multiplication requires registers for all the operand and result elements. Hence, a large matrix cannot be fully unrolled due to the lack of locally available registers, but a small block of the matrix can be unrolled to exploit the fine grained parallelism in the block through the processor pipeline and pipelined arithmetic units. In our experiments with the PE explained in section 4.4, we have adopted a conservative approach where we have assumed space for intermediate results in the local registers, and hence we have considered a 4×4 block matrix with 64 registers of 64-bit width. In a parallel realization of GEMM, different blocks of C can be computed in parallel on different PEs. While realizing GEMM on a single PE, we try to exploit the parallelism that is available within a block of C, and in the parallel realization on REDEFINE, we try to exploit the parallelism across the blocks.
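A minimal C sketch of algorithm 3, assuming row-major storage and dimensions that are multiples of 4; BLOCK4MUL and BLOCK4ADD are fused into a single accumulation loop here for brevity:

/* Blocked DGEMM with 4x4 blocks; n is assumed to be a multiple of 4. */
void gemm_blocked4(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i += 4)
        for (int j = 0; j < n; j += 4)
            for (int k = 0; k < n; k += 4)
                /* C(i:i+3, j:j+3) += A(i:i+3, k:k+3) * B(k:k+3, j:j+3) */
                for (int ii = 0; ii < 4; ii++)
                    for (int jj = 0; jj < 4; jj++) {
                        double acc = 0.0;
                        for (int kk = 0; kk < 4; kk++)
                            acc += A[(i + ii)*n + (k + kk)] * B[(k + kk)*n + (j + jj)];
                        C[(i + ii)*n + (j + jj)] += acc;   /* the BLOCK4ADD of the BLOCK4MUL result */
                    }
}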

In the next section we present a PE design that skillfully exploits the parallelism that exists within a block of the matrix.

4.4 Processing Element Design

For the initial design of the PE, we consider a classical sequential architecture model. As a first design, we take fully pipelined double precision floating point adder and multiplier units as the compute resources. The architecture of the PE is shown in figure 7.

Fig. 7: PE Architecture

As shown in figure 7, the initial design of the PE consists of an Instruction Memory, a Decoder to decode the instructions, a Register File with 64 registers, and a pipelined double precision Floating Point Unit (FPU) [39][40]. The FPU consists of a multiplier, an adder, a square root unit, and a divider. For computing matrix multiplication with large matrices, we choose 4×4 as the block size. For matrices whose dimensions are not multiples of 4, we partition them into 4×4 blocks as far as possible, and for the residual matrix we perform unblocked multiplication. The operation of the PE can be described in the following steps:

  • Step 1: Bring the input matrices to the Register File by sending Load requests to the upper level of the memory

  • Step 2: Perform matrix multiplication

  • Step 3: Store back the resultant matrix to the upper level of memory

4.5 Simulation Environment and Initial Results

For simulations, we connect the PE shown in figure 7 to an external/global memory. Initially we use 64 registers (64-bit wide) and 16KB of instruction memory for our experiments. We model the global/external memory delay by placing a pipelined delay of several stages.

4.5.1 Initial Results

For our experiments, we choose five representative matrix sizes.

Matrix Size Experimental Latencies CPF Gflops/W
39000 1.625 16.66
310075 1.614 16.87
1040754 1.606 17.15
2457600 1.6 17.25
4770000 1.59 17.38
TABLE IV: Latencies, CPF and Efficiency

It can be observed in table IV that as we increase the matrix size, the CPF decreases and saturates at around 1.6, while the performance is observed to be around 17.3 Gflops/watt at 0.2 GHz. In other words, as we increase the matrix size, the FPC saturates at 62.5% of the peak floating point operations per cycle. Here, since we can potentially compute one multiplication and one addition in parallel, the peak FPC = 2.

In this section, we reviewed some matrix multiplication techniques. We discussed the asymptotic complexity and graph based analysis of three different matrix multiplication algorithms, and justified our choice of GEMM over the SMM and WMM algorithms. We presented further details of GEMM with the example of a 4×4 matrix and proposed an initial design of a PE that achieves a CPF of around 1.6 and a performance of 17.3 Gflops/watt. Intuitively, the performance of the PE can be improved methodically by enriching the PE with compute and memory resources.

5 Micro-architectural Enhancements in PE and Parallel Realization on REDEFINE

Based on the anatomy of GEMM and the design of the PE presented in section 4.4, in this section we dwell on micro-architectural enhancements of the PE. We methodically enhance the PE to improve its CPF. The PE described in section 4.4 is considered as a Floating Point Sequencer (FPS) in this section.

5.1 Introducing Local Memory and Load-Store CFU

The major drawbacks of the PE design presented in section 4.4 are the lack of overlap between computation and communication and the lack of exploitation of the data locality in GEMM. To address these issues, we introduce a Load-Store CFU that operates simultaneously with the FPS (depending on the availability of data) and facilitates overlap of computation and communication. We also place a Local Memory (LM) in the Load-Store CFU to exploit the data locality in GEMM. The enhanced PE design with FPS and Load-Store CFU is shown in figure 8.

Fig. 8: PE with FPS and Load-Store CFU

The introduction of an LM in the Load-Store CFU inside the PE improves data locality. This improved data locality creates further opportunities for exploiting higher ILP by increasing the compute resources in the FPS. The increased compute resources in the FPS in turn demand improved data availability in the Register File. In this section we present methodical architectural enhancements in the PE for reduction in latency (latency measured in clock cycles for the execution of GEMM). These enhancements ensure lower latency in the execution of GEMM, leading to an overlap between computation and communication of up to 90% in GEMM, and we are also able to achieve up to 74% of the peak FPC. To highlight the reduction in latency due to each architectural enhancement, we consider five representative matrix sizes for our experiments. The reduction in latency due to the introduction of the Load-Store CFU and LM (refer to figure 8) is shown in table V.

Matrix Size
Latency (in clock cycles) without LM 39000 312075 1040754 2457600 4770000
Latency (in clock cycles) with LM 23000 178471 595421 1410662 2730365
Improvement in Latency in terms of percentage 41% 42.5% 42.78% 42.6% 42.6%
Gflops/watt 14.87 15.53 15.77 15.81 15.98
TABLE V: Latencies of GEMM for the five matrix sizes considered (with LM and Load-Store CFU; PE Architectural Enhancement 1 (AE1))

It can be observed in table V that the introduction of the LM in the Load-Store CFU improves performance by roughly 1.7x, and as we increase the matrix size the performance improves further due to improved data locality.

5.2 Special Instructions

In the first enhancement, we try to exploit higher ILP by increasing the resources available to the FPS, which in turn improves performance significantly. This improved performance motivates us to improve the data availability in the Register File residing inside the FPS. We introduce two types of special instructions: 1) DOT instructions that are executed in the FPS on a specialized fully pipelined hardware structure, and 2) Block Data Load and Block Data Store instructions that are executed in the Load-Store CFU.

5.2.1 DOT Instruction

Since we support a block size of 4×4, we introduce hardware that can perform the inner product of a 4-element vector. The hardware structure to compute the 4-element vector inner product is shown in figure 9. We further make this hardware structure reconfigurable to support inner products of other vector lengths in order to support different matrix sizes. We name this unit a Reconfigurable Data-path (RDP). Through reconfiguration, the RDP can be re-cast to perform the macro operations encountered in some of the BLAS algorithms discussed in section 4. The RDP is shown in figure 9.

Fig. 9: RDP of the DOT Instruction and the different data-paths derived from different configurations of DOT

For larger matrices that do not fit in the Register File of the FPS, we use a block size of 4×4 for matrix multiplication. For matrices whose size is not a multiple of 4, we use the shorter-vector inner product configurations of the RDP. In this exposition, we restrict our experiments to matrices whose sizes are multiples of 4, and hence we use the 4-element inner product configuration (also termed the DOT4 configuration) of the RDP. The DOT4 configuration of the RDP has a deep pipeline, and assuming no pipeline stalls we can potentially maintain several DOT4 instructions in a state of execution. The DOT4 instruction leads to the exploitation of higher ILP within a 4×4 block of GEMM. The improvement in the latency of GEMM due to the DOT4 instruction is shown in table VI.

Matrix Size
Latency (in clock cycles) 15251 113114 371699 877124 1696921
Improvement over table V 33.7% 36.6% 37.57% 37.82% 37.85%
Gflops/watt 10.52 11.49 11.85 11.93 12.06
TABLE VI: Latencies of GEMM for the five matrix sizes considered (PE with Load-Store CFU and DOT instruction; PE Architectural Enhancement 2 (AE2))

It can be observed from table VI that as we increase the matrix size, the benefit due to the DOT instruction improves. This is due to improved exploitation of ILP in the FPS.
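To illustrate how a 4×4 block maps onto the DOT4 primitive, the sketch below expresses the block multiply as 16 four-element inner products, each corresponding to what the RDP evaluates in its DOT4 configuration (the C functions are our own illustration of the mapping, not the PE's instruction encoding):

/* 4x4 block multiply expressed as 16 DOT4 operations. */
static double dot4(const double a[4], const double b[4])
{
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];   /* what one DOT4 computes */
}

void block4_mul(const double A[4][4], const double Bt[4][4], double C[4][4])
{
    /* Bt holds B transposed so that each DOT4 reads two contiguous 4-element vectors. */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            C[i][j] += dot4(A[i], Bt[j]);
}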

5.2.2 Block Data Load and Block Data Store Instructions

We further aim to reduce the handshaking between the LM and the Global Memory (GM). This reduction in handshaking in turn improves data availability in the Register File. In order to reduce the handshaking between the PE and the next level of the memory, we introduce instructions that can load/store data in a block fashion. The performance improvement due to Block Data Load and Block Data Store is shown in table VII, where we have used a fixed block size for the transfer.

Matrix Size
Latency (in clock cycles) 12745 97136 324997 784838 1519083
Improvement over table VI 16.4% 14.1% 12.5% 10.51% 10.48%
Gflops/watt 12.59 13.38 13.56 13.33 13.47
TABLE VII: Latencies of GEMM for the five matrix sizes considered (PE with Load-Store CFU, DOT4, and Block Data Load/Store instructions; PE Architectural Enhancement 3 (AE3))

It can be observed from table VII that as we increase the matrix size, the benefit due to Block Data Load/Store does not improve; rather, the performance is observed to saturate. This is because of the constant block size used across all the matrix sizes. Supporting a larger block size is not possible due to the limited number of registers available in the Register File of the FPS. It can also be observed in table VII that the improvement gap between successive matrix sizes is decreasing and is likely to saturate at some point. Further experiments show that the gap saturates for larger matrix sizes.

5.3 Bandwidth Increase

The increased resources in the FPS improve performance substantially, and the reduced handshaking between the LM and the upper level of the memory improves performance by a further 10%. We still see a significant gap between our desired performance and the attained performance. The reason for this gap is mainly the under-utilization of the RDP configured as DOT4. In order to improve the resource utilization of the RDP, in this architectural enhancement we increase the bandwidth between the FPS and the Load-Store CFU to match the 4×4 block size supported in the FPS: we transfer several 64-bit words per transaction between the FPS and the Load-Store CFU, in contrast to the previous realization where we transferred a single 64-bit word. Communication between the FPS and the Load-Store CFU at this higher rate ensures better data availability in the Register File of the FPS. The performance improvement due to the increase in bandwidth is shown in table VIII.

Matrix Size
Latency (in clock cycles) 7079 52624 174969 422924 818178
Improvement over table VII 44.4% 45.8% 46.1% 46.12% 46.14%
Gflops/watt 22.67 24.71 25.19 24.95 25.02
TABLE VIII: Latencies of GEMM for the five matrix sizes considered (PE with Load-Store CFU, DOT and Block Data Load/Store instructions, and increased bandwidth; PE Architectural Enhancement 4 (AE4))

It can be observed in table VIII that as we increase the matrix size, the benefit due to the increased bandwidth between the FPS and the Load-Store CFU improves. This is mainly because of better utilization of the RDP (here configured as DOT4).

5.4 Pre-fetching

To improve the utilization of the RDP in the FPS further, we restructure the loops in GEMM. We re-write algorithm 1 as algorithm 4.

1:Allocate memories for input and output matrices
2:for i = 1 to n/4 do
3:     for j = 1 to n/4 do
4:         Load the blocks of A and B required for the first iteration of the inner loop
5:         for k = 1 to n/4 do
6:              C = BLOCK4ADD(BLOCK4MUL(A,B),C) and pre-fetch the blocks of A and B required for the next iteration
7:         end for
8:     end for
9:end for
Algorithm 4 General Matrix Multiplication with Pre-fetching
Fig. 10: Pre-fetching Matrix Elements for the Next Iteration [14]

Restructuring the loops in the algorithm allows us to pre-fetch the matrix block required for the next iteration. This results in better exploitation of the FPU pipeline through reduced instruction stalls in the FPS, as shown in figure 10. The shaded portion in figure 10 depicts the reduction in the instruction stalls in the FPS when the block of the matrix required for computation in the next iteration is pre-fetched. Figure 10 has two portions: 1) before pre-fetching, and 2) after pre-fetching. The arrows in the figure depict the execution of different types of operations, such as computations in the FPS and loading/storing of data from/to the GM (or EM).

Matrix Size
Latency (in clock cycles) 5561 38376 124741 298161 573442
Improvement over table VIII 21.44% 27.07% 28.70% 29.5% 29.9%
Gflops/watt 28.86 33.88 35.33 35.11 35.70
TABLE IX: Latencies of GEMM for the five matrix sizes considered (PE with Load-Store CFU, DOT and Block Data Load/Store instructions, increased bandwidth, and data pre-fetching; PE Architectural Enhancement 5 (AE5))

The improvement attained by pre-fetching is shown in table IX. It can be observed in table IX that as we increase the matrix size, the benefit due to pre-fetching increases. This is mainly because of the improvement in data availability in the Register File of the FPS.
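The restructuring amounts to software double buffering: while one pair of blocks is being multiplied, the pair needed for the next iteration is already being fetched. A host-side C sketch of the idea (the buffer management and the fetch helper are our own abstraction of what the Load-Store CFU does in hardware):

/* Double-buffered blocked DGEMM: compute on buffer `cur` while buffer `nxt` is filled. */
static void fetch_block(double dst[16], const double *M, int n, int r, int c)
{
    for (int ii = 0; ii < 4; ii++)
        for (int jj = 0; jj < 4; jj++)
            dst[ii*4 + jj] = M[(r + ii)*n + (c + jj)];
}

void gemm_prefetched(int n, const double *A, const double *B, double *C)
{
    double Abuf[2][16], Bbuf[2][16];
    for (int i = 0; i < n; i += 4)
        for (int j = 0; j < n; j += 4) {
            int cur = 0;
            fetch_block(Abuf[cur], A, n, i, 0);        /* blocks for the first iteration */
            fetch_block(Bbuf[cur], B, n, 0, j);
            for (int k = 0; k < n; k += 4) {
                int nxt = cur ^ 1;
                if (k + 4 < n) {                       /* pre-fetch blocks for the next iteration */
                    fetch_block(Abuf[nxt], A, n, i, k + 4);
                    fetch_block(Bbuf[nxt], B, n, k + 4, j);
                }
                for (int ii = 0; ii < 4; ii++)         /* compute on the current buffers */
                    for (int jj = 0; jj < 4; jj++)
                        for (int kk = 0; kk < 4; kk++)
                            C[(i+ii)*n + (j+jj)] += Abuf[cur][ii*4+kk] * Bbuf[cur][kk*4+jj];
                cur = nxt;
            }
        }
}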

(a) Reduction in the Latency in DGEMM Due to Architecture Enhancements
(b) Latencies Normalized to Total Computations in terms of DOT4 in DGEMM
(c) Cycles per Floating Point Operation in DGEMM
(d) Floating Point Operations per Cycle in DGEMM
(e) Percentage of Peak FPC in DGEMM with Each Architectural Enhancement
(f) Percentage of Peak FPC in DGEMV with Each Architectural Enhancement
(g) Percentage of Peak FPC in DDOT with Each Architectural Enhancement
(h) Gflops/watt for DGEMM at 0.2GHz, 0.33GHz, 0.95GHz, and 1.81GHz
(i) Gflops/mm² for DGEMM at 0.2GHz, 0.33GHz, 0.95GHz, and 1.81GHz
(j) Performance Comparison of REDEFINE-PE with Other Platforms
(k) Simulation Environment where a Tile array is used for realizing DGEMM
Fig. 11: Performance of DGEMM

Collectively, the reduction in the execution cycles of GEMM as we perform the different architectural enhancements in the PE can be seen in figure 11(a). It can be clearly observed that we finally obtain speed-ups of 7x, 8.13x, and 8.34x for the three smallest matrix sizes considered [14].

It can be observed from figure 11(b) that as we perform the different architectural enhancements, the ratio of latency to total computations reduces. If we denote the ratio of latency to total computations (in terms of DOT4 operations) by η, then

$$\eta = \frac{\text{Latency (in clock cycles)}}{\text{Total computations (in terms of DOT4 operations)}} \qquad (7)$$

It can be observed in figure 11(b) that as we increase the matrix size, η asymptotically approaches 1. η = 1 is the case where there is a complete overlap of computation and communication. Complete overlap of computation and communication is not possible in a real-life scenario, and hence η can never become 1.

Figure 11(c) depicts the CPF for three representative matrix sizes. It can be observed in figure 11(c) that as we perform the enhancements, the CPF tends to decrease, and this trend is observed across all the matrices. Figure 11(d) depicts the FPC (where FPC = 1/CPF). It can be observed from figure 11(d) that as we perform the enhancements in the PE, the FPC improves dramatically. Although CPF and FPC are good measures of the performance of a PE, they do not convey enough information about how efficiently the compute resources in the PE are utilized.

The percentage of peak FPC attained in the PE after every enhancement is shown in figure 11(e). Figure 11(e) depicts an interesting trend where the percentage of peak FPC first reduces drastically and then, with further architectural enhancements, improves. In the first enhancement, where we place the Load-Store CFU with LM for overlap of computation and communication, the achieved FPC saturates at 54% of the peak FPC. We strongly intend to break this saturation point, since 54% of the peak is not a satisfactory performance. In order to break this saturation point, we further enhance the FPS with compute resources, which raises the theoretical peak FPC; the increase in peak FPC is due to the DOT4 instruction. At this point the achieved fraction of the peak FPC drops, as shown in figure 11(e), because of the increased compute resources. Our further enhancements help us improve the utilization of the increased resources in the FPS and achieve up to 74% of the peak FPC of the PE.

We presented methodical architectural enhancements to improve the performance of the PE. Through architectural customizations, we could break the saturation point at 54% and improve the performance of the PE. In other words, we showed that the performance of the algorithms can be improved by customizations that are specific to the algorithms. Here, we showed this with the example of GEMM, which is a Level-3 BLAS routine. Finally, 35.7 Gflops/watt is achieved in the PE through careful realization of Level-3 BLAS.

5.5 Parallel Realization of Level-3 BLAS

For the parallel realization of DGEMM, our two different simulation environments are shown in figure 11(k). In figure 11(k), the shaded portion of the Tile array (except the last column) is used for the computations where we realize DGEMM, while the last column is used for storing the input and output matrices. We use Octave for generating the input matrices. Similarly, smaller portions of the Tile array are used for realizing DGEMM in the other configurations.

In our experiments, if the output matrix is of size n×n, we divide it into blocks of size (n/p)×(n/p), where p×p is the portion of the Tile array used for compute. For example, if we use a p×p portion of the Tile array to compute DGEMM, the output matrix is divided into p² block matrices and each Tile computes one block of size (n/p)×(n/p). In particular, if the Tile array used for computing DGEMM is of size 4×4, then a block of size (n/4)×(n/4) is computed in each Tile of REDEFINE.

Fig. 12: Speed-up in REDEFINE for DGEMM for Different Configurations

The speed-up attained in REDEFINE when Tile arrays of different sizes are used is shown in figure 12. It can be observed from figure 12 that, for each configuration, the speed-up over the single-PE realization approaches the number of compute Tiles used as we increase the matrix size; for the largest Tile array of size 4×4, the speed-up attained approaches 16. For small matrices, communication with the last column of the Tile array dominates over the computations in the Tiles, since the computation to communication ratio in each Tile grows with the size of the block it computes, as sketched below. One more observation we make here is that, as we increase the matrix size, the speed-up of REDEFINE over the PE saturates. This is because of the saturation in the parallelism exploited by the PE that is attached in each Tile of REDEFINE.
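A back-of-the-envelope version of this argument (our own sketch, assuming each compute Tile produces an (n/p)×(n/p) block of the n×n result and streams the corresponding row and column panels of A and B from the storage Tiles):

$$
\frac{\text{computations per Tile}}{\text{loads/stores per Tile}} \approx \frac{2\,(n/p)^{2}\,n}{2\,(n/p)\,n + (n/p)^{2}} \approx \frac{n}{p}
$$

so the computation to communication ratio grows linearly with the matrix size, which is why the speed-up approaches the Tile count only for sufficiently large matrices.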

We also compare the PE presented in section 4.4, with all the enhancements applied, against other platforms. We use the estimation methodology presented in [31], [41], and [26] for a fair comparison of the platforms. As shown in figure 11(j), it can be observed that the performance of the PE is 40-140x better than Intel Core architectures and 7-139x better than Nvidia GPUs. Compared to an Altera FPGA, the PE is 10x better in terms of Gflops/watt, while compared to the ClearSpeed CSX700 it is almost 3x better.

6 Conclusion

While recent realizations of matrix computations focus on architectural customization for DLA, in this paper we presented a novel algorithm-architecture co-design approach for breaking the performance saturation point in BLAS, together with systematic enhancements in the micro-architecture to exploit the underlying compute and memory resources more efficiently. In the algorithm-architecture co-design, we first realized Level-3 BLAS on off-the-shelf processors and showed that the performance achieved by these processors in DGEMM is not satisfactory: with the best optimization efforts, the performance saturates at 15-17% of the peak on off-the-shelf Intel and AMD processors at 65W, and at 57% on an Nvidia Tesla C2050 GPGPU. This unsatisfactory performance is mainly due to inefficiently exploited compute and memory resources of the underlying platform. We saw an opportunity to break this performance saturation point with a custom architecture, performed an analysis of BLAS, and designed a PE that can execute BLAS routines at much higher Gflops/watt. We further enhanced this PE with several features so that the compute and memory resources in the PE are exploited efficiently, achieving 35.7 Gflops/watt, which is much higher than off-the-shelf Intel and AMD processors and GPGPUs. We attached this PE to REDEFINE for parallel realization of BLAS and showed that the speed-up achieved in the parallel realization is commensurate with the number of Tiles used, demonstrating that our solution is scalable.

References

  • [1] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK Users’ Guide, 3rd ed.   Philadelphia, PA: Society for Industrial and Applied Mathematics, 1999.
  • [2] E. Anderson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney, J. Du Croz, S. Hammerling, J. Demmel, C. Bischof, and D. Sorensen, “Lapack: A portable linear algebra library for high-performance computers,” in Proceedings of the 1990 ACM/IEEE Conference on Supercomputing, ser. Supercomputing ’90.   Los Alamitos, CA, USA: IEEE Computer Society Press, 1990, pp. 2–11. [Online]. Available: http://dl.acm.org/citation.cfm?id=110382.110385
  • [3] J. Kurzak, P. Luszczek, A. YarKhan, M. Faverge, J. Langou, H. Bouwmeester, and J. Dongarra, “Multithreading in the plasma library,” 2013.
  • [4] B. J. Smith, “R package magma: Matrix algebra on gpu and multicore architectures, version 0.2.2,” September 3, 2010, [On-line] http://cran.r-project.org/package=magma.
  • [5] N. Farahini, S. Li, M. Tajammul, M. Shami, G. Chen, A. Hemani, and W. Ye, “39.9 gops/watt multi-mode cgra accelerator for a multi-standard basestation,” in Circuits and Systems (ISCAS), 2013 IEEE International Symposium on, May 2013, pp. 1448–1451.
  • [6] F. Bouwens, M. Berekovic, A. Kanstein, and G. Gaydadjiev, “Architectural exploration of the adres coarse-grained reconfigurable array,” in Proceedings of the 3rd International Conference on Reconfigurable Computing: Architectures, Tools and Applications, ser. ARC’07.   Berlin, Heidelberg: Springer-Verlag, 2007, pp. 1–13. [Online]. Available: http://dl.acm.org/citation.cfm?id=1764631.1764633
  • [7] F. A. G. N. S K Nandy, Ranjani Narayan, “Efficient QR Decomposition Using Low Complexity Column-wise Givens Rotation (CGR),” VLSI Design 2014, pp. 258–263.
  • [8] J. York and D. Chiou, “On the asymptotic costs of multiplexer-based reconfigurability,” in The 49th Annual Design Automation Conference 2012, DAC ’12, San Francisco, CA, USA, June 3-7, 2012, 2012, pp. 790–795. [Online]. Available: http://doi.acm.org/10.1145/2228360.2228503
  • [9] G. Ansaloni, P. Bonzini, and L. Pozzi, “Egra: A coarse grained reconfigurable architectural template,” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 19, no. 6, pp. 1062–1074, June 2011.
  • [10] Z. E. Rákossy, F. Merchant, A. A. Aponte, S. K. Nandy, and A. Chattopadhyay, “Efficient and scalable cgra-based implementation of column-wise givens rotation,” in ASAP, 2014, pp. 188–189.
  • [11] Z. E. Rákossy, F. Merchant, A. A. Aponte, S. K. Nandy, and A. Chattopadhyay, “Scalable and energy-efficient reconfigurable accelerator for column-wise givens rotation,” in 22nd International Conference on Very Large Scale Integration, VLSI-SoC, Playa del Carmen, Mexico, October 6-8, 2014, 2014, pp. 1–6. [Online]. Available: http://dx.doi.org/10.1109/VLSI-SoC.2014.7004166
  • [12] M. Alle, K. Varadarajan, A. Fell, R. R. C., N. Joseph, S. Das, P. Biswas, J. Chetia, A. Rao, S. K. Nandy, and R. Narayan, “REDEFINE: Runtime reconfigurable polymorphic asic,” ACM Trans. Embed. Comput. Syst., vol. 9, no. 2, pp. 11:1–11:48, Oct. 2009. [Online]. Available: http://doi.acm.org/10.1145/1596543.1596545
  • [13] J. Nimmy, C. R. Reddy, K. Varadarajan, M. Alle, A. Fell, S. K. Nandy, and R. Narayan, “Reconnect: A noc for polymorphic asics using a low overhead single cycle router,” in ASAP, 2008, pp. 251–256.
  • [14] F. Merchant, A. Maity, M. Mahadurkar, K. Vatwani, I. Munje, M. Krishna, S. Nalesh, N. Gopalan, S. Raha, S. Nandy, and R. Narayan, “Micro-architectural enhancements in distributed memory cgras for lu and qr factorizations,” in VLSI Design (VLSID), 2015 28th International Conference on, Jan 2015, pp. 153–158.
  • [15] J. Dongarra, “The linpack benchmark: An explanation,” in Proceedings of the 1st International Conference on Supercomputing.   London, UK, UK: Springer-Verlag, 1988, pp. 456–474. [Online]. Available: http://dl.acm.org/citation.cfm?id=647970.742568
  • [16] A. YarKhan, J. Kurzak, and J. Dongarra, “Quark users’ guide: Queueing and runtime for kernels,” Innovative Computing Laboratory, University of Tennessee, Tech. Rep., 2011.
  • [17] F. G. Van Zee, libflame: The Complete Reference.   lulu.com, 2009.
  • [18] P. Bientinesi, J. Gunnels, M. Myers, E. Quintana-Ortí, T. Rhodes, R. van de Geijn, and F. Van Zee, “Deriving dense linear algebra libraries,” Formal Aspects of Computing, vol. 25, no. 6, pp. 933–945, 2013. [Online]. Available: http://dx.doi.org/10.1007/s00165-011-0221-4
  • [19] R. C. Whaley and A. Petitet, “Minimizing development and maintenance costs in supporting persistently optimized BLAS,” Software: Practice and Experience, vol. 35, no. 2, pp. 101–121, February 2005, http://www.cs.utsa.edu/~whaley/papers/spercw04.ps.
  • [20] R. C. Whaley, A. Petitet, and J. J. Dongarra, “Automated empirical optimization of software and the ATLAS project,” Parallel Computing, vol. 27, no. 1–2, pp. 3–35, 2001, also available as University of Tennessee LAPACK Working Note #147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).
  • [21] Q. Wang, X. Zhang, Y. Zhang, and Q. Yi, “Augem: Automatically generate high performance dense linear algebra kernels on x86 cpus,” in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC ’13.   New York, NY, USA: ACM, 2013, pp. 25:1–25:12. [Online]. Available: http://doi.acm.org/10.1145/2503210.2503219
  • [22] Z. Xianyi, W. Qian, and Z. Yunquan, “Model-driven level 3 blas performance optimization on loongson 3a processor,” in Parallel and Distributed Systems (ICPADS), 2012 IEEE 18th International Conference on, Dec 2012, pp. 684–691.
  • [23] L. Wang, W. Wu, Z. Xu, J. Xiao, and Y. Yang, “BLASX: A high performance level-3 BLAS library for heterogeneous multi-gpu computing,” in Proceedings of the 2016 International Conference on Supercomputing, ICS 2016, Istanbul, Turkey, June 1-3, 2016, 2016, pp. 20:1–20:11. [Online]. Available: http://doi.acm.org/10.1145/2925426.2926256
  • [24] L. Djinevski, S. Ristov, and M. Gusev, “Superlinear speedup for matrix multiplication in gpu devices,” in ICT Innovations 2012.   Springer Berlin Heidelberg, 2013, pp. 285–294.
  • [25] F. Merchant, T. Vatwani, A. Chattopadhyay, S. Raha, S. K. Nandy, and R. Narayan, “Achieving efficient qr factorization by algorithm-architecture co-design of householder transformation,” in 29th International Conference on VLSI Design, VLSID 2016, Kolkata, India, January 4-8, 2016.
  • [26] A. Pedram, A. Gerstlauer, and R. A. van de Geijn, “Algorithm, architecture, and floating-point unit codesign of a matrix factorization accelerator,” IEEE Trans. Computers, vol. 63, no. 8, pp. 1854–1867, 2014. [Online]. Available: http://dx.doi.org/10.1109/TC.2014.2315627
  • [27] T. Chen, R. Raghavan, J. Dale, and E. Iwata, “Cell broadband engine architecture and its first implementation - a performance view,” IBM Journal of Research and Development, vol. 51, no. 5, pp. 559–572, Sept 2007.
  • [28] S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, and K. Yelick, “Scientific computing kernels on the cell processor,” Int. J. Parallel Program., vol. 35, no. 3, pp. 263–298, Jun. 2007. [Online]. Available: http://dx.doi.org/10.1017/s10766-007-0034-5
  • [29] Y. Nishikawa, M. Koibuchi, M. Yoshimi, A. Shitara, K. Miura, and H. Amano, “Performance analysis of clearspeed’s CSX600 interconnects,” in IEEE International Symposium on Parallel and Distributed Processing with Applications, ISPA 2009, Chengdu, Sichuan, China, 10-12 August 2009, 2009, pp. 203–210. [Online]. Available: http://doi.ieeecomputersociety.org/10.1109/ISPA.2009.102
  • [30] Y. Nishikawa, M. Koibuchi, M. Yoshimi, K. Miura, and H. Amano, “Performance improvement methodology for clearspeed’s CSX600,” in 2007 International Conference on Parallel Processing (ICPP 2007), September 10-14, 2007, Xi-An, China, 2007, p. 77. [Online]. Available: http://dx.doi.org/10.1109/ICPP.2007.66
  • [31] A. Pedram, S. Z. Gilani, N. S. Kim, R. A. van de Geijn, M. J. Schulte, and A. Gerstlauer, “A linear algebra core design for efficient level-3 blas,” in ASAP, 2012, pp. 149–152.
  • [32] J. Gonzalez and R. C. Núñez, “Lapackrc: Fast linear algebra kernels/solvers for fpga accelerators,” Journal of Physics: Conference Series, vol. 180, no. 1, p. 012042, 2009. [Online]. Available: http://stacks.iop.org/1742-6596/180/i=1/a=012042
  • [33] Y.-G. Tai, C.-T. D. Lo, and K. Psarris, “Scalable matrix decompositions with multiple cores on {FPGAs},” Microprocessors and Microsystems, vol. 37, no. 8, Part B, pp. 887 – 898, 2013, embedded Multicore Systems: Architecture, Performance and Application. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0141933112001007
  • [34] S. Kestur, J. D. Davis, and O. Williams, “Blas comparison on fpga, cpu and gpu,” in Proceedings of the 2010 IEEE Annual Symposium on VLSI, ser. ISVLSI ’10.   Washington, DC, USA: IEEE Computer Society, 2010, pp. 288–293. [Online]. Available: http://dx.doi.org/10.1109/ISVLSI.2010.84
  • [35] L. Zhuo and V. K. Prasanna, “Scalable and modular algorithms for floating-point matrix multiplication on reconfigurable computing systems,” IEEE Trans. Parallel Distrib. Syst., vol. 18, no. 4, pp. 433–448, Apr. 2007. [Online]. Available: http://dx.doi.org/10.1109/TPDS.2007.1001
  • [36] P. Zicari, P. Corsonello, S. Perri, and G. Cocorullo, “A matrix product accelerator for field programmable systems on chip,” Microprocess. Microsyst., vol. 32, no. 2, pp. 53–67, Mar. 2008. [Online]. Available: http://dx.doi.org/10.1016/j.micpro.2007.05.002
  • [37] G. H. Golub and C. F. Van Loan, Matrix computations (3rd ed.).   Baltimore, MD, USA: Johns Hopkins University Press, 1996.
  • [38] N. J. Higham, Accuracy and Stability of Numerical Algorithms, 2nd ed.   Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2002.
  • [39] F. Merchant, N. Choudhary, S. K. Nandy, and R. Narayan, “Efficient realization of table look-up based double precision floating point arithmetic,” in 29th International Conference on VLSI Design, VLSID 2016, Kolkata, India, January 4-8, 2016.
  • [40] F. Merchant, A. Chattopadhyay, S. Raha, S. K. Nandy, and R. Narayan, “Accelerating BLAS and LAPACK via efficient floating point architecture design,” CoRR, vol. abs/1610.08705, 2016. [Online]. Available: http://arxiv.org/abs/1610.08705
  • [41] A. Pedram, R. A. van de Geijn, and A. Gerstlauer, “Codesign tradeoffs for high-performance, low-power linear algebra architectures,” IEEE Trans. Computers, vol. 61, no. 12, pp. 1724–1736, 2012. [Online]. Available: http://doi.ieeecomputersociety.org/10.1109/TC.2012.132