1 Introduction
Several engineering and scientific applications require the solution of dense linear systems of equations and linear least-squares problems, where matrix factorizations like LU, QR, and Cholesky play a pivotal role. Traditionally, routines for these factorizations that are part of the Linear Algebra Package (LAPACK) are written as a series of Basic Linear Algebra Subprogram (BLAS) calls [1][2]. A pictorial representation of the double precision QR factorization routines DGEQR2 and DGEQRF, which are part of LAPACK, is shown in figure 1.
From the pictorial representation it can be observed that DGEQR2 is dominated by matrix-vector operations (DGEMV in BLAS) and DGEQRF is dominated by DGEQR2 and matrix-matrix operations (DGEMM in BLAS). Our experiments for DGEQR2 on an Intel Core i7 and observation using Intel VTune™ suggest that DGEMV executes for 99% of the total time of DGEQR2 while the double precision inner product (DDOT) executes for hardly 1% of the total time. Similarly, DGEQRF is dominated by DGEMM, which runs for 99% of the total time of DGEQRF while DGEQR2 runs for 1% of the time. Similar observations can be made in routines like XGETRF (double/single precision LU factorization) and XPBTRF (double/single precision Cholesky factorization). Considering the importance of BLAS in LAPACK, it is arguably one of the most interesting research problems to accelerate BLAS.
For acceleration of BLAS, a library based approach is usually adopted. Based on the reference BLAS and LAPACK available on Netlib, several vendor libraries have been developed: Intel Math Kernel Library (MKL), IBM's Engineering and Scientific Subroutine Library (ESSL), AMD Core Math Library (ACML), Nvidia's CUDA Linear Algebra (CULA), where CULA Dense is for Dense Linear Algebra (DLA) and CULA Sparse is for Sparse Linear Algebra (SLA), and cuBLAS, which is yet another CUDA Basic Linear Algebra Subprograms implementation. There are also several open source packages for multicore and General Purpose Graphics Processing Unit (GPGPU) based realizations, like Parallel Linear Algebra Software for Multicore Architectures (PLASMA) and Matrix Algebra on GPU and Multicore Architectures (MAGMA), that use BLAS as a basic building block. All these packages are developed for realization of DLA computations on multicores and GPGPUs in the most efficient way. PLASMA and MAGMA incorporate tiled algorithms that are capable of exploiting the memory hierarchy efficiently [3][4]. Despite all the efforts directed towards acceleration of DLA computations, the performance attained by these platforms is as low as 15-17% of the theoretical peak in multicores and 55-57% of the theoretical peak in GPGPUs, at 65W and 240W power consumption respectively. Considering the inability of GPGPU and multicore architectures to exploit the parallelism available in BLAS, we recommend algorithm-architecture co-design for BLAS as a solution for efficient realization of DLA. The performance of several recent realizations is discussed in detail in section 2.
Recently, Coarse-grained Reconfigurable Architectures (CGRAs) have gained popularity due to their power, performance, and flexibility [5][6][7]. The performance advantage in CGRAs is attained by supporting a selected number of datapaths out of all possible datapaths, and hence they occupy the middle ground between Application Specific Integrated Circuits (ASICs) and Field Programmable Gate Arrays (FPGAs) [8][9][10][11]. CGRAs like REDEFINE have the special feature that they can be customized for application domains, where several datapaths belonging to a particular domain of interest are realized as a reconfigurable ASIC [12]. In REDEFINE, several Tiles are connected through a Network-on-Chip (NoC), where Custom Function Units (CFUs) tailored for a particular domain decide the performance of the overall system for the application domain [12][13]. REDEFINE is shown in figure 11(k) along with several Tiles in a simulation environment.
Major contributions in this paper are as follows:

Firstly, we present an evaluation of legacy BLAS on off-the-shelf Intel/AMD processors and Nvidia GPGPUs, with Cycles-per-Instruction (CPI) and Gflops/watt (Giga flops per watt) based analysis. Through detailed experiments it is shown that, with best efforts, the performance achieved for BLAS on Intel/AMD and Nvidia GPGPUs is between 0.02 to 0.25 Gflops/watt

We present Directed Acyclic Graph (DAG) based analysis of representative routines of Level-1, Level-2, and Level-3 BLAS and discuss the available parallelism and possible data locality in these routines. We also identify macro operations and realize them on a Reconfigurable Datapath (RDP). Based on our analysis, we arrive at the design of a Processing Element (PE)

Several architectural enhancements are performed in the PE presented in [14] for improving the throughput of BLAS by exploiting parallelism and data locality. These enhancements result in efficient realization of sequential BLAS in the PE. In this exposition, we have extended the scope of experiments to accommodate larger matrix sizes and to bring more clarity on the saturation in the attained performance after each enhancement

It is shown that through algorithm-architecture co-design, we are able to break the saturation point with each enhancement and improve the overall performance of BLAS in the PE. With each enhancement, we show that we are able to push the saturation point towards the theoretical peak performance of the PE at very high energy efficiency

We attach the PE to the Routers in REDEFINE for parallel realization of BLAS and show algorithmic and architecture scalability
The organization of the paper is as follows: in section 2, some of the recent multicore, GPGPU, and custom realizations of BLAS are discussed. In section 3, we present realization of legacy BLAS on multicore and GPGPU platforms along with CPI and energy efficiency analysis. In section 4, DAG based analysis of Level-1, Level-2, and Level-3 BLAS is presented and we derive preliminary specifications of a PE. Architectural enhancements in the PE for improvement in the throughput of BLAS, and parallel realization of BLAS on REDEFINE where we attach the PE as a CFU, are presented in section 5, and the work is summarized in section 6.
2 Related Work
Over the years there have been several efficient realizations of BLAS due to its applicability in high performance scientific applications. In this section, we survey several multicore, GPU based, and custom realizations of BLAS. We consider FPGA based realizations as custom realizations.
2.1 Software Packages for Multicore Platforms
The first ever software library for performing linear algebra computations, LINPACK, was developed in the 1970s and early 1980s [15]. Subsequently, LINPACK, which used Level-1 BLAS as a basic building block, was superseded by LAPACK, which uses Level-3 BLAS as a basic building block [2]. In recent years, with the arrival of multicore architectures, there have been several advancements in the parallel realization of LAPACK. One such effort is PLASMA, which can perform computations on multicore architectures with the help of the dynamic scheduler Queuing and Runtime for Kernels (QUARK). PLASMA creates a pipeline model for parallel execution by dynamic scheduling of BLAS kernels on the multicore platform [16][3]. The Formal Linear Algebra Method Environment (FLAME) focuses on issues related to the programming of linear algebra programs. The focus of the FLAME project is to automatically generate efficient linear algebra codes for the underlying platform [17][18]. Under the umbrella of the FLAME project, BLAS-like Library Instantiation Software (BLIS) focuses on rapid scheduling of BLAS-like kernels on multicore architectures. Automatically Tuned Linear Algebra Software (ATLAS) is a mature open source package that generates BLAS for the underlying platform [19][20]. ATLAS relies on legacy BLAS for generation of efficient code for the underlying platform, where several parameters are tweaked to suit the platform. OpenBLAS is another open source package that focuses on efficient realization of DLA computations [21][22]. OpenBLAS relies on GotoBLAS for performance, where GotoBLAS is a set of assembly programs written for DLA computations. A major shortcoming of packages like LAPACK, PLASMA, BLIS, ATLAS, and OpenBLAS is the lack of support from the underlying platform, resulting in 15-20% of the theoretical peak performance.
2.2 GPU Based Realizations
GPUs, though originally designed for graphics processing, are highly suitable for general purpose computing. There have been several packages developed to perform efficient BLAS functionality on GPUs. The most prominent of them all is the MAGMA software package [4]. MAGMA relies on MAGMA BLAS for performance, where the performance of MAGMA DGEMM is observed to be 57% of the peak performance of the Tesla C2050 GPU with a theoretical double precision peak of 512 Gflops. KAUST BLAS (KBLAS) is one of the most recent and ongoing research projects at KAUST. KBLAS internally relies on cuBLAS, developed by Nvidia, for performance on the GPU. BLASX, presented in [23], focuses on optimization of Level-3 BLAS in a multi-GPU environment. BLASX minimizes global communication through a two level hierarchical tile cache structure and achieves 92.68% of the in-core cuBLAS DGEMM performance. BLASX also contains better load balancing techniques compared to MAGMA and cuBLAS. Despite the elegant scheduling technique and efficient exploitation of the memory hierarchy in BLASX, the performance achieved by BLASX is limited by cuBLAS DGEMM. In [24], the cache memory requirement for achieving superlinear speedup in XGEMM is studied in detail. The study presented in [24] does not mention whether the data type is single or double precision. All the GPU based realizations of BLAS fail to achieve high performance due to lack of support for GEMM primitives.
2.3 Custom Realizations
Customized accelerators are a class of architectures that are tuned for low energy and area at high throughput for a domain of interest [25][26]. The Cell Broadband Engine (CBE) from International Business Machines (IBM) is a high performance architecture designed around the PowerPC core [27]. Due to the energy efficiency of the CBE, it is viewed as an ideal platform for scientific computing [28]. ClearSpeed's CSX architecture is the backbone of the ClearSpeed CSX600 and CSX700 processors. These processors have very high energy efficiency and operate at around 12 Watts with a theoretical peak of 96 GFlops [29][30][31]. A major shortcoming of ClearSpeed's CSX and CBE architectures is low Gflops/W and low performance per unit area [31].
There have been several attempts at viewing Field Programmable Gate Arrays (FPGAs) as high performance computing engines [32][33][34]. Mostly, FPGAs are envisioned as a high performance co-processor to a programmable host processor for compute intensive applications [35][36]. A major disadvantage of FPGAs is higher power consumption than an Application Specific Integrated Circuit (ASIC) counterpart of the same logic. FPGAs are also limited by the on-chip logic, resulting in scalability issues in high performance computing applications.
To overcome the shortcomings of the existing architectures in exploiting the parallelism available in DLA computations, we take the route of algorithm-architecture co-design, where we ensure high performance along with energy efficiency, area efficiency, and scalability.
3 BLAS Realization on Off-the-shelf Processors
GEMM (Level-3 BLAS) and GEMV (Level-2 BLAS) are the most prominent routines in many engineering and scientific computations. These routines also have pedagogical importance due to their simple nature and are often used to evaluate emerging architectures. In this section, we first discuss the GEMM algorithm and then examine some of the recent realizations of GEMM. Based on the analysis of the profiling of the BLAS routines, we arrive at the pitfalls in extracting performance out of GEMM on contemporary multicore and GPU platforms. We further decide to design our own customized platform that is capable of extracting performance in BLAS by efficiently exploiting the parallelism in BLAS.
3.1 GEMM and GEMV Algorithms
Pseudo codes for GEMM and GEMV are described in algorithms 1 and 2 respectively. The GEMM algorithm has three loops and hence belongs to Level-3 BLAS while GEMV belongs to Level-2 BLAS. For multiplying two matrices of size n×n, GEMM takes n³ multiplications and n²(n−1) additions, while GEMV takes n² multiplications and n(n−1) additions. Typically, GEMM and GEMV exhibit Instruction Level Parallelism (ILP) and Data Level Parallelism (DLP). GEMM also exhibits significant data locality and is capable of sustaining a high computation-to-communication ratio. Altogether, if exploited efficiently and accelerated, GEMM and GEMV become ideal candidates to be used as basic building blocks for many high performance scientific applications. Since GEMM has three nested loops, these loops can be permuted to change the access pattern of the input matrices as shown in table I.
Loop Order  Inner Loop  Middle Loop  Inner Loop Data Access

ijk  dot product  row gaxpy  A by row, B by column
jik  dot product  column gaxpy  A by row, B by column
ikj  saxpy  row gaxpy  B by row, C by row
jki  saxpy  column gaxpy  A by column, C by column
kij  saxpy  row outer product  B by row, C by row
kji  saxpy  column outer product  A by column, C by column
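As an illustration of table I, the following Python sketch (ours, not from the paper) implements two of the six orderings; both compute the same C += A·B, but they touch memory in different patterns:

```python
def gemm_ijk(A, B, C):
    """C += A*B, ijk ordering: the inner loop is a dot product;
    A is accessed by row and B by column."""
    n = len(A)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def gemm_kij(A, B, C):
    """C += A*B, kij ordering: the inner loop is a saxpy on a row of C
    (a row outer-product update); B and C are accessed by row."""
    n = len(A)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C
```

Both orderings perform the same multiply-add operations; only the access pattern, and hence the cache behavior, differs.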
3.2 Performance Evaluation of GEMM and GEMV
For contemporary architectures, highly efficient GEMM is realized as a subroutine in BLAS. There exist several vendor specific realizations of DGEMM. For our experiments, we take DGEMM available in BLAS from The Netlib, and for evaluation on GPU we use MAGMA_DGEMM. We compile DGEMM for different Intel and AMD machines with different compiler options and evaluate the performance of DGEMM on these architectures. We evaluate MAGMA_DGEMM on a Tesla C2050.
Figure 2(a) depicts the CPI of DGEMM when executed on the Intel Haswell and AMD Bulldozer microarchitectures. For the experimental results shown in figure 2(a), we have used BLAS and CBLAS^1 available in The Netlib, compiled with publicly available compilers. (^1 CBLAS consists of C wrappers written around BLAS, where BLAS is written in Fortran.) It can be observed in figure 2(a) that the CPI of DGEMM saturates for both Intel's Haswell and AMD's Bulldozer. Matrices that fit in the cache achieve a lower CPI than matrices that do not, since cache misses for the larger matrices drive the attained CPI slightly higher^2. (^2 Just to re-emphasize: in case of CPI, lower is better.) It can be observed that with DGEMM, which is a highly optimized routine of BLAS, the CPI achieved is nowhere close to the lower bound of the CPI of the architecture. A similar trend is observed when we consider Gflops as a performance metric, as shown in figure 2(b).
It can be observed in figure 2(b) that, for the matrices that fit in the cache memory, the attained Gflops is higher. For the larger matrices that do not fit in the cache memory, Gflops decreases due to cache misses. While these architectures have a much higher theoretical peak, the attained performance is only 10-11% of the peak performance.
One way to improve performance is to use the vendor specific compilers, since vendor specific compilers perform architecture aware optimizations in the programs. In order to further push the performance of DGEMM on Intel Haswell microarchitecture we use Intel C Compiler (icc) for compiling DGEMM routine in BLAS. Performance improvement in CPI and Gflops is shown in figures 2(c) and 2(d) respectively.
It can be observed in the figures 2(c) and 2(d) that the performance improvement in DGEMM is still far from the lower bound of the CPI and peak Gflops.
In the next set of experiments, we add an additional compiler switch while compiling with icc. Performance improvement due to these switches is shown in figures 2(e) and 2(f).
It can be observed from figures 2(e) and 2(f) that the compiler switch improves performance, and finally we are able to achieve 15-17% of the peak CPI and peak Gflops for DGEMM. The percentage of peak performance achieved is 4-5% for DGEMV and 55-57% for DGEMM on the Tesla C2050 as depicted in figure 2(g), while the percentage of peak performance achieved on the Intel and Nvidia machines for DGEMV and DGEMM ranges from 5% to 57% as shown in figure 2(h). Considering Gflops/watt as a performance parameter, DGEMV and DGEMM in BLAS achieve 0.14 Gflops/watt and 0.25 Gflops/watt respectively, while MAGMA_DGEMV and MAGMA_DGEMM achieve from 0.03 Gflops/watt to 0.225 Gflops/watt as shown in figure 2(i). One more observation we make from figure 2(e) and VTune™ is that use of the compiler switch reduces the number of instructions by half. This is because of the use of FMA instructions in the generated assembly code. The reduction in the number of instructions leads to an increase in the CPI measured by VTune™. Though there is an increase in CPI, performance in terms of Gflops is observed to improve as shown in figure 2(f). Hence, CPI measured by VTune™ cannot be considered a correct measure of performance. We define the terms Cycles-per-Flops (CPF), to be used instead of CPI, and Flops-per-Cycle (FPC), to be used instead of IPC, as follows:
CPF = (total clock cycles consumed) / (total floating point operations performed)        (1)
We define FPC as follows:

FPC = (total floating point operations performed) / (total clock cycles consumed)        (2)
CPF and FPC help us to evaluate the performance of architectures and algorithms more effectively. This is because the granularity of the compute resources considered in CPF and FPC is at the level of a floating point operation and not at the level of a Fused Multiply-Add (FMA) instruction.
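As a small illustration of these metrics, the following Python sketch (our own, with a hypothetical cycle count) computes CPF and FPC from the flop count of a square GEMM:

```python
def flops_gemm(n):
    """Flop count of C = A*B for n x n matrices:
    n^3 multiplications plus n^2*(n-1) additions."""
    return n**3 + n**2 * (n - 1)

def cpf(total_cycles, total_flops):
    """Cycles-per-Flops: lower is better."""
    return total_cycles / total_flops

def fpc(total_cycles, total_flops):
    """Flops-per-Cycle: higher is better."""
    return total_flops / total_cycles
```

For example, a 2×2 GEMM performs 12 flops; a (hypothetical) machine taking 24 cycles for it would have CPF = 2.0 and FPC = 0.5, regardless of how many flops each instruction packs.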
Across the experiments, we observe that significant efforts are needed to improve the performance of DGEMV and DGEMM on contemporary architectures, and yet the attained performance is not satisfactory. We address this challenge of extracting performance from DLA computations through algorithm-architecture co-design in the subsequent sections of this paper.
4 Analysis of BLAS and CFU
In this section, we present graph based analysis of several Level-1, Level-2, and Level-3 BLAS. We choose several representative routines in all three levels of BLAS^3. (^3 Due to availability of a double precision floating point unit, we consider only routines with the prefix "d"; for example, the "d" in DGEMM denotes double precision.)
4.1 Vector Operations (Level-1 BLAS)
Level-1 BLAS typically has O(n) operations for a vector of size n, and the data movement required is also O(n). We analyze representative vector operations such as the inner product (ddot) and the vector 2-norm (dnrm2) here. Figure 3 represents the inner product of two vectors given by equation 3.
c = x^T y = Σ_{i=1}^{n} x_i y_i        (3)

where x, y ∈ R^n and c is a scalar. The DAG for ddot is shown in figure 3.
The ddot routine has applications in matrix-vector and matrix-matrix multiplication.
The vector 2-norm (dnrm2) is given by

||x||_2 = sqrt(x^T x)        (4)

||x||_2 = sqrt(Σ_{i=1}^{n} x_i^2)        (5)
It can be observed from the DAGs of ddot and dnrm2 that the DAGs of these two routines are similar except for the presence of a square root in the dnrm2 routine. dnrm2 can be realized with the same multiplier and adder resources as ddot. It can also be observed in figure 3 that the first level in the DAGs consists of multiplications, and all these multiplications can potentially be executed in parallel. The next levels of the DAGs of ddot and dnrm2 are additions. Additions in each level of the DAGs can be performed simultaneously if the inputs are available.
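This DAG structure can be mimicked in software; the sketch below (ours, not reference BLAS code) evaluates ddot as one level of independent multiplications followed by a tree of additions, and reuses the same structure for the 2-norm:

```python
import math

def ddot_tree(x, y):
    """Inner product evaluated as the DAG in the text: one level of
    independent multiplications followed by log2(n) levels of additions."""
    partial = [a * b for a, b in zip(x, y)]   # level 1: parallel multiplies
    while len(partial) > 1:
        if len(partial) % 2:                  # pad odd-length levels
            partial.append(0.0)
        partial = [partial[i] + partial[i + 1]
                   for i in range(0, len(partial), 2)]
    return partial[0]

def dnrm2_tree(x):
    """Same DAG as ddot(x, x) with a square root appended at the end."""
    return math.sqrt(ddot_tree(x, x))
```

Each level of the while loop corresponds to one level of adders in the DAG, so the critical path is one multiply plus ceil(log2(n)) additions when the hardware is wide enough.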
4.2 Matrix-vector Operations (Level-2 BLAS)
Matrix-vector operations typically take O(n²) operations for an input matrix of size n×n and a vector of size n, and the data movement required is also O(n²). In our analysis we consider the double precision matrix-vector multiplication (DGEMV) routine of BLAS. DGEMV is given in equation 6.

y := α·A·x + β·y        (6)

where A is a matrix of size n×n and x and y are vectors of size n. The DGEMV routine has n² multiplications and n(n−1) additions for the product Ax, and n further additions to compute the final vector y. In our DAG analysis we consider matrix-vector multiplication since it is the compute intensive part of the routine, and it is essential to exploit parallelism in matrix-vector multiplication to accelerate the DGEMV routine. DAGs for matrix-vector multiplication are shown in figure 4.
It can be observed from the DAGs of matrix-vector multiplication that all the multiplications can be executed in parallel and that matrix-vector multiplication can be realized as a series of ddot routine calls.
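As a sketch of this decomposition (our own illustration, not the reference BLAS implementation), DGEMV can be written as one inner product per row of A:

```python
def dgemv(alpha, A, x, beta, y):
    """y := alpha*A*x + beta*y, with each row of A*x computed as one
    inner-product (ddot) call, mirroring the DAG decomposition above."""
    ddot = lambda u, v: sum(a * b for a, b in zip(u, v))
    return [alpha * ddot(row, x) + beta * yi for row, yi in zip(A, y)]
```

Each of the n ddot calls is independent, which is the parallelism a wide datapath (or multiple PEs) can exploit.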
4.3 Matrix-matrix Operations (Level-3 BLAS)
Here we first review some of the matrix multiplication algorithms and present graph based analysis of these algorithms. We briefly discuss parallelism in different matrix multiplication algorithms. Based on the analysis, we choose a matrix multiplication algorithm. In the following subsection of the paper, we present the design of a PE that can efficiently exploit ILP in the chosen matrix multiplication algorithm.
4.3.1 Matrix Multiplication Algorithms
Over the years there have been several matrix multiplication algorithms proposed in the literature. In this subsection, we review and analyze Strassen's Matrix Multiplication (SMM), Winograd's Matrix Multiplication (WMM), and General Matrix Multiplication (GEMM). We consider A and B as input matrices and C as the output matrix, where A_ij, B_ij, and C_ij (i, j ∈ {1, 2}) are equal-sized block matrices.
4.3.2 Strassen’s Matrix Multiplication
SMM algorithm is described in table II for 2×2 block matrices.

Decompose Step  Merge Step

M1 = (A11 + A22)(B11 + B22)  C11 = M1 + M4 − M5 + M7
M2 = (A21 + A22) B11  C12 = M3 + M5
M3 = A11 (B12 − B22)  C21 = M2 + M4
M4 = A22 (B21 − B11)  C22 = M1 − M2 + M3 + M6
M5 = (A11 + A12) B22
M6 = (A21 − A11)(B11 + B12)
M7 = (A12 − A22)(B21 + B22)
Typically, SMM has two steps: 1) a decompose step, and 2) a merge step. In the decompose step, the matrix is divided into 2×2 block matrices and M1 to M7 are computed. In the merge step, C11 to C22 are computed. Directed Acyclic Graphs (DAGs) for SMM are shown in figure 5. It can be observed in the DAGs of SMM that the computation of C11 to C22 depends on the computations of M1 to M7.
These dependencies in the DAGs of SMM result in higher execution time of SMM. It can also be observed from the DAGs of SMM in figure 5 that M1 to M7 can potentially be executed in parallel and C11 to C22 can also be executed in parallel. One way this parallelism can be exploited is through the processor pipeline and pipelined arithmetic units. The asymptotic complexity of SMM is O(n^{log2 7}) ≈ O(n^{2.807}).
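One level of the SMM recursion can be sketched as follows (a scalar 2×2 illustration of the standard Strassen formulas; in the algorithm proper, each entry would itself be a block matrix and each product a recursive call):

```python
def strassen_2x2(A, B):
    """One level of Strassen's recursion on 2x2 matrices: the seven
    products M1..M7 are mutually independent (computable in parallel),
    and the four merge-step sums C11..C22 depend only on the Mi."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    # Decompose step: 7 multiplications instead of 8.
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    # Merge step: depends on all Mi, which creates the DAG dependency.
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]
```

The merge step cannot start before the decompose step completes, which is exactly the dependency chain visible in figure 5.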
4.3.3 Winograd’s Matrix Multiplication
WMM algorithm operates on the same principle as SMM, as shown below in the standard Strassen-Winograd formulation:

S1 = A21 + A22    T1 = B12 − B11    M1 = A11 B11    U2 = M1 + M6
S2 = S1 − A11     T2 = B22 − T1     M2 = A12 B21    U3 = U2 + M7
S3 = A11 − A21    T3 = B22 − B12    M3 = S4 B22     U4 = U2 + M5
S4 = A12 − S2     T4 = T2 − B21     M4 = A22 T4     C11 = M1 + M2
                                    M5 = S1 T1      C12 = U4 + M3
                                    M6 = S2 T2      C21 = U3 − M4
                                    M7 = S3 T3      C22 = U3 + M5
It can be observed from the WMM algorithm that it takes 7 block matrix multiplications and 15 matrix additions, unlike SMM where the number of block matrix multiplications is the same but the number of matrix additions is 18. DAGs for WMM are shown in figure 5.
WMM has the same asymptotic complexity as SMM. In practical scenarios, the execution time of WMM is observed to be slightly less than that of SMM due to fewer additions.
4.3.4 General Matrix Multiplication
GEMM for multiplication of A and B can be described by the following expression:

C_ij = Σ_{k=1}^{n} A_ik B_kj,  1 ≤ i, j ≤ n

DAGs for GEMM are shown in figure 5. It can be observed from the DAGs of GEMM that it takes n³ multiplications and n²(n−1) additions. The asymptotic complexity of GEMM is O(n³).
SMM and WMM have lower asymptotic complexities compared to GEMM. A major disadvantage of SMM and WMM is that they are more suitable for square matrices whose size is a power of two. For matrix sizes where this condition is not met, a complex matrix partitioning scheme is required. Hence, we adopt GEMM over SMM and WMM for the following reasons:

The complex partitioning scheme required for the matrices in SMM and WMM results in an intricate scheduling scheme for the blocks of the input matrices. A way to alleviate these complications is to pad the matrices. This padding results in a few more computations. The padding does not reduce the complexity of the implementation since a naive padding scheme is not efficient

GEMM has higher pedagogical importance than SMM and WMM. GEMM is highly preferred for evaluating emerging architectures over SMM and WMM due to its simple structure and ease of implementation
4.3.5 Anatomy of General Matrix Multiplication
To discuss the available parallelism in GEMM, we take a small matrix multiplication as an example.
DAGs for algorithm 1 are shown in figure 6 for computation of the elements of C. It can be observed in figure 6 that all the multiplications in a block of the matrix can be computed in parallel. The dependencies are due to accumulation of the multiplied elements of the input matrices.
Potentially, in the multiplication of two matrices, all the multiplications can be computed in parallel, and all the elements of the resultant matrix can be computed in parallel as shown in figure 6. The accumulation process while computing the elements of the resultant matrix enforces dependencies resulting in pipeline stalls. These pipeline stalls can be avoided by computing multiple elements in a parallel-pipeline manner.
Algorithm 3 depicts DGEMM with block matrix multiplication (assuming that the matrix dimensions are a multiple of 4). In algorithm 3, BLOCK4MUL is multiplication of matrices of size 4×4, and BLOCK4ADD is addition of matrices of size 4×4. A pitfall in the unrolling scheme is the pressure on locally available registers. Typically, an n×n matrix, if fully unrolled, requires O(n²) registers. Hence, a large matrix cannot be fully unrolled due to lack of locally available registers, but a small block of the matrix can be unrolled to exploit the fine grained parallelism in the block through the processor pipeline and pipelined arithmetic units. In our experiments with the PE explained in section 4.4, we have adopted a conservative approach where we have assumed space for intermediate results in the local registers, and hence we have considered a 4×4 block matrix with 64 registers of 64-bit width. In parallel realization of GEMM, different blocks of C can be computed in parallel on different PEs. While realizing GEMM on a single PE, we try to exploit the parallelism that is available in a block of C, and in parallel realization on REDEFINE, we try to exploit parallelism across the blocks.
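A software sketch of this blocking scheme (our own illustration, assuming a block size of 4 and dimensions that are multiples of 4, mirroring BLOCK4MUL/BLOCK4ADD) is:

```python
def block4_dgemm(A, B):
    """Blocked C = A*B for n a multiple of 4: each 4x4 block of C is
    accumulated from 4x4 block products, which is the unit of unrolling
    that fits in the register file."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for bi in range(0, n, 4):          # block row of C
        for bj in range(0, n, 4):      # block column of C
            for bk in range(0, n, 4):  # reduction over blocks
                # BLOCK4MUL + BLOCK4ADD: a fully unrollable 4x4 update
                for i in range(4):
                    for j in range(4):
                        s = 0.0
                        for k in range(4):
                            s += A[bi + i][bk + k] * B[bk + k][bj + j]
                        C[bi + i][bj + j] += s
    return C
```

The three innermost loops are fixed at 4 iterations, so in an actual kernel they can be fully unrolled into straight-line code over the register file, while the outer block loops provide the independent work distributed across PEs.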
In the next section we present a PE design that skillfully exploits the parallelism that exists in a block of the matrix.
4.4 Processing Element Design
For the initial design of the PE, we consider a classical sequential architecture model. As a first design, we take fully pipelined double precision floating point adder and multiplier arithmetic units as compute resources. The architecture of the PE is shown in figure 7.
As shown in figure 7, the initial design of the PE consists of an Instruction Memory, a Decoder to decode the instructions, a Register File with 64 registers, and a pipelined double precision Floating Point Unit (FPU) [39][40]. The FPU consists of a multiplier, an adder, a square root unit, and a divider. For computing matrix multiplication with large matrices, we choose 4×4 as the block size. For matrices whose dimensions are not a multiple of 4, we partition them into 4×4 blocks as many times as possible, and for the residual matrix we perform unblocked multiplication. Operation of the PE can be described in the following steps:

Step 1: Bring the input matrices to the Register File by sending Load requests to the upper level of the memory

Step 2: Perform matrix multiplication

Step 3: Store back the resultant matrix to the upper level of memory
4.5 Simulation Environment and Initial Results
For simulations we connect the PE shown in figure 7 to external/global memory. Initially we use 64 (64-bit wide) registers and 16KB of instruction memory for our experiments. We model global/external memory delay by placing a multi-stage pipelined delay.
4.5.1 Initial Results
For our experiments, we choose five representative matrix sizes.
Matrix Size  Experimental Latency (clock cycles)  CPF  Gflops/W

  39000  1.625  16.66
  310075  1.614  16.87
  1040754  1.606  17.15
  2457600  1.6  17.25
  4770000  1.59  17.38
It can be observed in table IV that as we increase the matrix size, the CPF decreases and saturates around 1.6, while performance in terms of Gflops/watt is observed to be 17.3 Gflops/watt at 0.2 GHz. In other words, as we increase the matrix size, the FPC saturates at 0.625 floating point operations per cycle. Here, since we can potentially compute one multiplication and one addition in parallel, the peak FPC = 2.
In this section, we reviewed some of the matrix multiplication techniques. We discussed asymptotic complexity and graph based analysis of three different matrix multiplication algorithms. We justified our choice of GEMM over the SMM and WMM algorithms. We presented additional details of GEMM with a small example and proposed an initial design of a PE that achieves a CPF of about 1.6 and performance of 17.3 Gflops/watt. Intuitively, performance of the PE can be improved methodically by enriching the PE with compute and memory resources.
5 Microarchitectural Enhancements in PE and Parallel Realization on REDEFINE
Based on the anatomy of GEMM and the design of the PE presented in section 4.4, in this section we dwell on microarchitectural enhancements of the PE. We methodically enhance the PE to improve CPF. The PE described in section 4.4 is considered as a Floating Point Sequencer (FPS) in this section.
5.1 Introducing Local Memory and Load-Store CFU
Major drawbacks of the PE design presented in section 4.4 are the absence of overlap between computation and communication and the lack of exploitation of data locality in GEMM. To address these issues, we introduce a Load-Store CFU that operates simultaneously with the FPS (depending on availability of data) and facilitates overlap of computation and communication. We also place a Local Memory (LM) in the Load-Store CFU to exploit data locality in GEMM. The enhanced PE design with FPS and Load-Store CFU is shown in figure 8.
Introduction of an LM in the Load-Store CFU inside the PE improves data locality. This improved data locality creates further opportunities for exploitation of higher ILP by increasing compute resources in the FPS. Increased compute resources in the FPS demand improvement in data availability in the Register File for computations. In this section we present methodical architectural enhancements in the PE for reduction in latency^4. (^4 Latency in terms of clock cycles in execution of GEMM.) These enhancements in the PE ensure lower latency in execution of GEMM, leading to overlap between computation and communication of up to 90% in GEMM, and we are also able to achieve up to 74% of the peak CPF^5. (^5 Peak CPF = 0.5, since the peak FPC = 2.) To highlight the reduction in latency due to each architectural enhancement, we consider five representative matrix sizes for our experiments. Reduction in latency due to introduction of the Load-Store CFU and LM (refer figure 8) is shown in table V.
Matrix Size  
Latency (in clock cycles) without LM  39000  312075  1040754  2457600  4770000 
Latency (in clock cycles) with LM  23000  178471  595421  1410662  2730365 
Improvement in Latency in terms of percentage  41%  42.5%  42.78%  42.6%  42.6% 
Gflops/watt  14.87  15.53  15.77  15.81  15.98 
It can be observed in table V that introduction of the LM in the Load-Store CFU reduces latency by roughly 42%, and as we increase the matrix size the performance improves due to improved data locality.
5.2 Special Instructions
In the first enhancement, we try to exploit higher ILP by increasing resources in the FPS, which in turn improves performance significantly. This improved performance motivates us to improve data availability in the Register File residing inside the FPS. We introduce two types of special instructions: 1) the DOT instruction, executed in the FPS on a specialized fully pipelined hardware structure, and 2) Block Data Load and Block Data Store instructions, executed in the Load-Store CFU.
5.2.1 DOT Instruction
Since we support a block size of 4×4, we introduce hardware that can perform the inner product of a 4-element vector. The hardware structure to compute the 4-element vector inner product is shown in figure 9. We further make this hardware structure reconfigurable to support shorter vector inner products for different matrix sizes. We name this unit a Reconfigurable Datapath (RDP). Through reconfiguration, the RDP can be recast to perform macro operations encountered in some of the algorithms in BLAS discussed in section 4.
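The DOT4 configuration can be sketched as four parallel multipliers feeding a two-level adder tree (a software model of the datapath, ours, not the RTL):

```python
def dot4(x, y):
    """DOT4 configuration of the RDP: four parallel multiplies feed a
    two-level adder tree, giving a short fixed-depth pipeline."""
    p0, p1, p2, p3 = (x[i] * y[i] for i in range(4))  # level 1: multipliers
    s0, s1 = p0 + p1, p2 + p3                         # level 2: adders
    return s0 + s1                                    # level 3: final adder
```

Because the three levels are independent stages, successive dot4 invocations can overlap in a pipelined datapath, which is what keeps multiple DOT4 instructions in flight.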
For larger matrices that do not fit in the Register File in the FPS, we use blocking for matrix multiplication. For matrices whose size is not a multiple of the block size, we use the shorter inner-product configurations of the RDP. In this exposition, we restrict our experiments to matrices whose size is a multiple of the block size, and hence we use the 4-element inner-product configuration (also termed the DOT4 configuration) of the RDP. The DOT4 configuration of the RDP has a deep pipeline; assuming no pipeline stalls, we can potentially keep several DOT4 instructions in a state of execution. The DOT4 instruction leads to exploitation of higher ILP within a block of GEMM. The improvement in latency of GEMM due to the DOT4 instruction is shown in table VI.
| Matrix Size | | | | | |
|---|---|---|---|---|---|
| Latency (clock cycles) | 15251 | 113114 | 371699 | 877124 | 1696921 |
| Improvement over table V | 33.7% | 36.6% | 37.57% | 37.82% | 37.85% |
| Gflops/watt | 10.52 | 11.49 | 11.85 | 11.93 | 12.06 |
It can be observed from table VI that as we increase the matrix size, the benefit due to the DOT instruction improves. This is due to improved exploitation of ILP in the FPS.
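In software terms, the DOT4 configuration computes a 4-element inner product as one fused macro operation. The C sketch below is purely illustrative (the function name and structure are ours; in the PE this is a fully pipelined hardware datapath inside the RDP, not a software routine):

```c
/* Illustrative model of the DOT4 macro operation: one fused
 * 4-element inner product. In the RDP this maps to four parallel
 * multipliers feeding a two-level adder tree, all fully pipelined. */
static double dot4(const double a[4], const double b[4])
{
    /* Multiply stage: four products computed in parallel in hardware. */
    double p0 = a[0] * b[0];
    double p1 = a[1] * b[1];
    double p2 = a[2] * b[2];
    double p3 = a[3] * b[3];
    /* Reduction stage: two-level adder tree. */
    return (p0 + p1) + (p2 + p3);
}
```

Each output element of a small GEMM block then reduces to a chain of such DOT4 operations, which is what allows several DOT4 instructions to be kept in flight in the pipeline.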
5.2.2 Block Data Load and Block Data Store Instructions
We further aim to reduce handshaking between the LM and the GM. This reduction in handshaking in turn improves data availability in the Register File. To reduce the handshaking between the PE and the next level of the memory hierarchy, we introduce instructions that load/store data in blocks. The performance improvement due to Block Data Load and Block Data Store is shown in table VII, where a fixed block size is used for the transfers.
| Matrix Size | | | | | |
|---|---|---|---|---|---|
| Latency (clock cycles) | 12745 | 97136 | 324997 | 784838 | 1519083 |
| Improvement over table VI | 16.4% | 14.1% | 12.5% | 10.51% | 10.48% |
| Gflops/watt | 12.59 | 13.38 | 13.56 | 13.33 | 13.47 |
It can be observed from table VII that as we increase the matrix size, the benefit due to Block Data Load/Store does not improve; rather, the performance saturates. This is because the block size is constant across all matrix sizes, and supporting a larger block size is not possible due to the limited number of registers in the Register File of the FPS. It can also be observed in table VII that the latency gap between consecutive matrix sizes decreases and is likely to saturate at some point; further experiments show that the gap does saturate for larger matrix sizes.
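The intent of Block Data Load/Store can be sketched by contrasting per-element handshakes with per-block handshakes. The C model below is ours, with an assumed block size of 4 (the exact block size used in the PE is not restated here); the handshake counts are modeled, not measured:

```c
#include <string.h>

enum { BLOCK = 4 };  /* assumed block edge, for illustration only */

/* One handshake per element: n handshakes for n elements. */
static int elementwise_load(double *dst, const double *src, int n)
{
    int handshakes = 0;
    for (int i = 0; i < n; ++i) {
        dst[i] = src[i];
        ++handshakes;
    }
    return handshakes;
}

/* Block Data Load: one handshake moves a whole block, so only
 * n/BLOCK handshakes are needed (assumes BLOCK divides n). */
static int block_load(double *dst, const double *src, int n)
{
    int handshakes = 0;
    for (int i = 0; i < n; i += BLOCK) {
        memcpy(dst + i, src + i, BLOCK * sizeof *dst);
        ++handshakes;
    }
    return handshakes;
}
```

With a fixed BLOCK, the handshake reduction factor is constant regardless of matrix size, which is consistent with the saturating improvement seen in table VII.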
5.3 Bandwidth Increase
Increased resources in the FPS improve performance substantially, and reduced handshaking between the LM and the upper level of the memory improves performance by about 10%. We still see a significant gap between our desired performance and the attained performance. The reason for this gap is mainly under-utilization of the RDP configured as DOT4. To improve the utilization of the RDP, in this architectural enhancement we increase the bandwidth between the FPS and the Load-Store CFU in proportion to the block size supported in the FPS, transferring more bits per cycle between the FPS and the Load-Store CFU than in the previous realization. Communication between the FPS and the Load-Store CFU at this higher rate ensures better data availability in the Register File of the FPS. The performance improvement due to the increase in bandwidth is shown in table VIII.
| Matrix Size | | | | | |
|---|---|---|---|---|---|
| Latency (clock cycles) | 7079 | 52624 | 174969 | 422924 | 818178 |
| Improvement over table VII | 44.4% | 45.8% | 46.1% | 46.12% | 46.14% |
| Gflops/watt | 22.67 | 24.71 | 25.19 | 24.95 | 25.02 |
It can be observed in table VIII that as we increase the matrix size, the benefit due to the increased bandwidth between the FPS and the Load-Store CFU in the PE improves. This is mainly because of better utilization of the RDP (here configured as DOT4).
5.4 Prefetching
To improve the utilization of RDP further in the FPS, we restructure the loop in GEMM. We rewrite algorithm 1 as algorithm 4.
Restructuring the loops in the algorithm allows us to prefetch the matrix block required for the next iteration. This results in better exploitation of the FPU pipeline through reduced instruction stalls in the FPS, as shown in figure 10. The shaded portion of figure 10 depicts the reduction in instruction stalls in the FPS when the block of the matrix required for the next iteration's computation is prefetched. Figure 10 has two portions: 1) before prefetching, and 2) after prefetching. Arrows in the figure depict the execution of different types of operations, such as computations in the FPS and loading/storing of data from/to the GM (or EM).
| Matrix Size | | | | | |
|---|---|---|---|---|---|
| Latency (clock cycles) | 5561 | 38376 | 124741 | 298161 | 573442 |
| Improvement over table VIII | 21.44% | 27.07% | 28.70% | 29.5% | 29.9% |
| Gflops/watt | 28.86 | 33.88 | 35.33 | 35.11 | 35.70 |
The improvement attained by prefetching is shown in table IX. It can be observed in table IX that as we increase the matrix size, the benefit due to prefetching increases. This is mainly because of improved data availability in the Register File of the FPS.
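The loop restructuring amounts to software double buffering: the blocks needed for iteration k+1 are fetched into a spare buffer while iteration k computes. The C sketch below uses assumed names and an assumed block edge of 4; in the PE the prefetch is issued to the Load-Store CFU and genuinely overlaps with FPS computation, whereas here the overlap is only structural:

```c
#include <string.h>

enum { BLK = 4 };  /* assumed block edge */

/* Copy a BLK x BLK block starting at (r, c) of an n x n row-major matrix. */
static void load_block(double dst[BLK*BLK], const double *M, int n, int r, int c)
{
    for (int i = 0; i < BLK; ++i)
        memcpy(dst + i*BLK, M + (r + i)*n + c, BLK * sizeof(double));
}

/* Cblk += Ablk * Bblk for BLK x BLK blocks (row-major). */
static void block_mac(double C[BLK*BLK], const double A[BLK*BLK],
                      const double B[BLK*BLK])
{
    for (int i = 0; i < BLK; ++i)
        for (int j = 0; j < BLK; ++j)
            for (int k = 0; k < BLK; ++k)
                C[i*BLK + j] += A[i*BLK + k] * B[k*BLK + j];
}

/* Blocked GEMM (C = A*B, n x n, BLK divides n) with double buffering:
 * blocks for iteration bk+BLK are loaded into the spare buffer while
 * iteration bk computes on the current one. */
void gemm_block_prefetch(int n, const double *A, const double *B, double *C)
{
    double bufA[2][BLK*BLK], bufB[2][BLK*BLK], bufC[BLK*BLK];
    for (int bi = 0; bi < n; bi += BLK)
        for (int bj = 0; bj < n; bj += BLK) {
            memset(bufC, 0, sizeof bufC);
            int cur = 0;
            load_block(bufA[cur], A, n, bi, 0);   /* prime first blocks */
            load_block(bufB[cur], B, n, 0, bj);
            for (int bk = 0; bk < n; bk += BLK) {
                int nxt = cur ^ 1;
                if (bk + BLK < n) {               /* prefetch next blocks */
                    load_block(bufA[nxt], A, n, bi, bk + BLK);
                    load_block(bufB[nxt], B, n, bk + BLK, bj);
                }
                block_mac(bufC, bufA[cur], bufB[cur]);
                cur = nxt;
            }
            for (int i = 0; i < BLK; ++i)         /* write back C block */
                memcpy(C + (bi + i)*n + bj, bufC + i*BLK,
                       BLK * sizeof(double));
        }
}
```

The key point is that the `load_block` calls for the next iteration are issued before the `block_mac` of the current iteration, so a hardware load unit can service them concurrently.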
Collectively, the reduction in execution cycles of GEMM as we perform the different architectural enhancements in the PE can be seen in figure 11(a). It can be clearly observed that we finally obtain speedups of 7x, 8.13x, and 8.34x for increasing matrix sizes [14].
It can be observed from figure 11(b) that as we perform the different architectural enhancements, the ratio of latency to total computations reduces. Denoting the latency (in clock cycles) by L and the total number of computations by C, this ratio is

    ratio = L / C    (7)

It can be observed in figure 11(b) that as we increase the matrix size, this ratio asymptotically approaches its lower bound, which corresponds to a complete overlap of computation and communication. Complete overlap of computation and communication is not possible in a real-life scenario, and hence the bound is never attained.
Figure 11(c) depicts CPF for the considered matrix sizes. It can be observed in figure 11(c) that as we perform the enhancements the CPF tends to decrease, and this trend is observed across all the matrices. Figure 11(d) depicts FPC (where FPC = 1/CPF). It can be observed from figure 11(d) that, as we perform enhancements in the PE, FPC improves dramatically. Although CPF and FPC are good measures of the performance of a PE, they do not convey how efficiently the compute resources in the PE are utilized.
The percentage of peak FPC attained in the PE after every enhancement is shown in figure 11(e). Figure 11(e) depicts an interesting trend: the attained fraction of peak FPC first reduces drastically and then, with further architectural enhancements, improves. In the first enhancement, where we place the Load-Store CFU with LM to overlap computation and communication, the achieved FPC saturates at 54% of the peak FPC. We strongly intend to break this saturation point, since 54% of the peak is not satisfactory performance. To break it, we further enhance the FPS with compute resources, which raises the theoretical peak FPC; this increase in peak FPC is due to the DOT4 instruction. At this point the achieved fraction of peak FPC reduces, as shown in figure 11(e), because of the increased compute resources. Our further enhancements help us improve the utilization of the increased resources in the FPS and achieve up to 74% of the peak FPC of the PE.
We presented methodical architectural enhancements to improve the performance of the PE. Through architectural customizations, we could break the saturation point at 54% and improve the performance of the PE. In other words, we showed that the performance of an algorithm can be improved by customizations specific to that algorithm; here we showed this with the example of GEMM, a Level-3 BLAS routine. Finally, 35.7 Gflops/watt is achieved in the PE through careful realization of Level-3 BLAS.
5.5 Parallel Realization of Level3 BLAS
For the parallel realization of DGEMM, our two different simulation environments are shown in figure 11(k). In figure 11(k), the shaded portion of the Tile array (except the last column) is used for the computations in which we realize DGEMM, while the last column is used for storing the input and output matrices. We use Octave to generate the input matrices. Similarly, in the second environment, a differently sized portion of the Tile array is used for realizing DGEMM.
In our experiments, we divide the output matrix into as many blocks as there are Tiles being used for computation, so that each participating Tile computes exactly one block. For example, if the output matrix is of size n × n and a t × t portion of the Tile array is used to compute DGEMM, the output matrix is divided into t × t block matrices, and a block of size (n/t) × (n/t) is computed in each Tile of REDEFINE.
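This block-to-Tile mapping can be sketched as a simple index calculation, assuming the Tile-array edge t divides the matrix edge n (the names below are illustrative, not part of REDEFINE's interface):

```c
/* Map element (i, j) of an n x n output matrix to the Tile that
 * computes it, for a t x t Tile array: the matrix is split into
 * t x t blocks of edge n/t, and block (bi, bj) goes to Tile (bi, bj). */
typedef struct { int row, col; } TileCoord;

static TileCoord tile_for_element(int i, int j, int n, int t)
{
    int edge = n / t;  /* block edge computed per Tile; assumes t divides n */
    TileCoord tc = { i / edge, j / edge };
    return tc;
}
```

For instance, with an 8 × 8 output matrix on a 4 × 4 Tile array, each Tile computes a 2 × 2 block of the result.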
The speedup attained in REDEFINE when Tile arrays of three different sizes are used is shown in figure 12. It can be observed from figure 12 that, for each Tile array, the speedup over the single-PE realization approaches the number of Tiles used for computation as we increase the matrix size; for the largest Tile array the speedup attained approaches 16. For small matrices, communication with the last column of the Tile array dominates over computation within the Tiles, since each Tile computes only a small block of the resultant matrix and, ignoring the coefficient of the highest-order term, the computation-to-communication ratio in each Tile grows with the block size. One more observation is that, as we increase the matrix size, the speedup in REDEFINE over a single PE saturates. This is because of the saturation in the parallelism exploited by the PE attached to each Tile of REDEFINE.
In this section, we presented results for the PE that we introduced in section 4.4. We use the estimation methodology presented in [31], [41], and [26] for a fair comparison of the platforms. As shown in figure 11(j), the performance of the PE is 40-140x better than Intel Core architectures and 7-139x better than Nvidia GPUs. Compared to an Altera FPGA, the PE is 10x better in terms of Gflops/watt, while compared to the ClearSpeed CSX700 it is almost 3x better.

6 Conclusion
While recent realizations for matrix computations focus on architectural customization for DLA, in this paper we presented a novel algorithm-architecture co-design approach for breaking the performance saturation point in BLAS, along with systematic enhancements in the micro-architecture for exploiting the underlying compute and memory resources more efficiently. In algorithm-architecture co-design, we first realized Level-3 BLAS on off-the-shelf processors and showed that the performance achieved by these processors in DGEMM is not satisfactory: with the best optimization efforts, performance on off-the-shelf Intel and AMD processors saturates at 15-17% of the theoretical peak at 65W, and at 55-57% in the Nvidia Tesla C2050 GPGPU. This unsatisfactory performance is mainly due to inefficiently exploited compute and memory resources of the underlying platform. We saw scope for breaking this performance saturation point with a custom architecture, performed an analysis of BLAS, and designed a PE that can execute BLAS routines at much higher Gflops/watt. We further enhanced this PE with several features so that its compute and memory resources are exploited efficiently, achieving 35.7 Gflops/watt, much higher than off-the-shelf Intel and AMD processors and GPGPUs. We attached this PE in REDEFINE for the parallel realization of BLAS and showed that the speedup achieved is commensurate with the number of Tiles used, demonstrating that our solution is scalable.
References
 [1] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK Users’ Guide, 3rd ed. Philadelphia, PA: Society for Industrial and Applied Mathematics, 1999.
 [2] E. Anderson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney, J. Du Croz, S. Hammerling, J. Demmel, C. Bischof, and D. Sorensen, "LAPACK: A portable linear algebra library for high-performance computers," in Proceedings of the 1990 ACM/IEEE Conference on Supercomputing, ser. Supercomputing '90. Los Alamitos, CA, USA: IEEE Computer Society Press, 1990, pp. 2–11. [Online]. Available: http://dl.acm.org/citation.cfm?id=110382.110385
 [3] J. Kurzak, P. Luszczek, A. YarKhan, M. Faverge, J. Langou, H. Bouwmeester, and J. Dongarra, “Multithreading in the plasma library,” 2013.
 [4] B. J. Smith, "R package magma: Matrix algebra on GPU and multicore architectures, version 0.2.2," September 3, 2010, [Online] http://cran.r-project.org/package=magma.
 [5] N. Farahini, S. Li, M. Tajammul, M. Shami, G. Chen, A. Hemani, and W. Ye, “39.9 gops/watt multimode cgra accelerator for a multistandard basestation,” in Circuits and Systems (ISCAS), 2013 IEEE International Symposium on, May 2013, pp. 1448–1451.
 [6] F. Bouwens, M. Berekovic, A. Kanstein, and G. Gaydadjiev, "Architectural exploration of the ADRES coarse-grained reconfigurable array," in Proceedings of the 3rd International Conference on Reconfigurable Computing: Architectures, Tools and Applications, ser. ARC'07. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 1–13. [Online]. Available: http://dl.acm.org/citation.cfm?id=1764631.1764633

 [7] F. A. G. N., S. K. Nandy, and R. Narayan, "Efficient QR Decomposition Using Low Complexity Columnwise Givens Rotation (CGR)," in VLSI Design 2014, pp. 258–263.
 [8] J. York and D. Chiou, "On the asymptotic costs of multiplexer-based reconfigurability," in The 49th Annual Design Automation Conference 2012, DAC '12, San Francisco, CA, USA, June 3-7, 2012, pp. 790–795. [Online]. Available: http://doi.acm.org/10.1145/2228360.2228503
 [9] G. Ansaloni, P. Bonzini, and L. Pozzi, “Egra: A coarse grained reconfigurable architectural template,” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 19, no. 6, pp. 1062–1074, June 2011.
 [10] Z. E. Rákossy, F. Merchant, A. A. Aponte, S. K. Nandy, and A. Chattopadhyay, "Efficient and scalable CGRA-based implementation of columnwise givens rotation," in ASAP, 2014, pp. 188–189.
 [11] Z. E. Rákossy, F. Merchant, A. A. Aponte, S. K. Nandy, and A. Chattopadhyay, "Scalable and energy-efficient reconfigurable accelerator for columnwise givens rotation," in 22nd International Conference on Very Large Scale Integration, VLSI-SoC, Playa del Carmen, Mexico, October 6-8, 2014, pp. 1–6. [Online]. Available: http://dx.doi.org/10.1109/VLSISoC.2014.7004166
 [12] M. Alle, K. Varadarajan, A. Fell, R. R. C., N. Joseph, S. Das, P. Biswas, J. Chetia, A. Rao, S. K. Nandy, and R. Narayan, “REDEFINE: Runtime reconfigurable polymorphic asic,” ACM Trans. Embed. Comput. Syst., vol. 9, no. 2, pp. 11:1–11:48, Oct. 2009. [Online]. Available: http://doi.acm.org/10.1145/1596543.1596545
 [13] J. Nimmy, C. R. Reddy, K. Varadarajan, M. Alle, A. Fell, S. K. Nandy, and R. Narayan, “Reconnect: A noc for polymorphic asics using a low overhead single cycle router,” in ASAP, 2008, pp. 251–256.
 [14] F. Merchant, A. Maity, M. Mahadurkar, K. Vatwani, I. Munje, M. Krishna, S. Nalesh, N. Gopalan, S. Raha, S. Nandy, and R. Narayan, “Microarchitectural enhancements in distributed memory cgras for lu and qr factorizations,” in VLSI Design (VLSID), 2015 28th International Conference on, Jan 2015, pp. 153–158.
 [15] J. Dongarra, “The linpack benchmark: An explanation,” in Proceedings of the 1st International Conference on Supercomputing. London, UK, UK: SpringerVerlag, 1988, pp. 456–474. [Online]. Available: http://dl.acm.org/citation.cfm?id=647970.742568
 [16] A. YarKhan, J. Kurzak, and J. Dongarra, “Quark users’ guide: Queueing and runtime for kernels,” Innovative Computing Laboratory, University of Tennessee, Tech. Rep., 2011.
 [17] F. G. Van Zee, libflame: The Complete Reference. lulu.com, 2009.
 [18] P. Bientinesi, J. Gunnels, M. Myers, E. QuintanaOrtí, T. Rhodes, R. van de Geijn, and F. Van Zee, “Deriving dense linear algebra libraries,” Formal Aspects of Computing, vol. 25, no. 6, pp. 933–945, 2013. [Online]. Available: http://dx.doi.org/10.1007/s0016501102214

 [19] R. C. Whaley and A. Petitet, "Minimizing development and maintenance costs in supporting persistently optimized BLAS," Software: Practice and Experience, vol. 35, no. 2, pp. 101–121, February 2005, http://www.cs.utsa.edu/~whaley/papers/spercw04.ps.
 [20] R. C. Whaley, A. Petitet, and J. J. Dongarra, "Automated empirical optimization of software and the ATLAS project," Parallel Computing, vol. 27, no. 1–2, pp. 3–35, 2001, also available as University of Tennessee LAPACK Working Note #147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).
 [21] Q. Wang, X. Zhang, Y. Zhang, and Q. Yi, “Augem: Automatically generate high performance dense linear algebra kernels on x86 cpus,” in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC ’13. New York, NY, USA: ACM, 2013, pp. 25:1–25:12. [Online]. Available: http://doi.acm.org/10.1145/2503210.2503219
 [22] Z. Xianyi, W. Qian, and Z. Yunquan, “Modeldriven level 3 blas performance optimization on loongson 3a processor,” in Parallel and Distributed Systems (ICPADS), 2012 IEEE 18th International Conference on, Dec 2012, pp. 684–691.
 [23] L. Wang, W. Wu, Z. Xu, J. Xiao, and Y. Yang, "BLASX: A high performance Level-3 BLAS library for heterogeneous multi-GPU computing," in Proceedings of the 2016 International Conference on Supercomputing, ICS 2016, Istanbul, Turkey, June 1-3, 2016, pp. 20:1–20:11. [Online]. Available: http://doi.acm.org/10.1145/2925426.2926256
 [24] L. Djinevski, S. Ristov, and M. Gusev, “Superlinear speedup for matrix multiplication in gpu devices,” in ICT Innovations 2012. Springer Berlin Heidelberg, 2013, pp. 285–294.
 [25] F. Merchant, T. Vatwani, A. Chattopadhyay, S. Raha, S. K. Nandy, and R. Narayan, "Achieving efficient QR factorization by algorithm-architecture co-design of householder transformation," in 29th International Conference on VLSI Design, VLSID 2016, Kolkata, India, January 4-8, 2016.
 [26] A. Pedram, A. Gerstlauer, and R. A. van de Geijn, “Algorithm, architecture, and floatingpoint unit codesign of a matrix factorization accelerator,” IEEE Trans. Computers, vol. 63, no. 8, pp. 1854–1867, 2014. [Online]. Available: http://dx.doi.org/10.1109/TC.2014.2315627
 [27] T. Chen, R. Raghavan, J. Dale, and E. Iwata, "Cell broadband engine architecture and its first implementation: a performance view," IBM Journal of Research and Development, vol. 51, no. 5, pp. 559–572, Sept 2007.
 [28] S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, and K. Yelick, “Scientific computing kernels on the cell processor,” Int. J. Parallel Program., vol. 35, no. 3, pp. 263–298, Jun. 2007. [Online]. Available: http://dx.doi.org/10.1017/s1076600700345
 [29] Y. Nishikawa, M. Koibuchi, M. Yoshimi, A. Shitara, K. Miura, and H. Amano, "Performance analysis of ClearSpeed's CSX600 interconnects," in IEEE International Symposium on Parallel and Distributed Processing with Applications, ISPA 2009, Chengdu, Sichuan, China, 10-12 August 2009, pp. 203–210. [Online]. Available: http://doi.ieeecomputersociety.org/10.1109/ISPA.2009.102
 [30] Y. Nishikawa, M. Koibuchi, M. Yoshimi, K. Miura, and H. Amano, "Performance improvement methodology for ClearSpeed's CSX600," in 2007 International Conference on Parallel Processing (ICPP 2007), September 10-14, 2007, Xi'an, China, p. 77. [Online]. Available: http://dx.doi.org/10.1109/ICPP.2007.66
 [31] A. Pedram, S. Z. Gilani, N. S. Kim, R. A. van de Geijn, M. J. Schulte, and A. Gerstlauer, “A linear algebra core design for efficient level3 blas,” in ASAP, 2012, pp. 149–152.
 [32] J. Gonzalez and R. C. Núñez, “Lapackrc: Fast linear algebra kernels/solvers for fpga accelerators,” Journal of Physics: Conference Series, vol. 180, no. 1, p. 012042, 2009. [Online]. Available: http://stacks.iop.org/17426596/180/i=1/a=012042
 [33] Y.-G. Tai, C.-T. D. Lo, and K. Psarris, "Scalable matrix decompositions with multiple cores on FPGAs," Microprocessors and Microsystems, vol. 37, no. 8, Part B, pp. 887–898, 2013, Embedded Multicore Systems: Architecture, Performance and Application. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0141933112001007
 [34] S. Kestur, J. D. Davis, and O. Williams, “Blas comparison on fpga, cpu and gpu,” in Proceedings of the 2010 IEEE Annual Symposium on VLSI, ser. ISVLSI ’10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 288–293. [Online]. Available: http://dx.doi.org/10.1109/ISVLSI.2010.84
 [35] L. Zhuo and V. K. Prasanna, “Scalable and modular algorithms for floatingpoint matrix multiplication on reconfigurable computing systems,” IEEE Trans. Parallel Distrib. Syst., vol. 18, no. 4, pp. 433–448, Apr. 2007. [Online]. Available: http://dx.doi.org/10.1109/TPDS.2007.1001
 [36] P. Zicari, P. Corsonello, S. Perri, and G. Cocorullo, “A matrix product accelerator for field programmable systems on chip,” Microprocess. Microsyst., vol. 32, no. 2, pp. 53–67, Mar. 2008. [Online]. Available: http://dx.doi.org/10.1016/j.micpro.2007.05.002
 [37] G. H. Golub and C. F. Van Loan, Matrix computations (3rd ed.). Baltimore, MD, USA: Johns Hopkins University Press, 1996.
 [38] N. J. Higham, Accuracy and Stability of Numerical Algorithms, 2nd ed. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2002.

 [39] F. Merchant, N. Choudhary, S. K. Nandy, and R. Narayan, "Efficient realization of table lookup based double precision floating point arithmetic," in 29th International Conference on VLSI Design, VLSID 2016, Kolkata, India, January 4-8, 2016.
 [40] F. Merchant, A. Chattopadhyay, S. Raha, S. K. Nandy, and R. Narayan, "Accelerating BLAS and LAPACK via efficient floating point architecture design," CoRR, vol. abs/1610.08705, 2016. [Online]. Available: http://arxiv.org/abs/1610.08705
 [41] A. Pedram, R. A. van de Geijn, and A. Gerstlauer, “Codesign tradeoffs for highperformance, lowpower linear algebra architectures,” IEEE Trans. Computers, vol. 61, no. 12, pp. 1724–1736, 2012. [Online]. Available: http://doi.ieeecomputersociety.org/10.1109/TC.2012.132