1 Introduction
QR factorization plays a pivotal role in computing solutions of linear systems of equations, solving linear least squares problems, and computing eigenvalues. Such problems arise in domains ranging from navigation to wireless communication systems. For a matrix A of size m × n, QR factorization is given by

A = QR,    (1)

where Q is orthogonal and R is an upper triangular matrix [1]. There are several methods in the literature to perform QR factorization, namely Givens Rotation (GR), Householder Transform (HT), Modified Gram-Schmidt (MGS), and Cholesky QR. The following two real-life examples, Kalman Filtering (KF) and the QR algorithm, illustrate where QR factorization is used as a tool to solve computational problems.
Application 1: Kalman Filtering
KF is used in applications ranging from navigation to econometrics, since it filters out noisy measurements while simultaneously predicting the next state. A simplistic multidimensional KF is shown in figure 1. The KF starts with an initial state that contains a state matrix and a process covariance matrix. Based on the current state and the previous state, the next state is predicted as shown in figure 1. From the predicted state, the covariance matrix, and the measured input, the Kalman Gain (KG) is computed, and the KG is used to refine the predicted state. Using the KG and the predicted state, the error in the process is computed. The new state matrix and covariance matrix are the outputs of the iteration and become the previous state for the next iteration. An iteration of KF requires complex matrix operations ranging from matrix multiplication to computation of a numerically stable matrix inverse. One such classical GR-based numerically stable approach is presented in [2]. Since the matrix inverse is the most complex operation in KF (see figure 1), and computing the inverse through QR factorization is numerically stable, the proposed library-based approach is well suited for such applications.
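The predict-gain-update cycle described above can be sketched in a few lines. This is a generic textbook Kalman step (our own illustration; the function name `kalman_step` and the model matrices are hypothetical), not the GR/QR-based numerically stable formulation of [2]:

```python
import numpy as np

# Minimal linear Kalman filter iteration (illustrative sketch only).
def kalman_step(x, P, F, H, Q, R, z):
    # Predict the next state and covariance from the previous state.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Kalman gain; the matrix inverse here is the costly step that a
    # QR-based solve would replace for numerical stability.
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    # Update with the measurement z; outputs feed the next iteration.
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(P.shape[0]) - K @ H) @ P_pred
    return x_new, P_new

# Toy constant-velocity model: state = (position, velocity), observe position.
x0 = np.zeros(2)
P0 = np.eye(2)
F = np.array([[1.0, 1.0], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Qn = 0.01 * np.eye(2)
Rn = np.array([[1.0]])
x1, P1 = kalman_step(x0, P0, F, H, Qn, Rn, np.array([1.0]))
```

In a QR-based variant, the explicit `np.linalg.inv(S)` would be replaced by a factor-and-solve step, which is exactly where the QR factorization of equation 1 enters.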
Application 2: Eigenvalue Computation
Computation of eigenvalues is simplified by the QR algorithm, which is based on the QR factorization given in equation 1. Eigenvalue computation is shown in algorithm 1. As a result of QR iteration, the eigenvalues appear on the diagonal of the iterated matrix, while the columns of the accumulated orthogonal matrix are the eigenvectors [3]. The QR algorithm has gained immense popularity for computing eigenvalues and eigenvectors. Examples where eigenvalues and eigenvectors are useful include communication, where they are computed to determine the theoretical limit of a communication medium for information transfer; dimensionality reduction in principal component analysis for face recognition; and graph clustering [4][5]. Considering the important engineering and scientific applications of QR factorization, it is momentous to accelerate it. Traditionally, for scalable realization of QR factorization, a library-based approach is favored due to modularity and efficiency [6][7]. Graphical representations of the HT-based QR factorization (XGEQR2) and HT-based block QR factorization (XGEQRF) routines in the Linear Algebra Package (LAPACK) are shown in figure 2, where X stands for the double/single precision version of the routine. It can be observed in figure 2 that the routine XGEQR2 is dominated by General Matrix-vector (XGEMV) operations while XGEQRF is dominated by General Matrix-matrix (XGEMM) operations. The performance of XGEQRF is observed to be significantly higher than that of XGEQR2 due to highly optimized XGEMM operations. XGEMV and XGEMM are part of the Basic Linear Algebra Subprograms (BLAS). Typically, the performance of LAPACK routines can be measured relative to the performance of BLAS XGEMM, since in routines like XGEQRF (QR factorization), XGETRF (LU factorization with partial pivoting), and XPBTRF (Cholesky factorization), XGEMM is dominant and the performance achieved is usually 80% of that achieved by XGEMM on the underlying platform.
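As a concrete illustration of Application 2, a textbook unshifted QR iteration can be written as follows. This is our sketch, not the paper's algorithm 1; production implementations add shifts and deflation:

```python
import numpy as np

# Unshifted QR iteration for eigenvalues (textbook sketch).
def qr_algorithm(A, iters=500):
    Ak = A.copy()
    V = np.eye(A.shape[0])     # accumulates the orthogonal transforms
    for _ in range(iters):
        Q, R = np.linalg.qr(Ak)
        Ak = R @ Q             # similarity transform: Q^T Ak Q
        V = V @ Q
    return np.diag(Ak), V      # eigenvalues appear on the diagonal

# For a symmetric matrix the diagonal converges to the eigenvalues
# and the columns of V to the eigenvectors.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
vals, vecs = qr_algorithm(A)
```

Each pass performs one QR factorization, which is why accelerating the factorization directly accelerates eigenvalue computation.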
Contemporary multicore processors and General Purpose Graphics Processing Units (GPGPUs) are considered ideal platforms for efficient realization of BLAS and LAPACK. Multicores are optimized for sequential programs and are highly efficient in exploiting the temporal parallelism exhibited by the routines, while GPGPUs are more suitable for routines that exhibit spatial parallelism [8][9][10][11]. Experimentally, neither of these platforms is capable of exploiting the parallelism exhibited by the BLAS and LAPACK routines very efficiently. Moreover, routines in BLAS and LAPACK can be further examined for attaining a higher degree of parallelism. We quantify parallelism by depicting routines as Directed Acyclic Graphs (DAGs), and the average number of operations per level in the DAG is taken as a measure of the fine-grained parallelism exhibited by a routine. A higher average means more parallelism in the routine.
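As a toy illustration of this metric (our own sketch; the helper `ops_per_level` and the example DAG are not from the paper), the average operations per level of a dependence DAG can be computed as:

```python
from collections import defaultdict

# Quantify fine-grained parallelism as average operations per DAG level.
def ops_per_level(edges, num_nodes):
    # edges: (u, v) means operation v depends on operation u.
    level = [0] * num_nodes
    preds = defaultdict(list)
    for u, v in edges:
        preds[v].append(u)
    for v in range(num_nodes):          # nodes assumed topologically ordered
        if preds[v]:
            level[v] = 1 + max(level[u] for u in preds[v])
    depth = max(level) + 1              # number of levels in the DAG
    return num_nodes / depth            # average operations per level

# Two independent chains of 3 operations each: 6 ops over 3 levels.
edges = [(0, 1), (1, 2), (3, 4), (4, 5)]
avg = ops_per_level(edges, 6)
```

A fused algorithm that removes a level from the DAG raises this average, which is exactly the effect exploited by MHT later in the paper.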
For exploiting spatio-temporal parallelism in BLAS and LAPACK, domain-specific customizations are recommended in the platform executing these routines [12][13][14]. We choose REDEFINE as the platform for our experiments. REDEFINE is a Coarse-grained Reconfigurable Architecture (CGRA) in which several Tiles are connected through a Network-on-Chip (NoC) [15][16][17]. Each Tile contains a router for communication and a Compute Element (CE). CEs in REDEFINE can be enhanced with Custom Function Units (CFUs) specifically tailored for a domain of interest. CFUs can be Application Specific Integrated Circuits (ASICs), Reconfigurable Datapaths (RDPs), and/or micro/macro reprogrammable units [18][19]. In the presented approach we rearrange computations in LAPACK routines and, accordingly, amend the CFU presented in [19] so that it can efficiently exploit the parallelism exhibited by the modified routine. Thus our approach becomes algorithm-architecture co-design. Considering the importance of QR factorization in scientific computing, in this paper we focus on efficient realization of HT-based QR factorization through algorithm-architecture co-design. The contributions of this paper are as follows:

Firstly, we discuss evaluation of BLAS and LAPACK routines on Intel multicore processors and Nvidia GPGPUs. We identify limitations of these machines in exploiting the parallelism exhibited by the routines. It is shown that, even with highly optimized Dense Linear Algebra (DLA) software packages, contemporary multicores and GPGPUs are capable of achieving only 0.2-0.3 Gflops/watt.

The XGEQR2 routine of LAPACK, which computes HT-based QR factorization, is revisited, and it is identified that the computations in classical HT can be rearranged to exhibit a higher degree of parallelism, resulting in the Modified HT (MHT). Empirically, it is shown that the realization of MHT (DGEQR2HT) achieves better performance than the realization of classical HT (DGEQR2), and better or similar performance compared to DGEQRF in LAPACK. Through quantification of parallelism, it is shown theoretically that the parallelism available in MHT is higher than that exploited by contemporary multicores and GPGPUs. Realizations of MHT on multicore and GPGPU are presented and compared with state-of-the-art tuned software packages for DLA. Source code of our implementations on multicore and GPGPU is supplied with this exposition.

To exploit the available parallelism in MHT, we realize MHT on the Processing Element (PE) presented in [19]. We adopt the methodology presented in [20] and identify macro operations in the DAGs of MHT that can be realized on the RDP, resulting in 1.2-1.3x performance improvement over the classical HT realization on the PE. MHT is capable of achieving 99.3% of the theoretical peak performance achieved by DGEMM in the PE, as shown in figure 14(c), which is counter-intuitive since for multicore and GPGPU the performance achieved by DGEQRF is mostly 80-85% of the performance achieved by DGEMM.

Compared to multicore, GPGPU, and ClearSpeed CSX700, a 3-80x performance improvement in terms of Gflops/watt is attained. The realization of MHT outperforms the realization of DGEMM, as shown in figure 14(d). We also show scalability of our solution by attaching the PE as a CFU in REDEFINE.
Due to the availability of double precision floating point arithmetic units like adder, multiplier, square root, and divider, we emphasize the realization of DGEQR2 and DGEQRF using HT and MHT [21][22]. The organization of the paper is as follows: In section 2, we briefly discuss REDEFINE and some of the recent realizations of QR factorization. In section 3, case studies of DGEMM, DGEQR2, and DGEQRF are presented and the limitations of recent multicores and GPGPUs in exploiting parallelism are identified. In section 4, HT is revisited, MHT is presented, and the realization of MHT on multicore and GPGPU is presented. In section 5, custom realization of HT and MHT on the PE is presented, along with a parallel realization of MHT. We summarize our work in section 6.

Nomenclature:
BLAS_DGEMM: Legacy realization of DGEMM
LAPACK_DGEQR2: Legacy realization of HT based QR factorization
LAPACK_DGEQRF: Legacy realization of HT based block QR factorization
LAPACK_DGEQR2HT: Realization of MHT based QR factorization
LAPACK_DGEQRFHT: Realization of MHT based block QR factorization
PLASMA_DGEQRF: Legacy realization of HT based tiled QR factorization
PLASMA_DGEQRFHT: Realization of MHT based block QR factorization
MAGMA_DGEQR2: Legacy realization of HT based QR factorization
MAGMA_DGEQRF: Legacy realization of HT based block/tile QR factorization
MAGMA_DGEQR2HT: Realization of MHT based QR factorization
MAGMA_DGEQRFHT: Realization of MHT based block/tile QR factorization
2 Background and Related Work
We revisit the REDEFINE microarchitecture briefly and discuss the performance of some recent DLA computations realized on REDEFINE. Classical HT and its WY representation are also discussed [1]. In the latter part of the section, we discuss some recent realizations of QR factorization in the literature and their shortcomings.
2.1 REDEFINE and DLA on REDEFINE
A system-level diagram of REDEFINE is shown in figure 3, where a PE designed for efficient realization of DLA is attached. The microarchitecture of the PE is depicted in figure 4.
The PE has been taken through several architectural enhancements to improve its performance and to ensure maximal overlap of computation and communication [19]. Performance variation in the PE due to these architectural enhancements is shown in figure 5(a), while the change in performance in terms of Gflops/watt due to each enhancement is depicted in figure 5(b). Further details of the PE can be found in [19] and [24]. Due to the unique features of the PE and REDEFINE, we choose the PE for our methodology of algorithm-architecture co-design for HT. It can be observed in figure 5(c) that the PE achieves almost 3-140x performance improvement over some commercially available multicores, FPGAs, and GPGPUs for DGEMM [23]. It is also shown in [23] that the PE, when used as a CFU in REDEFINE, facilitates scalable parallel realization of BLAS.
2.2 Classical HT
In this section, we briefly explain HT and its WY representation. The Householder matrix for annihilation of elements of a column of a matrix A is given by

P = I − 2vv^T / (v^T v),    (2)

where P is orthogonal (symmetric for real matrices and Hermitian for complex matrices) and v is the Householder vector. The steps for computing the Householder vector are as follows for a 3×3 matrix A: from the first column x of A and its 2-norm, v can be computed as

v = x + sign(x₁)‖x‖₂ e₁,    (3)

where x₁ is the first element of x and e₁ is the first column of the identity matrix. From the vector v, a matrix P can be computed that annihilates a₂₁ and a₃₁. The matrix P is then multiplied with the second and third columns of A. In the next iterations, a similar procedure is followed to annihilate the sub-diagonal elements of the trailing matrix updated by the previous iteration. Directed Acyclic Graphs (DAGs) for the annihilation of a₂₁ and a₃₁ are shown in figure 6.
It can be observed in figure 6 that there exists minimal parallelism in the computation of the Householder vector v. The major source of parallelism is the matrix-vector and matrix-matrix operations encountered in the computation of the P matrices. As per Amdahl's law, the performance of the routine is limited by the part of the program that cannot be parallelized, which here is the computation of the v vectors [25].
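The annihilation step just described can be sketched with the standard textbook formulas (the notation and the helper `householder` are ours, for illustration):

```python
import numpy as np

# Classical Householder annihilation of one column (equations 2-3 style).
def householder(x):
    # v = x + sign(x1) * ||x||_2 * e1; sign chosen to avoid cancellation.
    v = x.astype(float).copy()
    v[0] += (1.0 if x[0] >= 0 else -1.0) * np.linalg.norm(x)
    beta = 2.0 / (v @ v)
    return v, beta

A = np.array([[4.0, 1.0, 2.0],
              [3.0, 2.0, 1.0],
              [0.0, 1.0, 3.0]])
v, beta = householder(A[:, 0])
P = np.eye(3) - beta * np.outer(v, v)   # Householder matrix (equation 2)
A1 = P @ A                               # a21 and a31 are annihilated
```

Note that computing `v` is an inherently sequential Level-1 operation, while the update `P @ A` supplies the matrix-vector/matrix-matrix parallelism visible in figure 6.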
2.3 Related Work
HT was first presented by Alston S. Householder in 1958 [26], and it showed significant improvement in terms of computations over classical Givens Rotation (GR). Since then, several innovations at the algorithmic and technology levels have prompted different styles of realization for the HT proposed in [26]. The first breakthrough arrived with the advent of efficient processors with memory hierarchies, which spurred innovation in efficient realization of Level-3 BLAS [27]. Considering the efficiency of Level-3 BLAS on the processors of that era, there was a successful attempt to realize higher-level routines in terms of Level-3 operations. This attempt gave rise to LAPACK, a successor of LINPACK, which uses matrix-matrix operations as basic building blocks for the realization of routines like LU, QR, and Cholesky factorizations [6]. Simultaneously, there was an innovation in HT that resulted in the WY-representation of HT presented in [28]. The WY-representation proposed in [28] is storage efficient and exploits the memory hierarchy of the underlying platform more efficiently due to heavy use of Level-3 BLAS operations, at an operation count comparable to its predecessor proposed in [29]. Since then, the WY-representation of HT has been preferred due to its computational density, and over the years there have been several realizations of HT on contemporary platforms like multicores, GPGPUs, and FPGAs. LAPACK_DGEQR2 was the first implementation, dominated by matrix-vector operations (also depicted in figure 2), while its improved version, LAPACK_DGEQRF, performs HT-based block QR factorization rich in matrix-matrix operations.
In recent years, with the advent of multicores and GPGPUs, two major packages have been developed, namely Parallel Linear Algebra Software for Multicore Architectures (PLASMA) for multicore architectures and Matrix Algebra on GPU and Multicore Architectures (MAGMA) for heterogeneous computing environments [7][30]. The corresponding routine realized in PLASMA is PLASMA_DGEQRF, which uses LAPACK_DGEQR2 and BLAS_DGEMM along with Queuing and Runtime for Kernels (QUARK) for efficient realization of HT-based QR factorization on multicore platforms [31]. Similarly, MAGMA uses efficiently realized MAGMA_BLAS and MAGMA_DGEQR2 for the realization of MAGMA_DGEQRF. The software stacks of PLASMA and MAGMA are explained in section 4.1. There have been several attempts at scalable parallel realization of HT-based QR factorization on FPGAs [32][33]. The approach presented in [33] is a multicore-style approach emulated on an FPGA, while LAPACKrc presented in [32] is a scalable parallel realization of LU, QR, and Cholesky factorizations. FPGA-based realizations clearly outperform multicore and GPGPU based realizations; a major drawback of FPGA-based realization, however, is the energy efficiency of the final solution. It is shown in [23] that a CFU tailored for DLA computations clearly outperforms multicore, GPGPU, and FPGA based solutions and is an ideal platform for our experiments. Surprisingly, the multicore, GPGPU, and FPGA based solutions for HT-based QR factorization dwell on efficient exploitation of the memory hierarchy, but none of them focus on rearrangement of computations to expose a higher degree of parallelism in HT. In this paper, we present a modification to classical HT that maintains the computation count and memory accesses of classical HT. We also realize the proposed MHT in PLASMA and MAGMA and show marginal improvement on multicore and no improvement on GPGPU.
3 Case Studies
We present case studies of the available realizations of DGEMM, DGEQR2, and DGEQRF in LAPACK, PLASMA, and MAGMA, and discuss results on multicore and GPGPU platforms.
3.1 DGEQR2
Pseudo code of DGEQR2 is described in algorithm 2. It can be observed from the pseudo code that it contains three steps: 1) computation of a Householder vector for each column, 2) computation of the Householder matrix P, and 3) update of the trailing matrix using P (from equation 2). The Cycles-per-Instruction (CPI) attained by LAPACK_DGEQR2 executed on commercially available microarchitectures is shown in figure 7(a). For our experiments, we use the Intel C Compiler (ICC) and the Intel Fortran Compiler (IFORT). We also use different compiler switches to improve the performance of LAPACK_DGEQR2 on Intel microarchitectures. It can be observed in figure 7(a) that on the Intel Core i7 machine, which is a Haswell microarchitecture, the attained CPI saturates. It can be observed in figure 7(b) that the attained performance saturates at 3 Gflops. Similarly, the percentage of peak performance saturates at 7-8% of the peak for the Intel Haswell microarchitecture, as observed in figure 7(d). When a compiler switch is used that enables Advanced Vector Extensions (AVX) instructions, the attained CPI increases. This behavior is due to AVX instructions that use Fused Multiply-Add (FMA). Because of this, the CPI reported by VTune™ cannot be taken directly as a measure of performance for these algorithms, and hence we accordingly double the instruction count reported by VTune™.
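The three steps of DGEQR2 can be sketched as an unblocked Householder QR loop. This is an illustrative NumPy re-implementation, not LAPACK's Fortran: DGEQR2 stores the reflectors compactly in A and tau instead of forming Q explicitly.

```python
import numpy as np

# Unblocked Householder QR in the style of DGEQR2 (illustrative sketch).
def geqr2(A):
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for k in range(min(m, n)):
        # Step 1: Householder vector for column k (assumes a nonzero column).
        x = R[k:, k]
        v = x.copy()
        v[0] += (1.0 if x[0] >= 0 else -1.0) * np.linalg.norm(x)
        beta = 2.0 / (v @ v)
        # Steps 2-3: apply H = I - beta*v*v^T to the trailing matrix
        # via matrix-vector products (the XGEMV-dominated part).
        R[k:, k:] -= beta * np.outer(v, v @ R[k:, k:])
        Q[:, k:] -= beta * np.outer(Q[:, k:] @ v, v)
    return Q, R

A = np.array([[4.0, 1.0, 2.0],
              [3.0, 2.0, 1.0],
              [0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0]])
Q, R = geqr2(A)
```

The per-column dependence between steps 1 and 3 is what makes this routine bandwidth bound on commodity hardware.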
In the case of GPGPUs, MAGMA_DGEQR2 is able to achieve up to 16 Gflops on a Tesla C2050, which is 3.1% of the theoretical peak performance of the Tesla C2050, while performance in terms of Gflops/watt is as low as 0.04 Gflops/watt.
3.2 DGEMM
Pseudo code for the BLAS_DGEMM routine is shown in algorithm 3. It can be observed that DGEMM has three nested loops. DGEMM is a Level-3 BLAS routine and is used as a building block in block algorithms in DLA software packages, since the computations in DGEMM are regular in nature and easy to parallelize. The CPI attained by BLAS_DGEMM is 0.37 on the Intel Haswell microarchitecture, as shown in figure 7(a). We have used Netlib BLAS for our experiments with the platform-specific compiler and all compiler optimizations enabled. BLAS_DGEMM is able to achieve up to 8.5 Gflops on the Intel Haswell microarchitecture with all optimizations, as shown in figure 7(b), which is 17% of the theoretical peak performance of the microarchitecture, as depicted in figure 7(d). In the case of GPGPU, MAGMA_DGEMM is able to achieve up to 295 Gflops on an Nvidia Tesla C2050, which is 57% of the peak performance, as shown in figures 7(c) and 7(d). In terms of Gflops/watt, BLAS_DGEMM is capable of attaining 0.12 Gflops/watt on the Intel Haswell microarchitecture, while on the Tesla C2050, MAGMA_DGEMM attains 1.21 Gflops/watt.
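The three nested loops of algorithm 3 look roughly as follows. This is a naive sketch for illustration; tuned BLAS implementations add blocking, packing, and vectorization on top of the same loop structure:

```python
# Naive three-nested-loop DGEMM: C = alpha*A*B + beta*C.
def dgemm(alpha, A, B, beta, C):
    m, k = len(A), len(B)
    n = len(B[0])
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):          # innermost dot-product loop
                acc += A[i][p] * B[p][j]
            C[i][j] = alpha * acc + beta * C[i][j]
    return C

C = dgemm(1.0, [[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]],
          0.0, [[0.0, 0.0], [0.0, 0.0]])
# C is [[19.0, 22.0], [43.0, 50.0]]
```

Every element of C can be computed independently, which is why DGEMM parallelizes so well and serves as the performance yardstick for LAPACK routines.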
3.3 DGEQRF
Pseudo code for the DGEQRF routine is shown in algorithm 4. In terms of computations, there is no difference between algorithms 2 and 4. In a single-core implementation, LAPACK_DGEQRF is observed to be 2-3x faster than LAPACK_DGEQR2. The major sources of efficiency in LAPACK_DGEQRF are efficient utilization of the processor memory hierarchy and the BLAS_DGEMM routine, which is a compute bound operation [34][35]. The CPI attained by LAPACK_DGEQRF is 0.43, as shown in figure 7(a), which is much lower than the CPI attained by LAPACK_DGEQR2. In terms of Gflops, LAPACK_DGEQRF is 2-3x better than LAPACK_DGEQR2, as shown in figure 7(b), while the performance attained by LAPACK_DGEQRF is 85% of the performance attained by BLAS_DGEMM. LAPACK_DGEQRF achieves 6-7 Gflops on the Intel Haswell microarchitecture, as shown in figure 7(b). On the Nvidia Tesla C2050, MAGMA_DGEQRF is able to achieve up to 265 Gflops, as shown in figure 7(c), which is 51.4% of the theoretical peak performance of the Tesla C2050, as shown in figure 7(d), and 90.5% of the performance attained by MAGMA_DGEMM. In MAGMA_DGEQR2, the performance attained in terms of Gflops/watt is as low as 0.05 Gflops/watt, while for MAGMA_DGEMM and MAGMA_DGEQRF it is 1.21 Gflops/watt and 1.09 Gflops/watt respectively on the Nvidia Tesla C2050, as shown in figure 7(e). In the case of PLASMA_DGEQRF, the performance attained is 0.39 Gflops/watt when running on four cores.
Based on our empirical case studies of DGEQR2, DGEMM, and DGEQRF, we make the following observations.

Due to the presence of bandwidth bound operations like DGEMV in DGEQR2, the performance of DGEQR2 is not satisfactory on Intel or Nvidia microarchitectures.

Despite the presence of compute bound operations like DGEMM in LAPACK_DGEQRF, PLASMA_DGEQRF, and MAGMA_DGEQRF, these routines are able to achieve only 8-16% of the peak Gflops on Intel Haswell and 51.5% of the peak Gflops on the Nvidia Tesla C2050 respectively.

The performance achieved by DGEQR2, DGEMM, and DGEQRF on Intel Haswell and Nvidia Tesla C2050 is as low as 0.05-1.23 Gflops/watt.
Based on the above observations, we see scope for optimization of DGEQR2 and further improvement of the DGEQRF routine in LAPACK. In the next section, we continue our quest for optimization of these routines for commercially available microarchitectures.
4 Modified Householder Transform
In this section, we revisit the classical HT described in algorithm 2 and look for further tuning of the algorithm, assuming infinite memory bandwidth and an infinite number of arithmetic units for computing the P matrix. It can be observed in algorithm 2 that the computation of the Householder vector is a Level-1 BLAS operation, and there is no further scope for optimization in it. The computations that depend on the Householder vector are the computation of the Householder matrix P and the trailing matrix update of the input matrix A. In classical HT, the trailing matrix update is performed by pre-multiplying the trailing matrix with the Householder matrix, as shown in equations 4 and 5 (and again in equation 11):

P = I − βvv^T, where β = 2/(v^T v),    (4)

A' = PA = A − βv(v^T A).    (5)

Equation 5 in algorithmic form is shown in algorithm 5. It can be observed in algorithm 5 that the computation of w = v^T A and the update A − βvw can be merged. The routine in which we merge these two loops is shown in algorithm 6.
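The merge can be illustrated as follows (our sketch, not algorithms 5-6 themselves): the classical update makes two passes over the trailing matrix, while the fused version computes each output column in a single pass:

```python
import numpy as np

# Classical two-pass trailing update: first w = v^T A, then rank-1 update.
def update_classical(A, v, beta):
    w = v @ A                            # pass 1: matrix-vector product
    return A - beta * np.outer(v, w)     # pass 2: rank-1 update

# Fused update: per column, the dot product and the update are merged,
# removing one level from the dependence DAG.
def update_fused(A, v, beta):
    out = A.copy()
    for j in range(A.shape[1]):
        wj = v @ A[:, j]                 # dot product for column j
        out[:, j] -= beta * wj * v       # immediate update of column j
    return out

A = np.array([[4.0, 1.0], [3.0, 2.0], [0.0, 1.0]])
v = np.array([9.0, 3.0, 0.0])            # example Householder vector
beta = 2.0 / (v @ v)
same = np.allclose(update_classical(A, v, beta), update_fused(A, v, beta))
```

Both variants perform the same arithmetic, which matches the paper's claim that MHT preserves the computation count while shortening the DAG.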
The DAG of MHT for the example matrix is shown in figure 8. It can be observed from figures 6 and 8 that, due to the fusing of the inner and intermediate loops in MHT, the depth of the graph decreases. It can also be observed that there are more operations per level in the graph. If the number of operations at level i in the DAG of HT is n_i and the DAG has L levels, then the average number of operations per level, Γ, is given by equation 6:

Γ = (1/L) Σᵢ₌₁ᴸ n_i.    (6)

In the case of MHT, since there is a decrease in the number of levels in the DAG of the algorithm, Γ_MHT > Γ_HT, where Γ_MHT is the average for MHT and Γ_HT is the average for HT. Quantifying Γ for HT and MHT,
(7)  
(8) 
Considering the ratio of the average operations per level of HT and MHT,
(9)  
(10) 
The parameter in equation 9 is the ratio of the quantified parallelism in HT and MHT. For our analysis, this ratio is independent of the computations, since there is no change in the computation count, and it is also independent of communication, since the communication pattern remains identical in HT and MHT. The smaller the value of this ratio, the greater the parallelism in MHT. For HT and MHT, the ratio saturates at 0.749, as shown in figure 9. The method used here to quantify the parallelism in the routine is simple, since the computations are regular in nature; for more complex algorithms, the method described in [36] can be used. To support the proposition of improved parallelism in MHT, we experiment on several commercially available multicores and GPGPUs. For our experiments on multicore, we have used the LAPACK available from Netlib with vendor-specific optimizations, and we have ensured LAPACK program semantics while realizing and integrating MHT in LAPACK [6]. For GPGPUs, we have used the highly tuned MAGMA package and ensured that the program semantics of MAGMA_DGEQR2HT conform with MAGMA semantics for ease of integration with the package [30].
4.1 Realization on Multicore and GPGPU
We realize MHT on two commercially available microarchitectures, Intel Haswell and AMD Bulldozer, as shown in figure 11(a). Figure 11(a) also depicts the performance of DGEQRFHT, where DGEQRFHT is a blocked algorithm analogous to DGEQRF in LAPACK. Since PLASMA is developed using LAPACK, as shown in figure 10(a), we integrate LAPACK_DGEQRFHT in PLASMA for our experiments.
Pseudo code of LAPACK_DGEQR2HT is shown in algorithm 7. The routine is implemented as dgeqr2_ht.f in the source code provided with this exposition. Pseudo code of the BLAS_UPDATE function, implemented as update1.f, is shown in algorithm 8. It can be observed in algorithms 7 and 8 that the BLAS_UPDATE function forms the major computationally intensive part of LAPACK_DGEQR2HT. BLAS_UPDATE becomes part of BLAS and is used inside LAPACK as a basic building block. The routines shown in algorithms 7 and 8, and the wrappers to integrate them in LAPACK, are supplied with this exposition, where the directory structure of the legacy LAPACK software package is maintained. LAPACK, along with Queuing and Runtime for Kernels (QUARK), is used in the PLASMA software stack for multicore realization of the LAPACK software package, as depicted in figure 10(a) [7].
Similarly, for realization of MHT on GPGPU, we use the MAGMA software stack described in figure 10(b). MHT as implemented for GPGPU is shown in algorithm 9.
It can be observed in algorithm 9 that the computationally intensive part of MAGMA_DGEQR2HT is the MAGMABLAS_UPDATE function shown in algorithm 10. Similar to BLAS_UPDATE in LAPACK_DGEQR2HT, where BLAS_UPDATE is part of BLAS, MAGMABLAS_UPDATE is part of MAGMABLAS. MAGMABLAS_UPDATE is realized as a series of kernels in CUDA C, shown in algorithms 11, 12, 13, 14, 15, and 16.
It can be observed in figure 11(a) that on the AMD Bulldozer microarchitecture, LAPACK_DGEQR2HT performs better than LAPACK_DGEQR2, LAPACK_DGEQRF, and LAPACK_DGEQRFHT. The performance of LAPACK_DGEQRFHT and LAPACK_DGEQRF is observed to be the same on AMD Bulldozer. On the Intel Core i7, which is a Haswell microarchitecture, the performance of LAPACK_DGEQRFHT and LAPACK_DGEQRF is the same; furthermore, LAPACK_DGEQRFHT and LAPACK_DGEQRF perform around 10% better than LAPACK_DGEQR2HT. The performance attained when we integrate LAPACK_DGEQR2HT in PLASMA is depicted in figure 11(b); this results in PLASMA_DGEQRFHT, since the trailing matrix update uses BLAS_DGEMM along with QUARK. For integration of LAPACK_DGEQR2HT in PLASMA, we replaced the instance of LAPACK_DGEQRF in PLASMA with the instance of LAPACK_DGEQR2HT. It can be observed in figure 11(b) that the performance attained by PLASMA_DGEQRFHT is 10% worse than that attained by PLASMA_DGEQRF.
The performance of MAGMA_DGEQR2HT, MAGMA_DGEQR2, MAGMA_DGEQRF, and MAGMA_DGEQRFHT is shown in figure 11(b). It can be observed that the performance of MAGMA_DGEQR2 and MAGMA_DGEQR2HT is almost the same on the Nvidia Tesla C2050, and the performance of MAGMA_DGEQRF and MAGMA_DGEQRFHT is also similar. Unlike on the Intel and AMD microarchitectures, the performance of MHT on Nvidia is nowhere close to the performance of MAGMA_DGEQRF (or MAGMA_DGEQRFHT). These performance figures are not satisfactory, since from the quantified parallelism we expect the performance achieved by LAPACK/PLASMA/MAGMA_DGEQR2HT to be 1.1-1.3x better than that of LAPACK/PLASMA/MAGMA_DGEQR2, and we also expect LAPACK/PLASMA/MAGMA_DGEQR2HT to outperform LAPACK/PLASMA/MAGMA_DGEQRF. Due to the lack of domain customizations for BLAS and LAPACK, these platforms are not able to exploit the parallelism available in MHT; hence we achieve only marginally better performance on multicores and the same performance on GPGPU for the realization of MHT compared to the realization of HT. In this section, we presented a modification to classical HT and arrived at MHT, where a significant improvement in average operations per DAG level is observed. We further see scope for improvement in the realization of MHT, where we identify macro operations in MHT and realize them on a specialized RDP that can efficiently execute the identified macro operations.
5 Custom Realization of Householder Transform and Results
In this section, we present a custom realization of HT and show that, through algorithm-architecture co-design, the performance of the algorithms can be improved significantly. We adopt the methodology presented in [37] and [19], and design a Processing Element (PE) that is efficient in overlapping computation and communication. The PE presented here is also efficient in exploiting the Instruction Level Parallelism (ILP) exhibited by BLAS [23][24]. We use this PE as a CFU in REDEFINE for parallel realization, to show scalability of the algorithms and the architecture.
5.1 Processing Element Design and Algorithm-architecture Co-design for HT
The design of the PE is depicted in figure 12. It can be observed that the PE is separated into two modules: 1) a Floating Point Sequencer (FPS), and 2) a Load-Store CFU. All double precision floating point computations are performed in the FPS, while the Load-Store CFU is responsible for loading/storing data between Global Memory (GM) and Local Memory (LM), and between LM and the Register File, where GM is the next level of memory in the parallel realization, LM is the private memory of the PE, and the Register File is a small memory of registers [19][21]. The operation of the PE can be described in the following steps:

Step 1: Send a load request to GM for input matrices and store the arriving elements of the input matrices to LM

Step 2: Store input matrix elements from LM to Register File

Step 3: Perform computations in FPS

Step 4: Store the final/intermediate results from Register File to LM in the LoadStore CFU

Step 5: Store final result to GM
In our implementations of DGEQR2, DGEQRF, DGEQR2HT, and DGEQRFHT, we use a similar mechanism. The FPS has several resources to perform computations. In this exposition, we use a carefully designed DOT4 unit, a square root unit, and a divider for the realization of the DGEQR2, DGEQRF, DGEQR2HT, and DGEQRFHT routines [21][22]. The logical placement of the arithmetic units is shown in figure 12 and the structure of DOT4 is shown in figure 13.
DOT4 can perform the inner product of two 4-element vectors. DOT4 is a reconfigurable datapath that can be reconfigured to act as different macro operations encountered in the algorithms [19]. In DGEQR2HT, due to the fusion of the innermost and intermediate loops, we identify a new macro operation apart from the usual macro operations. We explain this with the example of a 3×3 matrix A. Applying DGEQR2HT to the matrix A to annihilate a₂₁ and a₃₁, we compute the Householder matrix P and pre-multiply it with A, as shown in equation 11:
A' = PA = (I − βvv^T)A,    (11)

where v is the Householder vector as explained in section 2. Element-wise, the updated matrix is given by

a'ᵢⱼ = aᵢⱼ − βvᵢ(v₁a₁ⱼ + v₂a₂ⱼ + v₃a₃ⱼ),    (12)

where β = 2/(v^T v) and a'ᵢⱼ is the (i, j)-th element of the updated matrix.
Taking a close look at the expressions for the updated matrix in equation 12, we observe a new macro operation in the expression aᵢⱼ − βvᵢ(v₁a₁ⱼ + v₂a₂ⱼ + v₃a₃ⱼ). One such macro operation is encountered per element of the updated matrix in equation 12. We realize this macro operation on the DOT4 hardware structure shown in figure 13. For larger matrices, we break the expressions to fit the DOT4 hardware structure. The performance of the realizations of DGEQR2, DGEQRF, DGEQR2HT, and DGEQRFHT is depicted in figure 14(a).
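The mapping of this macro operation onto a 4-element dot-product datapath can be sketched as follows (our construction; `dot4` and `fused_update_elem` are illustrative stand-ins for the DOT4 configuration, not the paper's hardware description):

```python
import numpy as np

# Emulate a DOT4-style unit: inner product of two 4-element vectors.
def dot4(x, y):
    return x[0]*y[0] + x[1]*y[1] + x[2]*y[2] + x[3]*y[3]

# The fused macro operation a'_ij = a_ij - beta * v_i * (v . a_j),
# with the dot product broken into 4-element chunks for DOT4.
def fused_update_elem(a_ij, beta, v_i, v, a_col):
    n = len(v)
    pad = (-n) % 4                       # zero-pad to a multiple of 4
    vp = np.pad(v, (0, pad))
    ap = np.pad(a_col, (0, pad))
    s = sum(dot4(vp[k:k+4], ap[k:k+4]) for k in range(0, n + pad, 4))
    return a_ij - beta * v_i * s

A = np.array([[4.0, 1.0], [3.0, 2.0], [0.0, 1.0]])
v = np.array([9.0, 3.0, 0.0])
beta = 2.0 / (v @ v)
a_new_00 = fused_update_elem(A[0, 0], beta, v[0], v, A[:, 0])
# Reference: the full update (I - beta*v*v^T) A yields the same element.
ref = (A - beta * np.outer(v, v @ A))[0, 0]
```

Each output element needs one chunked dot product plus a multiply-subtract, which is why the update maps naturally onto a reconfigured DOT4 datapath.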
It can be observed in figure 14(a) that DGEQR2HT achieves 2x better performance than DGEQR2, 1.23x better than DGEQRF, and 1.23x better than DGEQRFHT. Interestingly, DGEQR2HT achieves close to 74% of the peak performance of the PE, which is 99.3% of the performance attained by DGEMM in the PE, as shown in figure 14(b). In terms of Gflops/watt, DGEQR2HT achieves 2x better performance compared to DGEQR2, 1.23x over DGEQRF, and 1.2x over DGEQRFHT. Compared to some state-of-the-art realizations of HT-based QR factorization, the proposed MHT-based QR factorization attains 3-90x better performance, which exceeds the performance improvement reported for DGEMM in [23]. Such an unexpected result is attained because, on state-of-the-art platforms, the performance attained by LAPACK/PLASMA/MAGMA_DGEQRF is mostly 80-85% of the peak performance attained by DGEMM, while in the PE the performance attained by MHT-based QR factorization is 99.3% of the performance attained by DGEMM.
5.2 Parallel Implementation of HT
For parallel realization of the DGEQR2, DGEQRF, DGEQR2HT, and DGEQRFHT routines, we use the simulation environment depicted in figure 12. We attach a PE to the router in each Tile of REDEFINE except those in the last column. In the last column, we attach CFUs with memories that share the same address space; we use this memory as the Global Memory (GM).
To show scalability, our experiments consider three configurations with Tile arrays of different sizes. Two of these configurations are shown in figure 15, one composed of a smaller Tile array as the fabric for computations and the other of a larger Tile array. The matrix partitioning scheme for the different configurations is also depicted in figure 15 alongside the configurations. For matrix partitioning, we follow the approach used in [7], where the input matrix is divided into blocks of submatrices sized to match the fabric used for computations. Since the objective of our experiments is to show scalability, we choose the matrix and block sizes accordingly. Results for the parallel implementation of the DGEQR2, DGEQRF, DGEQR2HT, and DGEQRFHT routines are depicted in figure 14(e). It can be observed in figure 14(f) that the speedup in the parallel realizations of DGEQR2, DGEQRF, DGEQR2HT, and DGEQRFHT grows with the size of the Tile array. In figure 14(e), the speedup attained in the parallel realization of any routine is the speedup over the corresponding sequential realization of that routine. The percentage of the theoretical peak performance attained by DGEQR2, DGEQRF, DGEQR2HT, and DGEQRFHT is shown in figure 14(f). It can be observed there that DGEQR2HT is capable of attaining 66% of the theoretical peak of the Tile array utilized for computations, while DGEQR2 attains 16%, DGEQRF attains 39.5%, and DGEQRFHT attains 42% of the theoretical peak performance in REDEFINE. DGEQR2HT in REDEFINE clearly outperforms all the other routines, as depicted in figure 14(f), while REDEFINE scales well for DGEQR2, DGEQRF, DGEQR2HT, and DGEQRFHT.
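The block partitioning in the style of [7] can be sketched as follows, assuming a square matrix whose dimension is divisible by the number of blocks per side; the function name and grid layout are illustrative, with one submatrix block per compute Tile.

```python
def partition(A, p):
    """Divide square matrix A into a p x p grid of submatrix blocks,
    one block per compute Tile of a p x p fabric.
    Assumes len(A) is divisible by p."""
    n = len(A)
    assert n % p == 0
    b = n // p  # block dimension
    return [[[row[j * b:(j + 1) * b] for row in A[i * b:(i + 1) * b]]
             for j in range(p)] for i in range(p)]
```

For a 4x4 matrix on a 2x2 fabric, `partition(A, 2)[i][j]` is the 2x2 block destined for the Tile in row i, column j.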
6 Conclusion
The performance attained by Householder Transform based QR factorization on state-of-the-art multicores and GPGPUs is usually 80-85% of the performance attained by General Matrix Multiplication. In this paper, we achieved performance in Modified Householder Transform similar to that of General Matrix Multiplication in terms of Gflops, contrary to the performance attained on conventional multicore and GPGPU platforms with state-of-the-art software packages for Dense Linear Algebra. We moved away from the classical approach, in which optimized Basic Linear Algebra Subprograms are used to realize Householder Transform based QR factorization, and instead fused the inner loops of the Householder Transform based QR factorization routine (DGEQR2), which resulted in a higher degree of parallelism. A simple approach for quantifying parallelism was adopted to show 1.3 times higher parallelism in Modified Householder Transform. The classical Householder Transform based QR factorization and the Modified Householder Transform based QR factorization, along with their optimized block implementations, were evaluated on state-of-the-art multicores and GPGPUs, where it was observed that the parallelism exhibited by Modified Householder Transform is not fully exploited by these platforms. The design of an existing Processing Element was amended to support the macro operations encountered in Modified Householder Transform by adding a new configuration to the Reconfigurable Datapath in the Processing Element. This approach of realizing macro operations on a Reconfigurable Datapath resulted in a 2x performance improvement of Modified Householder Transform over classical Householder Transform in the PE. The realization of Modified Householder Transform also outperformed custom realizations of block Householder Transform based QR factorization by 1.2-1.3x. A performance improvement of 380x is reported in terms of Gflops/watt over multicore, GPGPU, FPGA, and ClearSpeed CSX700 platforms.
The reported performance improvement is higher than that of General Matrix Multiplication due to the counterintuitive results obtained for Modified Householder Transform. Finally, it is shown that using the Processing Element as a Custom Function Unit in REDEFINE results in a scalable, high performance realization of Householder based and Modified Householder based QR factorization.
References
 [1] G. H. Golub and C. F. Van Loan, Matrix computations (3rd ed.). Baltimore, MD, USA: Johns Hopkins University Press, 1996.
 [2] C. Thornton and G. Bierman, “Givens transformation techniques for Kalman filtering,” Acta Astronautica, vol. 4, no. 7–8, pp. 847–863, 1977. [Online]. Available: http://www.sciencedirect.com/science/article/pii/0094576577900170
 [3] H. Rutishauser, “Computational aspects of F. L. Bauer’s simultaneous iteration method,” Numerische Mathematik, vol. 13, no. 1, pp. 4–13, 1969. [Online]. Available: http://dx.doi.org/10.1007/BF02165269

 [4] A. Joseph and B. Yu, “Impact of regularization on spectral clustering,” in 2014 Information Theory and Applications Workshop (ITA), Feb 2014, pp. 1–2.
 [5] G. Strang, Linear Algebra and Its Applications. Brooks Cole, February 1988. [Online]. Available: http://www.amazon.ca/exec/obidos/redirect?tag=citeulike0920{&}path=ASIN/0155510053
 [6] E. Anderson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney, J. Du Croz, S. Hammarling, J. Demmel, C. Bischof, and D. Sorensen, “LAPACK: A portable linear algebra library for high-performance computers,” in Proceedings of the 1990 ACM/IEEE Conference on Supercomputing, ser. Supercomputing ’90. Los Alamitos, CA, USA: IEEE Computer Society Press, 1990, pp. 2–11. [Online]. Available: http://dl.acm.org/citation.cfm?id=110382.110385
 [7] J. Kurzak, P. Luszczek, A. YarKhan, M. Faverge, J. Langou, H. Bouwmeester, and J. Dongarra, “Multithreading in the PLASMA library,” 2013.
 [8] J. L. Hennessy and D. A. Patterson, Computer Architecture, Fifth Edition: A Quantitative Approach, 5th ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011.
 [9] N. Satish, C. Kim, J. Chhugani, H. Saito, R. Krishnaiyer, M. Smelyanskiy, M. Girkar, and P. Dubey, “Can traditional programming bridge the ninja performance gap for parallel computing applications?” Commun. ACM, vol. 58, no. 5, pp. 77–86, 2015. [Online]. Available: http://doi.acm.org/10.1145/2742910
 [10] S. Kestur, J. D. Davis, and O. Williams, “BLAS comparison on FPGA, CPU and GPU,” in Proceedings of the 2010 IEEE Annual Symposium on VLSI, ser. ISVLSI ’10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 288–293. [Online]. Available: http://dx.doi.org/10.1109/ISVLSI.2010.84
 [11] L. Hu, X. Che, and S.-Q. Zheng, “A closer look at GPGPU,” ACM Comput. Surv., vol. 48, no. 4, pp. 60:1–60:20, Mar. 2016. [Online]. Available: http://doi.acm.org/10.1145/2873053
 [12] A. Pedram, S. Z. Gilani, N. S. Kim, R. A. van de Geijn, M. J. Schulte, and A. Gerstlauer, “A linear algebra core design for efficient level-3 BLAS,” in ASAP, 2012, pp. 149–152.
 [13] A. Pedram, R. A. van de Geijn, and A. Gerstlauer, “Co-design tradeoffs for high-performance, low-power linear algebra architectures,” IEEE Trans. Computers, vol. 61, no. 12, pp. 1724–1736, 2012. [Online]. Available: http://doi.ieeecomputersociety.org/10.1109/TC.2012.132
 [14] Z. E. Rákossy, F. Merchant, A. A. Aponte, S. K. Nandy, and A. Chattopadhyay, “Efficient and scalable CGRA-based implementation of column-wise Givens rotation,” in ASAP, 2014, pp. 188–189.
 [15] M. Alle, K. Varadarajan, A. Fell, R. R. C., N. Joseph, S. Das, P. Biswas, J. Chetia, A. Rao, S. K. Nandy, and R. Narayan, “REDEFINE: Runtime reconfigurable polymorphic ASIC,” ACM Trans. Embed. Comput. Syst., vol. 9, no. 2, pp. 11:1–11:48, Oct. 2009. [Online]. Available: http://doi.acm.org/10.1145/1596543.1596545
 [16] A. Fell, P. Biswas, J. Chetia, S. K. Nandy, and R. Narayan, “Generic routing rules and a scalable access enhancement for the network-on-chip RECONNECT,” in SoCC, 2009, pp. 251–254.

 [17] F. Merchant, A. Chattopadhyay, G. Garga, S. K. Nandy, R. Narayan, and N. Gopalan, “Efficient QR decomposition using low complexity column-wise Givens rotation (CGR),” in 2014 27th International Conference on VLSI Design and 2014 13th International Conference on Embedded Systems, Mumbai, India, January 5-9, 2014, 2014, pp. 258–263. [Online]. Available: http://dx.doi.org/10.1109/VLSID.2014.51
 [18] S. Das, K. T. Madhu, M. Krishna, N. Sivanandan, F. Merchant, S. Natarajan, I. Biswas, A. Pulli, S. K. Nandy, and R. Narayan, “A framework for post-silicon realization of arbitrary instruction extensions on reconfigurable datapaths,” Journal of Systems Architecture - Embedded Systems Design, vol. 60, no. 7, pp. 592–614, 2014. [Online]. Available: http://dx.doi.org/10.1016/j.sysarc.2014.06.002
 [19] F. Merchant, A. Maity, M. Mahadurkar, K. Vatwani, I. Munje, M. Krishna, S. Nalesh, N. Gopalan, S. Raha, S. Nandy, and R. Narayan, “Micro-architectural enhancements in distributed memory CGRAs for LU and QR factorizations,” in VLSI Design (VLSID), 2015 28th International Conference on, Jan 2015, pp. 153–158.
 [20] Z. E. Rákossy, F. Merchant, A. A. Aponte, S. K. Nandy, and A. Chattopadhyay, “Scalable and energy-efficient reconfigurable accelerator for column-wise Givens rotation,” in 22nd International Conference on Very Large Scale Integration, VLSI-SoC, Playa del Carmen, Mexico, October 6-8, 2014, 2014, pp. 1–6. [Online]. Available: http://dx.doi.org/10.1109/VLSISoC.2014.7004166
 [21] F. Merchant, N. Choudhary, S. K. Nandy, and R. Narayan, “Efficient realization of table lookup based double precision floating point arithmetic,” in 29th International Conference on VLSI Design, VLSID 2016, Kolkata, India, January 4-8, 2016.
 [22] F. Merchant, A. Chattopadhyay, S. Raha, S. K. Nandy, and R. Narayan, “Accelerating BLAS and LAPACK via efficient floating point architecture design,” CoRR, vol. abs/1610.08705, 2016. [Online]. Available: http://arxiv.org/abs/1610.08705
 [23] F. Merchant, T. Vatwani, A. Chattopadhyay, S. Raha, S. K. Nandy, and R. Narayan, “Accelerating BLAS on custom architecture through algorithm-architecture co-design,” CoRR, vol. abs/1610.06385, 2016. [Online]. Available: http://arxiv.org/abs/1610.06385
 [24] ——, “Achieving efficient QR factorization by algorithm-architecture co-design of Householder transformation,” in 29th International Conference on VLSI Design, VLSID 2016, Kolkata, India, January 4-8, 2016.
 [25] M. D. Hill and M. R. Marty, “Amdahl’s law in the multicore era,” Computer, vol. 41, no. 7, pp. 33–38, Jul. 2008. [Online]. Available: http://dx.doi.org/10.1109/MC.2008.209
 [26] A. S. Householder, “Unitary triangularization of a nonsymmetric matrix,” J. ACM, vol. 5, no. 4, pp. 339–342, Oct. 1958. [Online]. Available: http://doi.acm.org.ezlibproxy1.ntu.edu.sg/10.1145/320941.320947
 [27] J. Dongarra, J. D. Croz, S. Hammarling, and I. S. Duff, “A set of level 3 basic linear algebra subprograms,” ACM Trans. Math. Softw., vol. 16, no. 1, pp. 1–17, 1990. [Online]. Available: http://doi.acm.org/10.1145/77626.79170
 [28] R. Schreiber and C. V. Loan, “A storage-efficient WY representation for products of Householder transformations,” SIAM Journal on Scientific and Statistical Computing, vol. 10, no. 1, pp. 53–57, 1989. [Online]. Available: http://dx.doi.org/10.1137/0910005
 [29] R. Schreiber and B. Parlett, “Block reflectors: Theory and computation,” SIAM Journal on Numerical Analysis, vol. 25, no. 1, pp. 189–205, 1988. [Online]. Available: http://dx.doi.org/10.1137/0725014
 [30] B. J. Smith, “R package magma: Matrix algebra on GPU and multicore architectures, version 0.2.2,” September 3, 2010, [Online] http://cran.r-project.org/package=magma.
 [31] A. YarKhan, J. Kurzak, and J. Dongarra, “Quark users’ guide: Queueing and runtime for kernels,” Innovative Computing Laboratory, University of Tennessee, Tech. Rep., 2011.
 [32] J. Gonzalez and R. C. Núñez, “LAPACKrc: Fast linear algebra kernels/solvers for FPGA accelerators,” Journal of Physics: Conference Series, vol. 180, no. 1, p. 012042, 2009. [Online]. Available: http://stacks.iop.org/1742-6596/180/i=1/a=012042
 [33] Y.-G. Tai, C.-T. D. Lo, and K. Psarris, “Scalable matrix decompositions with multiple cores on FPGAs,” Microprocessors and Microsystems, vol. 37, no. 8, Part B, pp. 887–898, 2013, Embedded Multicore Systems: Architecture, Performance and Application. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0141933112001007
 [34] J. A. Gunnels, C. Lin, G. Morrow, and R. A. van de Geijn, “A flexible class of parallel matrix multiplication algorithms,” in IPPS/SPDP, 1998, pp. 110–116. [Online]. Available: http://dx.doi.org/10.1109/IPPS.1998.669898
 [35] J. Wasniewski and J. Dongarra, “High performance linear algebra package for FORTRAN 90,” in Applied Parallel Computing, Large Scale Scientific and Industrial Problems, 4th International Workshop, PARA ’98, Umeå, Sweden, June 14-17, 1998, Proceedings, 1998, pp. 579–583. [Online]. Available: http://dx.doi.org/10.1007/BFb0095385
 [36] G. G. Lee, H.Y. Lin, C.F. Chen, and T.Y. Huang, “Quantifying intrinsic parallelism using linear algebra for algorithm/architecture coexploration,” IEEE Trans. Parallel Distrib. Syst., vol. 23, no. 5, pp. 944–957, May 2012. [Online]. Available: http://dx.doi.org/10.1109/TPDS.2011.230
 [37] M. Mahadurkar, F. Merchant, A. Maity, K. Vatwani, I. Munje, N. Gopalan, S. K. Nandy, and R. Narayan, “Co-exploration of NLA kernels and specification of compute elements in distributed memory CGRAs,” in XIVth International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, SAMOS 2014, Agios Konstantinos, Samos, Greece, July 14-17, 2014, 2014, pp. 225–232. [Online]. Available: http://dx.doi.org/10.1109/SAMOS.2014.6893215