Towards a Multi-array Architecture for Accelerating Large-scale Matrix Multiplication on FPGAs

by   Junzhong Shen, et al.

Large-scale floating-point matrix multiplication is a fundamental kernel in many scientific and engineering applications. Most existing work only focus on accelerating matrix multiplication on FPGA by adopting a linear systolic array. This paper towards the extension of this architecture by proposing a scalable and highly configurable multi-array architecture. In addition, we propose a work-stealing scheme to ensure the equality in the workload partition among multiple linear arrays. Furthermore, an analytical model is developed to determine the optimal design parameters. Experiments on a real-life convolutional neural network (CNN) show that we can obtain the optimal extension of the linear array architecture.



There are no comments yet.


page 1

page 2

page 3

page 4


General Matrix-Matrix Multiplication Using SIMD features of the PIII

Generalised matrix-matrix multiplication forms the kernel of many mathem...

Search for Optimal Systolic Arrays: A Comprehensive Automated Exploration Framework and Lessons Learned

Systolic arrays have been widely used for accelerating HPC and deep lear...

GNA: new framework for statistical data analysis

We report on the status of GNA — a new framework for fitting large-scale...

Large Scale Artificial Neural Network Training Using Multi-GPUs

This paper describes a method for accelerating large scale Artificial Ne...

Systolic Tensor Array: An Efficient Structured-Sparse GEMM Accelerator for Mobile CNN Inference

Convolutional neural network (CNN) inference on mobile devices demands e...

Lower Bounds on Rate of Convergence of Matrix Products in All Pairs Shortest Path of Social Network

With the rapid development of social network applications, social networ...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Large-scale floating-point matrix multiplication is widely used in many complicated computation tasks such as scientific computing [1]

and deep learning

[2]. Recently, field-programmable gate arrays (FPGAs) have become a particularly attractive option for accelerating large-scale matrix multiplication due to their reconfigurability and abundant logic resources. Previous studies [3, 4, 5, 6, 7, 8, 9] have primarily focused on accelerating matrix multiplication on FPGA by using an efficient architecture, i.e. the one-dimensional systolic array. This architecture was demonstrated successfully in matrix multiplication acceleration, which contributes to low bandwidth requirement and good scalability of the accelerator designs.

In this paper, we focus on the extension of the linear array architecture. According to our studies, there exist three approaches to extend the linear array architecture: 1. increasing the length of the linear array; 2. adopting multiple parallel linear arrays; 3. combining approaches 1 and 2. As a result, the design space is significantly expanded, which make it more difficult to reach the optimal solution than previous work. In addition, the computation efficiency of the accelerator is hard to be ensured if we adopt a fixed architecture for various problem sizes. This paper addresses these challenges by proposing a highly configurable and scalable multi-array architecture, which allows users to dynamically change the number of PEs in used as well as the number of parallel PE arrays. In addition, a work-stealing scheme is adopted to guarantee the equality of the workloads partition among the PE arrays. To determine the optimal design option for the proposed architecture, we also propose an analytical model to evaluate the realistic memory bandwidth as well as quantifying data traffic volumes.

The rest of this paper is organized as follows. We review the background in Section II. The proposed multi-array architecture is illustrated in Section III. Section IV presents the performance model. The experimental results are presented in Section V. Section VI concludes this paper.

Fig. 1: An overview of our proposed architecture.

Ii Background

In this paper, we focus on the dense matrix multiplication (MM) algorithm: , where matrix A , B and C , respectively. Equation 1 describes the computation pattern of this algorithm:


Since it is infeasible to store the entire input and output matrices on the FPGA, abundant researches [6, 5, 4, 10, 7, 11, 8] adopt a similar block matrix multiplication algorithm in their designs. Here, we introduce the block matrix multiplication algorithm in Dou [6], which has been proved to be successful in matrix multiplication acceleration. Initially, matrix A is split into sub-blocks (namely ) of size and matrix B is partitioned into sub-blocks (namely ) of size . In this way, the result matrix C can be calculated by performing sub-block matrix multiplications on the and , where and . The basic idea of this algorithm is to split the multiplication of and

into multiple inner-product operations of two vectors, i.e

and , where is the column of and is the row of (). Here we define as the product of and , then can be calculated by accumulating , ,.., and iteratively, shown as


Iii Proposed Architecture

Fig. 1 presents an overview of our proposed architecture for acceleration of matrix multiplication. Due to limited amount of on-chip memory on FPGAs, the source data and final results are stored in the off-chip memory (i.e. the DDR). It can be seen that the accelerator is composed of several modules, namely the Memory Access Controller (MAC), the Workload Queue Management (WQM), and the Matrices Processing Engine (MPE).

Iii-a MPE Design

The MPE is the kernel computation module of our architecture. As shown in Fig. 1, the MPE module consists of several linear arrays of PEs, in addition with some multiplexers placed between adjacent PE arrays.

We apply two operation modes in the two adjacent PE arrays, namely the Independent mode and the Cooperation mode. In the Independent mode, the multiplexer between the PE arrays is disabled, meaning that the PE arrays can execute computation tasks independently without any data communication. While in the Cooperation mode, the multiplexer between the PE arrays is enabled. As a result, the data paths of the PE arrays are connected by the multiplexer. As shown in Fig. 1, the PE array that placed behind a multiplier can fetch data from the proceeding PE array in this mode. In Cooperation mode, the required memory bandwidth of the PE arrays is lower since the PE arrays share the same memory interface when they are connected. In addition, larger block sizes can be supported in the Cooperation mode since the number of PEs in the connected array has increased. Note that the multipliers are initialized by the host CPU, meaning that the multi-array architecture is highly configurable. More importantly, our architecture preserves the scalability of the linear array architecture.

The fully pipelined structure of PE is presented in the right part of Fig. 1. It can be seen that the PE consists of two sets of data registers for input data buffering, three First-In-First-Outs (FIFOs) for delivering data between PEs, local memory for temperate data storing, and floating-point multiply-and-accumulate unit (). Different from previous studies, we implement additional control units to support arbitrary block size. In addition, we implement a phase synchronization unit () to guarantee the correctness of the final results when the block sizes for matrices A and B are different. By conditionally inserting stalls into the computation pipeline of the PE, the ensures that the column of and row of are fetched into each PE simultaneously. The dataflow in each PE consists of three stages:

Prefetch. In this stage, the PE picks up the corresponding element in (i.e. the first column of ) based on the PE identifier (PID), then stores the data into the local register . For instance, with will picks up the second element in .

Compute. In this stage, the row of (i.e. ) and the column of (i.e. ) are fetched into the PE simultaneously, where . The buffered element in is multiplied with all the elements of in order. Therefore, the data buffered in is reused times. In the meantime, the PE also buffers the corresponding data in into . Note that we apply double buffering in to overlap buffering data of the next iteration and computation of the current iteration. The products of the multipliers in FMAC are then added with the intermediate results generated in the previous iteration, which are stored in the local memory . Finally, the newly sums are written back into the memory . Note that in the last iteration (i.e. ), the final results are written into FIFO instead of memory .

Write back. In this stage, the PE sends its local results to the proceeding PE or the MAC module (only ) from the . As a results, the result data are delivered from the end of each independent PE arrays to the MAC module.

Iii-B WQM Design

The WQM module is responsible for workloads assignment for the PE arrays. It manages multiple workload queues to buffer the computation tasks for the PE arrays. Note that one workload queue corresponds to one PE array. For the proposed multi-array architecture, the steadiness of an even partition of workloads among PE arrays is the key to achieve better performance. However, it is difficult to ensure that the workloads are always equally partitioned during the whole computation procedures of PE arrays. For example, sometime the PE array which is assigned with less workloads could obtain higher memory bandwidth, which may worsen the inequality of workloads partition. As a result, the system performance would bottlenecked by the PE array with the most workloads. To address this issue, we adopt the work-stealing scheme [12] in the design of the WQM module.

The basic idea of the work-stealing scheme is to enable an idle PE array to acquire computation tasks from the overloaded PE array(s).

Fig. 2: Illustration of our proposed work-stealing scheme.

Fig. 2 depicts the working procedure of the work-stealing scheme. Note that a controller (omitted in Fig. 2) is designed to manage the workload delivery among the workload queues. It can be seen that a counter is implemented to record the number of tasks for each workload queue. Once the controller detects that a workload queue becomes empty, it will immediately steal a task from a non-empty queue, then load it into the empty workload queue. If there exist more than one non-empty queues, the controller will select the workload queue that with the most workloads as target, by comparing the corresponding counters of the workload queues. It is important to note that we implement a round-robin arbiter in the controller to arbitrate multiple concurrent work-stealing requests. The controller repeats the detection and arbitration during the entire computation procedure of the PE arrays.

Iii-C MAC Design

The MAC module is responsible for managing data transfer between the external memory and the accelerator. As shown in Fig. 2, the workloads executed by the MAC module are organized by a self-defined data structure named buffer descriptor. A buffer descriptor contains the following parameters: specifies the memory locations that store the sub-matrices;

specifies the stride of each memory transfer;

specifies the block sizes and specifies the iteration ().

As mentioned in the above context, elements of matrix A are fetched into the PE arrays in column-major order. However, the matrix A is stored in row-major order. Therefore, the access of matrix A may cause inefficient memory bandwidth utilization. To improve the effective memory bandwidth, we transpose matrix A to allow its data to be fetched in row-major order. In this way, the burst transfer mode that favored by the external memory can be used to access both matrices A and B. As a result, the memory bandwidth for the accelerator is significantly improved, which contributes to performance improvement of the overall system.

Iv Performance Modeling

In this section, we will illustrate how to determine the optimal solution of mapping the block matrix multiplication algorithm onto the multi-array architecture.

Let the bandwidth of the off-chip memory be (MB/s), the number of PE in a single PE array be (when all the multiplexers are disabled), the number of PE arrays work in parallel be , block size of the matrix A (on rows) be and block size of the matrix B (on columns) be . For A of size and B of size , the average number of sub-block matrix multiplications performed by each PE array can be expressed as


Note that we pad matrices

A and B with zeros if and are not integer multiples of and . In addition, the time (in seconds) taken to load a workload (i.e. and ) and write back the corresponding can be calculated by


To simplify the model, we assume that all the workloads are equally partitioned. Therefore, the time taken to transfer data between the external memory and the PE arrays can be expressed as


According to the data path described in section III, the computation time (in seconds) of a single PE array can be determined as follows:


where denotes the stages of the computation pipeline in each PE, and

is the working frequency of the accelerator. Since the memory access and computation process are overlapped in our architecture, it is difficult to directly estimate the execution time of the accelerator. However, the lower bound and upper bound of the execution time

can be determined by


To simplify the discussion on the parameters that affect , we assume for the rest of this paper. It can be inferred that the attainable memory bandwidth for each PE array is mainly affected by and , which can be expressed by


This is because determines te burst length of memory access, and affects the conflicts of memory accesses of the PE arrays. From equations 3, 4, 5, 6 and 8, it can be seen that and are the key factors that affect the performance of our accelerator. To reduce the size of design space, we consider the constraints on and . We observe that there exists a relationship between and . For better understanding, we denote as the maximum number of the independent PE arrays (when all the multiplexers are disabled), then the relationship between and can be determined as


To this end, the size of design space can be reduced with the assistance of equation 9. Given the fixed problem size and (i.e. the total number of PEs), the proposed analytical model can be used to determine the optimal that minimizes the range of .

V Experimental Results

The FPGA platform used in our experiment is the Xilinx VC709 board, which equips with a XC7VX690T FPGA and two DDR3 DRAMs. In addition, all synthesized results are obtained from Xilinx Vivado 2016.4.


DSP48Es BRAMs Flip-Flops LUTs
Utilization 1032 560.50 292016 192493
percentage(%) 28.67 38.13 33.70 44.44
TABLE I: Post-synthesis resource utilization.

In order to quantify function , we evaluate the average effective memory bandwidth of a PE array in terms of block sizes and number of PE arrays. As shown in Fig. 3, two observations can be found. First, the effective memory bandwidth goes up with the increase of block size. Second, the effective bandwidth declines when we increase the number of PE arrays.

Fig. 3: Effective memory bandwidth.
Fig. 4: Comparison of the estimated execution time and actual execution time for conv-2.

As a case study, we use a real-life CNN model, AlexNet [13], to validate our analytical model. Note that AlexNet comprises of five convolutional layers and three fully connected layers, and the computation patterns of these layers can be converted to matrix multiplication [14]. Note that we are mainly focus on determining the optimal design parameters under fixed and , therefore we do not make full use of the resource of the FPGA to pursuit maximum performance. In our experiment, we set and . After post-synthesis, a maximum frequency of 200 MHz () is achieved. Table I summarized the resource utilization of the overall system. It can be seen that the overall resource utilization is below 50%, which contributes to the high frequency of our accelerator.

We give a detailed comparison between the predicted and actual execution time for conv-2 in Fig. 4. It can be seen that the predicted lower bound of execution time closely follows the actual measurement when the memory requirement of each PE array is satisfied. However, when the memory bandwidth requirement is unsatisfied, the actual time of each PE array becomes more close to the upper bound of predicted execution time. In addition, it can also be found that using multiple PE arrays does not ensure the optimal performance. For example, the case of achieves lower execution time than the case of . The main reason is that both of the cases are memory-bound ( 1.6 GB/s), and the case of can reach higher memory bandwidth (it can be confirmed by Fig. 3), which contributes to its higher performance.

Layers Optimal Performance (GFLOPS)
Conv-1 96*363*3025 (2,128) 59.7 57.1 49.2
Conv-2 128*1200*729 (2,128) 87.8 70.3 61.4
Conv-3 384*2304*169 (2, 96) 64.9 62.9 57.4
Conv-4 192*1728*169 (2, 96) 64.1 54.8 51.2
Conv-5 128*1728*169 (2,128) 62.9 44.9 43.9
fc-6 128*9216*4096 (2,128) 100.9 79.3 70.7
fc-7 128*4096*4096 (2,128) 99.3 78.1 69.5
fc-8 128*4096*1000 (2,128) 96.9 83.6 67.8
TABLE II: Optimal of all layers in AlexNet.

The optimal of all the layers in AlexNet is given in Table II. It can be seen that when compared to other extension approaches, i.e extending the number of PEs only () and extending the number of PE arrays only (), our accelerator that implemented with the optimal achieves the highest performance for all layers. In addition, our accelerator achieves 100.9 GFLOPS for fc-6, which demonstrates that our multi-array architecture can achieve high ratio (i.e. up to 98.6%) of sustained performance to theoretical peak performance (which is denoted by [6]).

Vi Conclusion

In this paper, we focus on the architecture extension of the linear array architecture for matrix multiplication on FPGA, by proposing a highly configurable and scalable multi-array architecture. We employ a work-stealing scheme to realize workload balancing among PE arrays. An efficient analytical model is developed to determine the optimal design options for the architecture extension. Experimental results show that our optimal extension of the linear array architecture can reach the highest performance and computation efficiency.


  • [1] G. Govindu, R. Scrofano, and V. K. Prasanna, “A library of parameterizable floating-point cores for fpgas and their application to scientific computing,” in Proceedings of the International Conference on Engineering Reconfigurable Systems and Algorithms, 2005, pp. 137–148.
  • [2] G. Lei, Y. Dou, R. Li, and F. Xia, “An fpga implementation for solving the large single-source-shortest-path problem,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 63, no. 5, pp. 473–477, 2016.
  • [3] J.-w. Jang, S. Choi, and V. Prasanna, “Area and time efficient implementations of matrix multiplication on FPGAs,” in Proc. IEEE International Conference on Field-Programmable Technology, 2002. (FPT’02), Dec. 2002, pp. 93–100.
  • [4] L. Zhuo and V. Prasanna, “Scalable and modular algorithms for floating-point matrix multiplication on FPGAs,” in Proc. 18th International Parallel and Distributed Processing Symposium, Apr. 2004.
  • [5] J.-w. Jang, S. Choi, and V. Prasanna, “Energy- and time-efficient matrix multiplication on FPGAs,” IEEE Trans. VLSI Syst., vol. 13, no. 11, pp. 1305–1319, 2005.
  • [6] Y. Dou, S. Vassiliadis, G. K. Kuzmanov, and G. N. Gaydadjiev, “64-bit floating-point FPGA matrix multiplication,” in Proc. ACM International Symposium on Field-programmable Gate Arrays(FPGA’05), New York, NY, USA, 2005, pp. 86–95.
  • [7] V. Kumar, S. Joshi, S. Patkar, and H. Narayanan, “FPGA based high performance double-precision matrix multiplication,” in Proc. International Conference on VLSI Design (VLSI’09), 2009, pp. 341–346.
  • [8] Z. Jovanovic and V. Milutinovic, “FPGA accelerator for floating-point matrix multiplication,” IET Computers & Digital Techniques, vol. 6, no. 4, pp. 249–256, 2012.
  • [9] Z. Liu, Y. Dou, J. Jiang, J. Xu, S. Li, Y. Zhou, and Y. Xu, “Throughput-optimized fpga accelerator for deep convolutional neural networks,” ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 10, no. 3, p. 17, 2017.
  • [10] L. Zhuo and V. K. Prasanna, “Scalable and modular algorithms for floating-point matrix multiplication on reconfigurable computing systems,” Parallel and Distributed Systems, IEEE Transactions on, vol. 18, no. 4, pp. 433–448, 2007.
  • [11] G. Wu, Y. Dou, and M. Wang, “High performance and memory efficient implementation of matrix multiplication on FPGAs,” in IEEE International Conference on Field-Programmable Technology (FPT’2010), 2010, pp. 134–137.
  • [12] R. D. Blumofe and C. E. Leiserson, “Scheduling multithreaded computations by work stealing,” Journal of the ACM (JACM), vol. 46, no. 5, pp. 720–748, 1999.
  • [13]

    A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in

    Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [14] J. Cong and B. Xiao, “Minimizing computation in convolutional neural networks,” in Proc. International Conference on Artifical Neural Networks, 2014, pp. 281–290.