I Introduction
Largescale floatingpoint matrix multiplication is widely used in many complicated computation tasks such as scientific computing [1]
and deep learning
[2]. Recently, fieldprogrammable gate arrays (FPGAs) have become a particularly attractive option for accelerating largescale matrix multiplication due to their reconfigurability and abundant logic resources. Previous studies [3, 4, 5, 6, 7, 8, 9] have primarily focused on accelerating matrix multiplication on FPGA by using an efficient architecture, i.e. the onedimensional systolic array. This architecture was demonstrated successfully in matrix multiplication acceleration, which contributes to low bandwidth requirement and good scalability of the accelerator designs.In this paper, we focus on the extension of the linear array architecture. According to our studies, there exist three approaches to extend the linear array architecture: 1. increasing the length of the linear array; 2. adopting multiple parallel linear arrays; 3. combining approaches 1 and 2. As a result, the design space is significantly expanded, which make it more difficult to reach the optimal solution than previous work. In addition, the computation efficiency of the accelerator is hard to be ensured if we adopt a fixed architecture for various problem sizes. This paper addresses these challenges by proposing a highly configurable and scalable multiarray architecture, which allows users to dynamically change the number of PEs in used as well as the number of parallel PE arrays. In addition, a workstealing scheme is adopted to guarantee the equality of the workloads partition among the PE arrays. To determine the optimal design option for the proposed architecture, we also propose an analytical model to evaluate the realistic memory bandwidth as well as quantifying data traffic volumes.
The rest of this paper is organized as follows. We review the background in Section II. The proposed multiarray architecture is illustrated in Section III. Section IV presents the performance model. The experimental results are presented in Section V. Section VI concludes this paper.
Ii Background
In this paper, we focus on the dense matrix multiplication (MM) algorithm: , where matrix A , B and C , respectively. Equation 1 describes the computation pattern of this algorithm:
(1) 
Since it is infeasible to store the entire input and output matrices on the FPGA, abundant researches [6, 5, 4, 10, 7, 11, 8] adopt a similar block matrix multiplication algorithm in their designs. Here, we introduce the block matrix multiplication algorithm in Dou [6], which has been proved to be successful in matrix multiplication acceleration. Initially, matrix A is split into subblocks (namely ) of size and matrix B is partitioned into subblocks (namely ) of size . In this way, the result matrix C can be calculated by performing subblock matrix multiplications on the and , where and . The basic idea of this algorithm is to split the multiplication of and
into multiple innerproduct operations of two vectors, i.e
and , where is the column of and is the row of (). Here we define as the product of and , then can be calculated by accumulating , ,.., and iteratively, shown as(2) 
Iii Proposed Architecture
Fig. 1 presents an overview of our proposed architecture for acceleration of matrix multiplication. Due to limited amount of onchip memory on FPGAs, the source data and final results are stored in the offchip memory (i.e. the DDR). It can be seen that the accelerator is composed of several modules, namely the Memory Access Controller (MAC), the Workload Queue Management (WQM), and the Matrices Processing Engine (MPE).
Iiia MPE Design
The MPE is the kernel computation module of our architecture. As shown in Fig. 1, the MPE module consists of several linear arrays of PEs, in addition with some multiplexers placed between adjacent PE arrays.
We apply two operation modes in the two adjacent PE arrays, namely the Independent mode and the Cooperation mode. In the Independent mode, the multiplexer between the PE arrays is disabled, meaning that the PE arrays can execute computation tasks independently without any data communication. While in the Cooperation mode, the multiplexer between the PE arrays is enabled. As a result, the data paths of the PE arrays are connected by the multiplexer. As shown in Fig. 1, the PE array that placed behind a multiplier can fetch data from the proceeding PE array in this mode. In Cooperation mode, the required memory bandwidth of the PE arrays is lower since the PE arrays share the same memory interface when they are connected. In addition, larger block sizes can be supported in the Cooperation mode since the number of PEs in the connected array has increased. Note that the multipliers are initialized by the host CPU, meaning that the multiarray architecture is highly configurable. More importantly, our architecture preserves the scalability of the linear array architecture.
The fully pipelined structure of PE is presented in the right part of Fig. 1. It can be seen that the PE consists of two sets of data registers for input data buffering, three FirstInFirstOuts (FIFOs) for delivering data between PEs, local memory for temperate data storing, and floatingpoint multiplyandaccumulate unit (). Different from previous studies, we implement additional control units to support arbitrary block size. In addition, we implement a phase synchronization unit () to guarantee the correctness of the final results when the block sizes for matrices A and B are different. By conditionally inserting stalls into the computation pipeline of the PE, the ensures that the column of and row of are fetched into each PE simultaneously. The dataflow in each PE consists of three stages:
Prefetch. In this stage, the PE picks up the corresponding element in (i.e. the first column of ) based on the PE identifier (PID), then stores the data into the local register . For instance, with will picks up the second element in .
Compute. In this stage, the row of (i.e. ) and the column of (i.e. ) are fetched into the PE simultaneously, where . The buffered element in is multiplied with all the elements of in order. Therefore, the data buffered in is reused times. In the meantime, the PE also buffers the corresponding data in into . Note that we apply double buffering in to overlap buffering data of the next iteration and computation of the current iteration. The products of the multipliers in FMAC are then added with the intermediate results generated in the previous iteration, which are stored in the local memory . Finally, the newly sums are written back into the memory . Note that in the last iteration (i.e. ), the final results are written into FIFO instead of memory .
Write back. In this stage, the PE sends its local results to the proceeding PE or the MAC module (only ) from the . As a results, the result data are delivered from the end of each independent PE arrays to the MAC module.
IiiB WQM Design
The WQM module is responsible for workloads assignment for the PE arrays. It manages multiple workload queues to buffer the computation tasks for the PE arrays. Note that one workload queue corresponds to one PE array. For the proposed multiarray architecture, the steadiness of an even partition of workloads among PE arrays is the key to achieve better performance. However, it is difficult to ensure that the workloads are always equally partitioned during the whole computation procedures of PE arrays. For example, sometime the PE array which is assigned with less workloads could obtain higher memory bandwidth, which may worsen the inequality of workloads partition. As a result, the system performance would bottlenecked by the PE array with the most workloads. To address this issue, we adopt the workstealing scheme [12] in the design of the WQM module.
The basic idea of the workstealing scheme is to enable an idle PE array to acquire computation tasks from the overloaded PE array(s).
Fig. 2 depicts the working procedure of the workstealing scheme. Note that a controller (omitted in Fig. 2) is designed to manage the workload delivery among the workload queues. It can be seen that a counter is implemented to record the number of tasks for each workload queue. Once the controller detects that a workload queue becomes empty, it will immediately steal a task from a nonempty queue, then load it into the empty workload queue. If there exist more than one nonempty queues, the controller will select the workload queue that with the most workloads as target, by comparing the corresponding counters of the workload queues. It is important to note that we implement a roundrobin arbiter in the controller to arbitrate multiple concurrent workstealing requests. The controller repeats the detection and arbitration during the entire computation procedure of the PE arrays.
IiiC MAC Design
The MAC module is responsible for managing data transfer between the external memory and the accelerator. As shown in Fig. 2, the workloads executed by the MAC module are organized by a selfdefined data structure named buffer descriptor. A buffer descriptor contains the following parameters: specifies the memory locations that store the submatrices;
specifies the stride of each memory transfer;
specifies the block sizes and specifies the iteration ().As mentioned in the above context, elements of matrix A are fetched into the PE arrays in columnmajor order. However, the matrix A is stored in rowmajor order. Therefore, the access of matrix A may cause inefficient memory bandwidth utilization. To improve the effective memory bandwidth, we transpose matrix A to allow its data to be fetched in rowmajor order. In this way, the burst transfer mode that favored by the external memory can be used to access both matrices A and B. As a result, the memory bandwidth for the accelerator is significantly improved, which contributes to performance improvement of the overall system.
Iv Performance Modeling
In this section, we will illustrate how to determine the optimal solution of mapping the block matrix multiplication algorithm onto the multiarray architecture.
Let the bandwidth of the offchip memory be (MB/s), the number of PE in a single PE array be (when all the multiplexers are disabled), the number of PE arrays work in parallel be , block size of the matrix A (on rows) be and block size of the matrix B (on columns) be . For A of size and B of size , the average number of subblock matrix multiplications performed by each PE array can be expressed as
(3) 
Note that we pad matrices
A and B with zeros if and are not integer multiples of and . In addition, the time (in seconds) taken to load a workload (i.e. and ) and write back the corresponding can be calculated by(4) 
To simplify the model, we assume that all the workloads are equally partitioned. Therefore, the time taken to transfer data between the external memory and the PE arrays can be expressed as
(5) 
According to the data path described in section III, the computation time (in seconds) of a single PE array can be determined as follows:
(6) 
where denotes the stages of the computation pipeline in each PE, and
is the working frequency of the accelerator. Since the memory access and computation process are overlapped in our architecture, it is difficult to directly estimate the execution time of the accelerator. However, the lower bound and upper bound of the execution time
can be determined by(7) 
To simplify the discussion on the parameters that affect , we assume for the rest of this paper. It can be inferred that the attainable memory bandwidth for each PE array is mainly affected by and , which can be expressed by
(8) 
This is because determines te burst length of memory access, and affects the conflicts of memory accesses of the PE arrays. From equations 3, 4, 5, 6 and 8, it can be seen that and are the key factors that affect the performance of our accelerator. To reduce the size of design space, we consider the constraints on and . We observe that there exists a relationship between and . For better understanding, we denote as the maximum number of the independent PE arrays (when all the multiplexers are disabled), then the relationship between and can be determined as
(9) 
To this end, the size of design space can be reduced with the assistance of equation 9. Given the fixed problem size and (i.e. the total number of PEs), the proposed analytical model can be used to determine the optimal that minimizes the range of .
V Experimental Results
The FPGA platform used in our experiment is the Xilinx VC709 board, which equips with a XC7VX690T FPGA and two DDR3 DRAMs. In addition, all synthesized results are obtained from Xilinx Vivado 2016.4.
Resource 
DSP48Es  BRAMs  FlipFlops  LUTs 

Utilization  1032  560.50  292016  192493 
percentage(%)  28.67  38.13  33.70  44.44 
In order to quantify function , we evaluate the average effective memory bandwidth of a PE array in terms of block sizes and number of PE arrays. As shown in Fig. 3, two observations can be found. First, the effective memory bandwidth goes up with the increase of block size. Second, the effective bandwidth declines when we increase the number of PE arrays.
As a case study, we use a reallife CNN model, AlexNet [13], to validate our analytical model. Note that AlexNet comprises of five convolutional layers and three fully connected layers, and the computation patterns of these layers can be converted to matrix multiplication [14]. Note that we are mainly focus on determining the optimal design parameters under fixed and , therefore we do not make full use of the resource of the FPGA to pursuit maximum performance. In our experiment, we set and . After postsynthesis, a maximum frequency of 200 MHz () is achieved. Table I summarized the resource utilization of the overall system. It can be seen that the overall resource utilization is below 50%, which contributes to the high frequency of our accelerator.
We give a detailed comparison between the predicted and actual execution time for conv2 in Fig. 4. It can be seen that the predicted lower bound of execution time closely follows the actual measurement when the memory requirement of each PE array is satisfied. However, when the memory bandwidth requirement is unsatisfied, the actual time of each PE array becomes more close to the upper bound of predicted execution time. In addition, it can also be found that using multiple PE arrays does not ensure the optimal performance. For example, the case of achieves lower execution time than the case of . The main reason is that both of the cases are memorybound ( 1.6 GB/s), and the case of can reach higher memory bandwidth (it can be confirmed by Fig. 3), which contributes to its higher performance.
Layers  Optimal  Performance (GFLOPS)  

Optimal  
Conv1  96*363*3025  (2,128)  59.7  57.1  49.2 
Conv2  128*1200*729  (2,128)  87.8  70.3  61.4 
Conv3  384*2304*169  (2, 96)  64.9  62.9  57.4 
Conv4  192*1728*169  (2, 96)  64.1  54.8  51.2 
Conv5  128*1728*169  (2,128)  62.9  44.9  43.9 
fc6  128*9216*4096  (2,128)  100.9  79.3  70.7 
fc7  128*4096*4096  (2,128)  99.3  78.1  69.5 
fc8  128*4096*1000  (2,128)  96.9  83.6  67.8 
The optimal of all the layers in AlexNet is given in Table II. It can be seen that when compared to other extension approaches, i.e extending the number of PEs only () and extending the number of PE arrays only (), our accelerator that implemented with the optimal achieves the highest performance for all layers. In addition, our accelerator achieves 100.9 GFLOPS for fc6, which demonstrates that our multiarray architecture can achieve high ratio (i.e. up to 98.6%) of sustained performance to theoretical peak performance (which is denoted by [6]).
Vi Conclusion
In this paper, we focus on the architecture extension of the linear array architecture for matrix multiplication on FPGA, by proposing a highly configurable and scalable multiarray architecture. We employ a workstealing scheme to realize workload balancing among PE arrays. An efficient analytical model is developed to determine the optimal design options for the architecture extension. Experimental results show that our optimal extension of the linear array architecture can reach the highest performance and computation efficiency.
References
 [1] G. Govindu, R. Scrofano, and V. K. Prasanna, “A library of parameterizable floatingpoint cores for fpgas and their application to scientific computing,” in Proceedings of the International Conference on Engineering Reconfigurable Systems and Algorithms, 2005, pp. 137–148.
 [2] G. Lei, Y. Dou, R. Li, and F. Xia, “An fpga implementation for solving the large singlesourceshortestpath problem,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 63, no. 5, pp. 473–477, 2016.
 [3] J.w. Jang, S. Choi, and V. Prasanna, “Area and time efficient implementations of matrix multiplication on FPGAs,” in Proc. IEEE International Conference on FieldProgrammable Technology, 2002. (FPT’02), Dec. 2002, pp. 93–100.
 [4] L. Zhuo and V. Prasanna, “Scalable and modular algorithms for floatingpoint matrix multiplication on FPGAs,” in Proc. 18th International Parallel and Distributed Processing Symposium, Apr. 2004.
 [5] J.w. Jang, S. Choi, and V. Prasanna, “Energy and timeefficient matrix multiplication on FPGAs,” IEEE Trans. VLSI Syst., vol. 13, no. 11, pp. 1305–1319, 2005.
 [6] Y. Dou, S. Vassiliadis, G. K. Kuzmanov, and G. N. Gaydadjiev, “64bit floatingpoint FPGA matrix multiplication,” in Proc. ACM International Symposium on Fieldprogrammable Gate Arrays(FPGA’05), New York, NY, USA, 2005, pp. 86–95.
 [7] V. Kumar, S. Joshi, S. Patkar, and H. Narayanan, “FPGA based high performance doubleprecision matrix multiplication,” in Proc. International Conference on VLSI Design (VLSI’09), 2009, pp. 341–346.
 [8] Z. Jovanovic and V. Milutinovic, “FPGA accelerator for floatingpoint matrix multiplication,” IET Computers & Digital Techniques, vol. 6, no. 4, pp. 249–256, 2012.
 [9] Z. Liu, Y. Dou, J. Jiang, J. Xu, S. Li, Y. Zhou, and Y. Xu, “Throughputoptimized fpga accelerator for deep convolutional neural networks,” ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 10, no. 3, p. 17, 2017.
 [10] L. Zhuo and V. K. Prasanna, “Scalable and modular algorithms for floatingpoint matrix multiplication on reconfigurable computing systems,” Parallel and Distributed Systems, IEEE Transactions on, vol. 18, no. 4, pp. 433–448, 2007.
 [11] G. Wu, Y. Dou, and M. Wang, “High performance and memory efficient implementation of matrix multiplication on FPGAs,” in IEEE International Conference on FieldProgrammable Technology (FPT’2010), 2010, pp. 134–137.
 [12] R. D. Blumofe and C. E. Leiserson, “Scheduling multithreaded computations by work stealing,” Journal of the ACM (JACM), vol. 46, no. 5, pp. 720–748, 1999.

[13]
A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in
Advances in neural information processing systems, 2012, pp. 1097–1105.  [14] J. Cong and B. Xiao, “Minimizing computation in convolutional neural networks,” in Proc. International Conference on Artifical Neural Networks, 2014, pp. 281–290.
Comments
There are no comments yet.