AccD: A Compiler-based Framework for Accelerating Distance-related Algorithms on CPU-FPGA Platforms

08/26/2019 ∙ by Yuke Wang, et al. ∙ The Regents of the University of California

As a promising solution to boost the performance of distance-related algorithms (e.g., K-means and KNN), FPGA-based acceleration attracts lots of attention, but also comes with numerous challenges. In this work, we propose AccD, a compiler-based framework for accelerating distance-related algorithms on CPU-FPGA platforms. Specifically, AccD provides a Domain-specific Language to unify distance-related algorithms effectively, and an optimizing compiler to reconcile the benefits from both the algorithmic optimization on the CPU and the hardware acceleration on the FPGA. The output of AccD is a high-performance and power-efficient design that can be easily synthesized and deployed on mainstream CPU-FPGA platforms. Intensive experiments show that AccD designs achieve 31.42x speedup and 99.63x better energy efficiency on average over standard CPU-based implementations.


I Introduction

Distance-related algorithms (e.g., K-means [LloydKMeans], KNN [altman1992introduction], and N-body Simulation [NBody-simulation]) play a vital role in many domains, including machine learning and computational physics. However, these algorithms often come with high computation complexity, leading to poor performance and limited applicability. To improve their performance, FPGA-based acceleration has attracted considerable interest from both industry and the research community, given its high performance and energy efficiency. However, accelerating distance-related algorithms on FPGAs requires non-trivial effort, including hardware expertise, time, and monetary cost. While existing works try to ease this process, they inevitably fall short in one of the following aspects.

Rely on problem-specific design and optimization while missing effective generalization. There is no unified abstraction to formalize the definition and optimization of distance-related algorithms systematically. Most previous hardware designs and optimizations [KMeansMicroarray, lin2012k, kdtreeKMeanscolorimage, KNNfpgahls] are heavily coded for a specific algorithm (e.g., K-means) and cannot be shared across different distance-related algorithms. Moreover, these "hard-coded" strategies can fail to keep up with ever-changing upper-level algorithmic optimizations and underlying hardware settings, which results in a large cost of re-design and re-implementation as the design evolves.

Lack of algorithm-hardware co-design. Previous algorithmic [elkan2003using, ding2015yinyang] and hardware optimizations [lin2012k, kdtreeKMeanscolorimage, KMeansMicroarray, multicoreKMeans, KNNfpgahls] are usually applied separately instead of being combined collaboratively. Existing algorithmic optimizations, most of which are based on Triangle Inequality (TI) [elkan2003using, ding2015yinyang, Topframework, chen2017sweet], are crafted for sequential CPUs. Although they remove a large number of distance computations, they also incur high computation irregularity and memory overhead. Therefore, directly applying these algorithmic optimizations to massively parallel platforms without appropriate hardware-aware adaptation could lead to inferior performance.

Count on FPGAs as the only source of acceleration. Previous works [ParallelArchitecturesKNN, IPcoresKNN, ParameterizedKMeans, Lavenier00fpgaimplementation, KMeansMicroarray, KNNfpgahls] place the whole algorithm on the FPGA accelerator without considering the assists from the computing resource on the host CPU. As a result, their designs are usually limited by the on-chip memory and computing elements, and cannot fully exploit the power of the FPGA. Moreover, they miss the full performance benefits from the heterogeneous computing paradigm, such as using the CPU for complex logic and control operations while offloading the compute-intensive tasks to the FPGA.

Lack of well-structured design workflow. Previous works [ParallelArchitecturesKNN, ParameterizedKMeans, kdtreeKMeanscolorimage, lin2012k, KNNfpgahls] follow the traditional way of hardware implementation and require intensive user involvement in hardware design, implementation, and extra manual tuning process, which usually takes long development-to-validation cycles. Also, the problem-specific strategy leads to a case-by-case design process, which cannot be widely applied to handle different problem settings.

Fig. 1: AccD Overview.

To this end, we present a compiler-based optimization framework, AccD, to automatically accelerate distance-related algorithms on the CPU-FPGA platform (shown in Figure 1). First, AccD provides a Distance-related Domain-Specific Language (DDSL) as a problem-independent abstraction to unify the description and optimization of various distance-related algorithms. With the help of the DDSL, end users can easily create highly efficient CPU-FPGA designs by focusing only on the high-level problem specification, without touching the algorithmic optimization or hardware implementation.

Second, AccD offers a novel algorithmic-hardware co-optimization scheme to reconcile the acceleration from both sides. At the algorithmic level, AccD incorporates a novel Generalized Triangle Inequality (GTI) optimization to eliminate unnecessary distance computations, while maintaining the computation regularity to a large extent. At the hardware level, AccD employs a specialized data layout to enforce memory coalescing and an optimized distance computation kernel to accelerate the distance computations on the FPGA.

Third, AccD leverages both the host and accelerator sides of the CPU-FPGA heterogeneous system for acceleration. In particular, AccD assigns the algorithm-level optimization (e.g., data grouping and distance computation filtering) to the CPU, since it consists of complex operations and execution dependencies but offers little pipelining and parallelism. On the other hand, AccD assigns the hardware-level acceleration (e.g., distance computations) to the FPGA, since it is composed of simple and vectorizable operations. This mapping capitalizes on the strength of the CPU in managing control-intensive tasks and the strength of the FPGA in accelerating computation-intensive workloads.

Lastly, the AccD compiler integrates an intelligent Design Space Explorer (DSE) to pinpoint the "optimal" design for different problem settings. In general, there is no "one size fits all" solution: the best configuration for algorithmic and hardware optimization differs across distance-related algorithms and across inputs of the same distance-related algorithm. To produce a high-quality optimization configuration automatically and efficiently, the DSE combines design modeling (performance and resource) with a Genetic Algorithm to facilitate the design space search.

Overall, our contributions are:

  • We propose the first optimization framework that can automatically optimize and generate high-performance and power-efficient designs of distance-related algorithms on CPU-FPGA heterogeneous computing platforms.

  • We develop a Domain-specific Language, DDSL, to unify different distance-related algorithms in an effective and succinct manner, laying the foundation for general optimizations across different problems.

  • We build an optimizing compiler for the DDSL, which automatically reconciles the benefits from both the algorithmic optimization on CPU and hardware acceleration on FPGA.

  • Intensive experiments on several popular algorithms across a wide spectrum of datasets show that AccD-generated CPU-FPGA designs achieve 31.42x speedup and 99.63x better energy efficiency on average compared with standard CPU-based implementations.

II Related Work

Previous research accelerates distance-related algorithms in two aspects: Algorithmic Optimization and Hardware Acceleration. More details are discussed in the following subsections.

II-A Algorithmic Optimization

From the algorithmic standpoint, previous research highlights two optimizations. The first one is KD-tree based optimization [KD-TreeKMeans, efficientKmeans, KNNJoinsDataStreams, 5952342, Zhong:2013:GEI:2505515.2505749], which relies on storing points in special data structures to enable nearest neighbor search without computing distances to all target points. These methods often deliver performance improvement [KD-TreeKMeans, efficientKmeans, KNNJoinsDataStreams, 5952342, Zhong:2013:GEI:2505515.2505749] compared with the unoptimized versions in low dimensional space, while suffering from a serious performance degradation when handling large datasets with high dimension () due to their exponentially-increased memory and computation overhead.

The second one is TI-based optimization [elkan2003using, ding2015yinyang, Topframework, chen2017sweet], which replaces computation-expensive distance computations with cheaper bound computations and has demonstrated its flexibility and scalability. It not only reduces the computation complexity at different levels of granularity but is also more adaptive and robust to datasets with a wide range of sizes and dimensions. However, most existing works focus on one specific algorithm (e.g., KNN [chen2017sweet], K-means [elkan2003using, ding2015yinyang], etc.) and lack extensibility and generality across different distance-related problems. An exception is a recent work, TOP [Topframework], which builds a unified framework to optimize various distance-related problems with pure TI optimization on CPUs. Our work shares a similar high-level motivation, but targets a more challenging scenario: algorithmic and hardware co-optimization on CPU-FPGA platforms.

II-B Hardware Acceleration

From the hardware perspective, several FPGA accelerator designs have been proposed, but still suffer from some major limitations.

First, previous FPGA designs are generally built for a specific distance-related algorithm and specific hardware. For example, the works in [KMeansMicroarray, kdtreeKMeanscolorimage, lin2012k] target K-means FPGA acceleration, while the works in [KNNfpgahls, IPcoresKNN, ParallelArchitecturesKNN] focus on KNN. Moreover, previous designs [lin2012k, KMeansMicroarray] usually assume that the dataset fits entirely into the FPGA on-chip memory, and they are only evaluated on a limited number of small datasets. For example, in [lin2012k], K-means acceleration is evaluated on a micro-array dataset with only 2,905 points. These designs often encounter portability issues when transferred to different settings. Besides, these "hard-coded" designs and optimizations make a fair comparison among different designs difficult, which hampers future studies in this direction.

The second problem with previous works is that they fail to incorporate algorithmic optimizations in the hardware design. For example, the works in [KMeansMicroarray, ParallelArchitecturesKNN, kdtreeKMeanscolorimage, KNNfpgahls] directly port the standard K-means and KNN algorithms to the FPGA and only apply hardware-level optimization. One exception is a recent work [KPynq], which proposes combining TI optimization and FPGA acceleration for K-means. It gives a considerable speedup compared to state-of-the-art methods, showcasing the great opportunity of applying algorithm-hardware co-optimization. Nevertheless, this idea is far from well explored, possibly because it requires domain knowledge and expertise in both algorithms and hardware to combine the two effectively.

In addition, previous works largely focus on the traditional hardware design flow, which requires a long implementation cycle and huge manual efforts. For example, works from [ParameterizedKMeans, KMeansMicroarray, multicoreKMeans, kdtreeKMeanscolorimage, ICSICT2016, Lavenier00fpgaimplementation, adaptiveKNNPartialReconfiguration, adaptiveKNN] build the design based on VHDL/Verilog design flow, which requires hardware expertise and over months of arduous development. In contrast, our AccD design flow brings significant advantages of programmability and flexibility due to its high-level OpenCL-based programming model, which minimizes the user involvement in the tedious hardware design process.

III Distance-related Algorithm Domain-Specific Language (DDSL)

Distance-related algorithms share commonalities across different application domains and scenarios, even though they look different in their high-level algorithmic description. Therefore, it is possible to generalize these distance-related algorithms. AccD framework defines a DDSL, which provides a high-level programming interface to describe distance-related algorithms in a unified manner. Unlike the API-based programming interface used in the TOP framework [Topframework], DDSL is built on C-like language and provides more flexibility in low-level control and performance tuning support, which is crucial for FPGA accelerator design.

Specifically, DDSL utilizes several constructs to describe the basic components (Definition, Operation, and Control) of the distance-related algorithms, and also identify the potential parallelism and pipeline opportunities during the design time. We detail these constructs in the following part of this section.

III-A Data Construct

The data construct is a basic Definition Construct. It uses the DVar and DSet primitives to name a variable or a dataset, and the DType primitive to specify the type of the defined variable. The data construct serves as the basis for the AccD compiler to understand the algorithm description input, such as the data points used in the distance-related algorithms. An example of data constructs is shown in the code below, where we define a variable and a dataset using the DDSL data construct.

/* Define a single variable */
DVar [varName] DType [Optional_Initial_Value];
/* Define the matrix of dataset */
DSet [setName] DType [size] [dim];

In most distance-related algorithms, the dataset can be defined as the source set and the target set. For example, in K-means, the source set is the set of data points, and the target set is the set of clusters. Currently, AccD supports several data types including int (32-bit), float (32-bit), double (64-bit) based on the users’ requests, algorithm performance, and accuracy trade-offs.

III-B Distance Compute Construct

Distance computation is the core Operation Construct for distance-related algorithms, which measures the exact distance between two different data points. This construct requires several fields, including data dimensionality, distance metrics, and weight matrix (if weighted distance is specified).

AccD_Comp_Dist(Input p1, Input p2, Output disMat, Output idMat, Dim dim, Met mtr, Weg mat)

Parameter   Description
p1, p2      Input data matrices.
disMat      Output distance matrix.
idMat       Output ID matrix.
dim         Dimensionality of the input data points.
mtr         Distance metric (Weighted | Unweighted).
mat         Weight matrix, used for weighted distance.
TABLE I: Distance Compute Construct Parameters.

III-C Distance Selection Construct

Distance selection construct is an Operation Construct for distance value selection and it returns the Top-K smallest or largest distances and their corresponding points ID number from the provided distance and ID list. This construct helps AccD compiler to understand the distances of users’ interests.

AccD_Dist_Select(Input distMat, Input idMat, Output TopKMat, Range ran, Scope scp)

Parameter   Description
TopKMat     Top-K ID matrix.
ran         Scalar value of K (e.g., K-means, KNN) or distance threshold (e.g., N-body Simulation).
scp         Top-K (smallest | largest) values.
TABLE II: Distance Selection Construct Parameters.
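To make the Top-K semantics concrete, the following is a minimal Python sketch of what a selection over one row of the distance and ID matrices could compute. The function name `dist_select` and its argument names are illustrative assumptions, not part of DDSL, and only the Top-K case is covered (not the distance-threshold case used by N-body Simulation).

```python
import heapq

def dist_select(dist_row, id_row, k, scope="smallest"):
    """Return the IDs paired with the k smallest (or largest) distances in one row."""
    pairs = zip(dist_row, id_row)
    if scope == "smallest":
        best = heapq.nsmallest(k, pairs)   # compares by distance first
    else:
        best = heapq.nlargest(k, pairs)
    return [point_id for _, point_id in best]

# Hypothetical row: distances from one source point to four target points.
dists = [4.0, 1.5, 3.2, 0.7]
ids = [10, 11, 12, 13]
top2 = dist_select(dists, ids, 2)
```

Applying the same function to every row of the distance matrix would yield one Top-K ID list per source point.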

III-D Data Update Construct

Data update construct is an Operation Construct for updating the data points based on the results from the prior constructs. For example, K-means updates the cluster centers by averaging the positions of the points inside. This construct requires the variable to be updated and additional information to finish this update, such as the point-to-cluster distances. The status of this data update will be returned after the completion of all its inside operations. The status variable is to tell whether the data update makes a difference or not.

AccD_Update(Update var, Input p1, ..., Input pm, Status s)

Parameter    Description
var          Input data/dataset to be updated.
p1, ..., pm  Additional information used in the update.
s            Status of the update operation.
TABLE III: Data Update Construct Parameters.

III-E Iteration Construct

Iteration construct is a top-level Control Construct. It is used to describe the distance-related algorithms that require iteration, such as K-means. Iteration construct requires users to provide either the maximum number of iteration or other exit condition.

AccD_Iter(maxIterNum|exitCond){
    subConstruct ;
    subConstruct ;
    ...
    subConstruct ;
}

III-F Example: K-means

To show the expressiveness of DDSL, we take K-means as an example. From the code shown below, with no more than 20 lines of code, DDSL can capture the key components of user-defined K-means algorithm, which is essential for AccD compiler to generate designs for CPU-FPGA platforms.

DVar K int 10;
DVar D int 20;
DVar psize int 1400;
DVar csize int 200;
DSet pSet float psize D;
DSet cSet float csize D;
DSet distMat float psize csize;
DSet idMat int psize csize;
DSet pkMat int psize K;
AccD_Iter(S){
    S = false;
    /* Compute the inter-dataset distances */
    AccD_Comp_Dist(pSet, cSet, distMat, idMat, D, "Unweighted L1", 0);
    /* Select the distances of interests */
    AccD_Dist_Select(distMat, idMat, pkMat, K, "smallest");
    /* Update the cluster center */
    AccD_Update(cSet, pSet, pkMat, S);
}
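For readers less familiar with the construct-based style, the sketch below mirrors the three steps of the DDSL program (distance computation, selection, update) as a plain Python reference. It is an illustration under assumptions, not the code AccD generates: the helper names are hypothetical, squared Euclidean distance is used, and each point is simply assigned to its single nearest center.

```python
def comp_dist(p_set, c_set):
    """Squared Euclidean distance from every point to every center."""
    return [[sum((a - b) ** 2 for a, b in zip(p, c)) for c in c_set] for p in p_set]

def dist_select(dist_mat):
    """Index of the nearest center for each point (Top-1 smallest)."""
    return [min(range(len(row)), key=row.__getitem__) for row in dist_mat]

def update(c_set, p_set, assign):
    """Move each center to the mean of its points; report whether anything moved."""
    changed = False
    for j in range(len(c_set)):
        members = [p for p, a in zip(p_set, assign) if a == j]
        if not members:
            continue  # keep an empty cluster where it is
        mean = [sum(col) / len(members) for col in zip(*members)]
        if mean != c_set[j]:
            c_set[j] = mean
            changed = True
    return changed

def kmeans(p_set, c_set, max_iter=100):
    assign = []
    for _ in range(max_iter):
        assign = dist_select(comp_dist(p_set, c_set))
        if not update(c_set, p_set, assign):
            break  # the status flag reports no change, so iteration stops
    return c_set, assign

points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
centers, labels = kmeans(points, [[1.0, 1.0], [9.0, 1.0]])
```

The loop condition plays the role of the status variable S in the DDSL listing: iteration continues only while the update step reports a change.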

IV Algorithm Optimization

This section explains a novel TI optimization tailored for CPU-FPGA platforms. TI has been used for optimizing distance-related problems, but mostly on sequential processing systems. Our design features an innovative way of applying TI to obtain low-overhead distance bounds for eliminating unnecessary distance computations while maintaining the computation regularity to ease hardware acceleration on FPGAs.

IV-A TI in Distance-related Algorithm

As a simple but powerful mathematical concept, TI has been used to optimize distance-related algorithms. Figure 2a gives an illustration. It states that $d(A, B) \le d(A, Ref) + d(Ref, B)$, where $d(A, B)$ represents the distance between points $A$ and $B$ under some metric (e.g., Euclidean distance). The assistant point $Ref$ is a landmark point for reference. Directly from the definition, we can compute both the lower bound ($lb(A, B) = |d(A, Ref) - d(B, Ref)|$) and the upper bound ($ub(A, B) = d(A, Ref) + d(B, Ref)$) of the distance between two points $A$ and $B$. This is the standard and most common usage of TI for deriving distance bounds.

In general, bounds can be used as a substitute for the exact distances in distance-related data analysis. Take N-body simulation as an example. It requires finding the target points that are within $R$ (the radius) of each given query point. Suppose the lower bound on the distance between a point and a query point is greater than $R$; then we are 100% confident that the point is not within $R$ of the query point, and there is no need to compute the exact distance between the two. Otherwise, the exact distance computation is still carried out for direct comparison. While many previous studies [elkan2003using, ding2015yinyang, lin2012k, MakingKMeansFaster, KNNAdaptiveBound] succeed in directly porting the above point-based TI to optimize distance-related algorithms, they usually suffer from memory overhead and computation irregularity, which result in inferior performance.
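The filtering logic above can be sketched in a few lines of Python. This is a minimal illustration under assumptions: a single landmark, Euclidean distance, and hypothetical function names. In practice the point-to-landmark distances would be computed once and reused rather than recomputed inside the loop.

```python
import math

def ti_bounds(d_a_ref, d_b_ref):
    """Lower and upper bounds on d(A, B) from the distances of A and B to one landmark."""
    return abs(d_a_ref - d_b_ref), d_a_ref + d_b_ref

def within_radius(query, targets, landmark, radius):
    """Indices of targets within `radius` of `query`, plus the number of exact distances computed."""
    d_q = math.dist(query, landmark)
    hits, computed = [], 0
    for i, t in enumerate(targets):
        lb, ub = ti_bounds(d_q, math.dist(t, landmark))
        if lb > radius:
            continue             # certainly outside: no exact distance needed
        if ub <= radius:
            hits.append(i)       # certainly inside: no exact distance needed
            continue
        computed += 1            # bounds are inconclusive: fall back to the exact distance
        if math.dist(query, t) <= radius:
            hits.append(i)
    return hits, computed

targets = [(1.0, 0.5), (10.0, 0.0), (2.0, 2.0)]
hits, computed = within_radius((0.0, 0.0), targets, landmark=(1.0, 0.0), radius=2.0)
```

In this toy input only one of the three exact distances has to be computed; the other two are resolved by the bounds alone.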

Fig. 2: TI Optimization.

IV-B Generalized Triangle Inequality (GTI)

AccD uses a novel Generalized TI (GTI) to remove redundant distance computations. It generalizes the traditional point-based TI while significantly reducing the overhead of bound computations. The traditional point-based TI focuses on tighter bounds (closer to the exact distance) to remove more distance computations, but it induces extra bound computations, which can become the new performance bottleneck even after many distance calculations have been removed. In contrast, GTI strikes a good balance between distance computation elimination and bound computation overhead. In particular, AccD highlights GTI from three perspectives: Two-landmark bound computation, Trace-based bound computation, and Group-level bound computation.

Two-landmark Bound Computation

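The Two-landmark scheme aims at reducing bound computation through effective distance reuse. In this case, the distance bound between two points is measured through two landmarks as the reference points. As illustrated in Figure 2b, the distance bound between points $A$ and $B$ can be computed based on $d(A, A_{ref})$, $d(A_{ref}, B_{ref})$, and $d(B, B_{ref})$ through Equation 1, where $A_{ref}$ and $B_{ref}$ are the landmark points for points $A$ and $B$, correspondingly.

$$lb(A, B) = d(A_{ref}, B_{ref}) - d(A, A_{ref}) - d(B, B_{ref}), \qquad ub(A, B) = d(A_{ref}, B_{ref}) + d(A, A_{ref}) + d(B, B_{ref}) \tag{1}$$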

One representative application scenario of Two-landmark bound computation is KNN-join, where two disjoint sets of landmarks are selected for the query and target point sets. In this case, far fewer bound computations are required than in the one-landmark case (shown in Figure 2a). This can be validated through a simple calculation. Assume that in KNN-join we have $n$ query points, $m$ target points, $z_n$ query landmarks, and $z_m$ target landmarks, with $z_n \ll n$ and $z_m \ll m$ in general. Then the Two-landmark case needs $z_n \times z_m$ bound computations, which is much smaller than the One-landmark bound computation ($n \times z_m$ or $m \times z_n$).
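As a worked illustration, the sketch below applies the triangle inequality twice to bound the distance between two points through their respective landmarks, and then compares the number of bound computations in the two cases. The function name, the sample coordinates, and the set sizes are assumptions chosen only for the example.

```python
import math

def two_landmark_bounds(d_a_aref, d_aref_bref, d_b_bref):
    """Bounds on d(A, B) from A-to-its-landmark, landmark-to-landmark, and B-to-its-landmark."""
    lb = max(0.0, d_aref_bref - d_a_aref - d_b_bref)
    ub = d_aref_bref + d_a_aref + d_b_bref
    return lb, ub

a, a_ref = (0.0, 0.0), (1.0, 0.0)
b, b_ref = (10.0, 0.0), (9.0, 0.0)
lb, ub = two_landmark_bounds(math.dist(a, a_ref),
                             math.dist(a_ref, b_ref),
                             math.dist(b, b_ref))
exact = math.dist(a, b)

# Bound-computation counts for made-up sizes: landmark pairs vs. point-landmark pairs.
n_query, n_target, z_query, z_target = 10_000, 10_000, 100, 100
two_landmark_cost = z_query * z_target
one_landmark_cost = n_query * z_target
```

With these sizes the landmark-to-landmark table has 10,000 entries, while the one-landmark scheme needs 1,000,000 point-to-landmark bounds.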

Trace-based Bound Computation

Trace-based bound computation finds its strength in iterative distance algorithms with point updates, since it can largely reduce the bound computation overhead over many iterations. The key to Trace-based bound computation is selecting appropriate landmark points as references. For example, in K-means, only the target points (clusters) change their positions across iterations. Therefore, we can choose the positions of the clusters from the last iteration as the landmarks for bound computation in the current iteration, since these "old" cluster positions can be close enough to the current positions to offer a "tight" bound. This process is illustrated in Figure 2c, where the distance bound can be calculated based on $d(A, B)$ and $d(B, B')$, where $B'$ is the new point position while $B$ is the old point position from the last iteration.

In addition, Trace-based bound computation can also work collaboratively with the Two-landmark case. For example, in N-body simulation, the source and target points are essentially the same dataset and get updated across iterations. We can choose the "old" position of each point from the last iteration as the landmark for the bound computation at the current iteration, due to its closeness to the current point position. This case is shown in Figure 2d, where $A$ and $B$ are the "old" point positions from the last iteration, and $A'$ and $B'$ are the new source and target points. Then, based on $d(A, A')$, $d(A, B)$, and $d(B, B')$, the new distance bound between $A'$ and $B'$ can be easily derived by using the old points $A$ and $B$ as the reference points. The cost of this is as low as $O(n)$, where $n$ is the number of particles, since each point only needs to maintain the shift distance between its new position and its old position from the last iteration (note: we only compute the inter-point distances at the first iteration). In contrast, applying Two-landmark alone, without the effective temporal reuse of the old point positions, results in a complexity of at least $O(n \times z)$, where $z$ is the number of landmarks, because the distances between each point and all the landmarks have to be computed so that each point knows its new closest landmark before the bound computation is applied.
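The temporal reuse described above amounts to widening an old distance by how far each endpoint has moved. A minimal Python sketch, assuming Euclidean distance and hypothetical names, is given below; it needs only one drift value per point and per iteration.

```python
import math

def refreshed_bounds(d_old, drift_a, drift_b):
    """Bounds on the new distance from the old distance and each point's drift since the last iteration."""
    lb = max(0.0, d_old - drift_a - drift_b)
    ub = d_old + drift_a + drift_b
    return lb, ub

old_a, old_b = (0.0, 0.0), (5.0, 0.0)     # positions in the last iteration
new_a, new_b = (0.0, 1.0), (5.0, -1.0)    # positions in the current iteration
lb, ub = refreshed_bounds(math.dist(old_a, old_b),
                          math.dist(old_a, new_a),
                          math.dist(old_b, new_b))
exact = math.dist(new_a, new_b)
```

Here the bounds are 3.0 and 7.0, and the exact new distance (about 5.39) falls between them, so a radius test with a threshold below 3.0 could skip the exact computation.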

Group-level Bound Computations

Group-level bound computation aims at reducing the bound computation overhead while maintaining the computation regularity. It can be combined with the two aforementioned bound cases as a hybrid bound computation. In the combination with the Two-landmark case, as shown in Figure 2e, the points in each group ($G_A$ and $G_B$) share the same landmark ($A_{ref}$ and $B_{ref}$) as the reference point. Then, based on $d(A_{ref}, B_{ref})$, $d(A_{max}, A_{ref})$, and $d(B_{max}, B_{ref})$, we can get the group-level bound through Equation 2, where $d(A_{max}, A_{ref})$ and $d(B_{max}, B_{ref})$ give the distance between the farthest point within each group and its group reference point.

$$lb(G_A, G_B) = d(A_{ref}, B_{ref}) - d(A_{max}, A_{ref}) - d(B_{max}, B_{ref}), \qquad ub(G_A, G_B) = d(A_{ref}, B_{ref}) + d(A_{max}, A_{ref}) + d(B_{max}, B_{ref}) \tag{2}$$

In the combination with the Trace-based case, it will generate a hierarchical bound as a hybrid solution, which includes point-group bound and point-point bound computation. As exemplified in Figure 2f, each group regards its old group center as the landmark for reference, and each point relies on its old position as the landmark for reference. Then based on , , and , and the old distance , and , where and are the point groups, and point is the closest point of point in the last iteration. We can calculate and based on Equation 3,

(3)

where . If we have , it is impossible that the points inside the group and can become the closest point of in the current iteration. Therefore, the distance computation between the point and all points inside these groups can be safely avoided.

In addition to saving distance computations, group-level bound computation offers two further benefits that facilitate the underlying hardware acceleration. First, the computation regularity of the remaining distance computations becomes higher compared with point-level bound computation, since the points inside each group share the same computation pattern, which facilitates parallelization. For example, point-level bound computation usually results in a large divergence of distance computation among different points, as shown in Figure 3a, which severely hinders parallelization and pipelining. However, in group-level bound computation, the points inside the same source group always maintain the same groups of target points for distance computation, as shown in Figure 3b.

Fig. 3: Bound Computation at (a) Point-level, (b) Group-level.

Second, group-level bound computation reduces memory overhead. Assume we have $n$ source points, $m$ target points, $z_n$ source groups, and $z_m$ target groups. The memory overhead of maintaining distance bounds is $O(n \times m)$ in the point-level bound computation case. However, in the group-level bound computation case, we only have to maintain distance bounds among groups, and the memory overhead is $O(z_n \times z_m)$, where $z_n \ll n$ and $z_m \ll m$. Therefore, in terms of memory efficiency, group-level bound computation can outperform point-level bound computation to a great extent.
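The following Python sketch illustrates one way a group-level filter could be built: each group is summarized by a reference point and the distance to its farthest member, and a target group is kept as a candidate for a source group only if the group-to-group lower bound does not exceed a threshold. The data layout, the function names, and the threshold-based use case are assumptions made for illustration.

```python
import math

def group_radius(center, members):
    """Distance from the group's reference point to its farthest member."""
    return max(math.dist(center, p) for p in members)

def group_lower_bound(c_s, r_s, c_t, r_t):
    """Lower bound on the distance between any point of one group and any point of the other."""
    return max(0.0, math.dist(c_s, c_t) - r_s - r_t)

def candidate_groups(src_groups, tgt_groups, radius):
    """Map each source group ID to the target group IDs that may hold points within `radius`."""
    table = {}
    for sid, (c_s, pts_s) in src_groups.items():
        r_s = group_radius(c_s, pts_s)
        table[sid] = [tid for tid, (c_t, pts_t) in tgt_groups.items()
                      if group_lower_bound(c_s, r_s, c_t, group_radius(c_t, pts_t)) <= radius]
    return table

src = {0: ((0.0, 0.0), [(0.0, 1.0), (1.0, 0.0)])}
tgt = {0: ((2.0, 0.0), [(2.0, 1.0)]),
       1: ((20.0, 0.0), [(21.0, 0.0)])}
table = candidate_groups(src, tgt, radius=3.0)
```

Because the table is indexed by group rather than by point, every point in a source group works on the same candidate list, and the number of stored bounds is the product of the group counts rather than of the point counts.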

V Hardware Acceleration

AccD design is built on the CPU-FPGA architecture, which highlights its significant performance and energy efficiency, and has been widely adopted as the modern data center solution for high-performance computing and acceleration. The host-side application of AccD design is responsible for data grouping and distance computation filtering, which consists of complex operations and execution dependency, but lacks pipeline and parallelism. On the other hand, the FPGA-side of AccD design is built for accelerating the distance computations, which are composed of simple and vectorizable operations.

While the FPGA accelerator offers high computation capability, the memory bandwidth bottleneck constrains the overall design performance. Therefore, optimizing data placement and memory architecture is the key to improving memory performance. In addition, the OpenCL-based programming model adds a layer of architectural complexity to kernel design and management, which is also critical to the design performance. The AccD framework distinguishes itself by using a novel memory and kernel optimization strategy that is tailored for TI-optimized distance-related algorithms to benefit CPU-FPGA designs.

V-A Memory Optimization

After applying the GTI optimization to remove the redundant distance computation, each source point group will have different target groups as candidates for distance computation, as shown in Figure 4a, where Source-grp is ID of the source group, and Target-grp is ID of the target group. However, this would raise two concerns about performance degradation.

Fig. 4: (a) Non-optimized inter-group memory access; (b) Optimized inter-group memory access. (Columns in each panel: Source-grp, Target-grp.)

The first issue is inter-group memory irregularity and low data reuse. For example, the target group information required by one source group often cannot be reused by the next source group, because the latter requires a quite different set of target groups for distance computation, so additional costly memory accesses have to be carried out. To tackle this problem, AccD places source groups in contiguous memory space whenever they have the same set of target groups as candidates for distance computation, which maximizes memory access efficiency. An example is shown in Figure 4b, where two source groups are placed side by side in memory since they have the same list of target groups, which takes advantage of temporal locality without issuing another memory access.

The second issue is intra-group memory irregularity. For example, points from groups 1, 2, and 3 are interleaved in memory, as shown in Figure 5b. However, the points of a group are usually accessed together due to GTI optimization. This causes frequent inefficient memory accesses for fetching individual points scattered across non-contiguous memory addresses. To solve this issue, AccD offers a second memory optimization that re-organizes the target/source points inside the same target/source group into contiguous memory space within the same memory bank, as illustrated in Figure 5c. This strategy largely benefits memory coalescing and external memory bandwidth while minimizing access contention, since points inside the same bank can be accessed efficiently and points inside different banks can be accessed in parallel.

Fig. 5: (a) Group-point mapping; (b) Non-aligned intra-group memory; (c) Aligned intra-group memory. (Columns in panel (a): Group, Points.)
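Both layout ideas can be sketched on the host side in Python. The sketch below is an assumption-laden illustration rather than AccD's implementation: `layout_by_group` packs the members of each group contiguously and records where each group starts, and `order_source_groups` places source groups with identical candidate target lists next to each other. Memory banks are not modeled.

```python
def layout_by_group(points, group_ids):
    """Reorder points so that members of the same group are contiguous; return start offsets too."""
    order = sorted(range(len(points)), key=lambda i: group_ids[i])  # stable sort keeps input order within a group
    offsets = {}
    for pos, i in enumerate(order):
        offsets.setdefault(group_ids[i], pos)   # first position of each group in the packed array
    return [points[i] for i in order], offsets

def order_source_groups(candidates):
    """Order source groups so that groups sharing the same candidate target list are adjacent."""
    return sorted(candidates, key=lambda sid: tuple(sorted(candidates[sid])))

points = ["p0", "p1", "p2", "p3", "p4", "p5"]
group_ids = [1, 2, 1, 3, 2, 1]                  # hypothetical interleaved grouping
packed, offsets = layout_by_group(points, group_ids)

candidates = {1: [1, 2, 3], 2: [4, 5, 6], 3: [1, 2, 3]}
group_order = order_source_groups(candidates)
```

After packing, a whole group can be fetched as one contiguous range starting at its offset, and source groups 1 and 3 end up adjacent because they share the same candidate list.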

V-B Distance Computation Kernel

Distance computation dominates the time complexity of distance-related algorithms. In AccD, after TI filtering on the CPU, the remaining distance computations are accelerated on the FPGA. The points involved in the remaining distance computations are organized into two sets, the source set and the target set, which can be represented as two matrices, $S$ ($m \times d$) and $T$ ($n \times d$), respectively, where each row of these matrices represents a point with dimension $d$. The distance computation between $S$ and $T$ can be decomposed into three parts, as shown in Equation 4,

$$dist^2(S_i, T_j) = RSS(S)_i - 2\,(S T^{\top})_{ij} + RSS(T)_j \tag{4}$$

where $RSS(\cdot)$ denotes the row-wise square sum. Computing $RSS(S)$ or $RSS(T)$ only takes $O(m \times d)$ and $O(n \times d)$, while $S T^{\top}$ takes $O(m \times n \times d)$, which dominates the overall computation complexity. AccD accelerates $S T^{\top}$ through highly efficient matrix-matrix multiplication, which benefits the hardware implementation on the FPGA.
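The decomposition can be checked with a small pure-Python sketch: the squared distance between every source row and every target row is assembled from the two row-wise square sums and one matrix product. Function names and the sample matrices are illustrative assumptions; an actual kernel would run the product in parallel on the FPGA.

```python
def rss(mat):
    """Row-wise square sum: one value per point."""
    return [sum(x * x for x in row) for row in mat]

def matmul_t(s, t):
    """S times T-transpose: dot product of every source row with every target row."""
    return [[sum(a * b for a, b in zip(srow, trow)) for trow in t] for srow in s]

def sq_dist_matrix(s, t):
    """Squared Euclidean distances assembled as RSS(S) - 2 * S T^T + RSS(T)."""
    rss_s, rss_t, cross = rss(s), rss(t), matmul_t(s, t)
    return [[rss_s[i] - 2.0 * cross[i][j] + rss_t[j] for j in range(len(t))]
            for i in range(len(s))]

S = [[0.0, 0.0], [1.0, 2.0]]
T = [[3.0, 4.0], [1.0, 0.0], [1.0, 2.0]]
D = sq_dist_matrix(S, T)
```

The two `rss` calls touch each coordinate once, whereas `matmul_t` performs one dot product per source-target pair, which is why the matrix product is the part worth offloading.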

Fig. 6: AccD Matrix-based Distance Computation.

The overall computation process is illustrated in Figure 6. The Row-wise Square Sum (RSS) of the source set and of the target set is pre-computed in a fully parallel manner. The vector multiplication between each source point and each target point is mapped to an OpenCL kernel thread for fine-grained parallelization. Moreover, a block of threads, highlighted by the red square box in Figure 6, forms a kernel thread workgroup, which can share a part of the source and target points to increase on-chip data locality. Based on this kernel organization, the AccD hardware architecture offers several tunable hyperparameters for trading off performance against resources: the size of the kernel block, the number of parallel pipelines in each kernel block, etc. To efficiently find the "optimal" parameters that maximize overall performance while respecting the constraints, we use the AccD explorer for design space search, which is detailed in Section VI-B.
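The decomposition into row-wise square sums and a cross term can be sketched in plain Python. This is only a sequential illustration of the arithmetic, not the OpenCL kernel: the function names `row_square_sum` and `pairwise_sq_dist` and the sample matrices are assumptions, and the sketch returns squared Euclidean distances.

```python
def row_square_sum(matrix):
    """Row-wise Square Sum (RSS): one scalar per point."""
    return [sum(x * x for x in row) for row in matrix]

def pairwise_sq_dist(source, target):
    """Squared distances via RSS(source) - 2 * cross term + RSS(target)."""
    rss_s = row_square_sum(source)   # one pass over the source points
    rss_t = row_square_sum(target)   # one pass over the target points
    dist = []
    for i, s in enumerate(source):
        row = []
        for j, t in enumerate(target):
            # Cross term: the dot product that a matrix multiply would produce.
            cross = sum(a * b for a, b in zip(s, t))
            # Clamp at zero to absorb small negative rounding errors.
            row.append(max(rss_s[i] - 2.0 * cross + rss_t[j], 0.0))
        dist.append(row)
    return dist

# Hypothetical sample data.
source = [[0.0, 0.0], [1.0, 2.0]]
target = [[3.0, 4.0], [1.0, 2.0], [0.0, 1.0]]
d2 = pairwise_sq_dist(source, target)
```

The two RSS vectors are computed once and reused for every pair, so only the cross term scales with the product of the two set sizes, which is why that term is the natural candidate for hardware acceleration.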

VI AccD Compiler

In this section, we detail the AccD compiler in two aspects: design parameters and constraints, and design space exploration.

VI-A Design Parameters and Constraints

AccD uses a parameterized design strategy for better design flexibility and efficiency. It takes design parameters and constraints from both the algorithm and the hardware to explore and locate the "optimal" design point tailored for the specific application scenario. At the algorithm level, the number of groups affects the performance of distance computation filtering. At the hardware level, there are three parameters: 1) Size of computation block, which decides the size of the data shared by a group of computing elements; 2) SIMD factor, which decides the number of computing elements inside each computation block; 3) Unroll factor, which sets the degree of parallelization within each single distance computation. In addition, there are several hardware constraints, such as the on-chip memory size, the number of logic units, and the number of registers. All of these parameters and constraints are included in our analytical model for design exploration.

VI-B Design Space Exploration

Finding the best combination of design configurations (a set of hyper-parameters) under the given constraints requires non-trivial effort in the design space search. Therefore, we incorporate an AccD explorer in our compiler framework for efficient design space exploration (Figure 7). The AccD explorer takes a set of raw configurations (hyper-parameters) as the initial input and generates the optimal configuration as the output through several iterations of the design configuration optimization process. In particular, the AccD explorer consists of three major phases: Configuration Generation and Selection, Performance and Resource Modeling, and Constraints Validation.

Fig. 7: AccD Explorer.

Configuration Generation and Selection

The functionality of this phase depends on its input. There are two kinds of inputs. If the input is the set of initial configurations, this phase directly feeds these configurations to the modeling phase for performance and resource evaluation. If the input is the result of the constraints validation in the last iteration, this phase uses a genetic algorithm to cross over the "premium" configurations kept from the last iteration and generates a new set of configurations for the modeling phase.
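The crossover step can be sketched as follows. The source does not specify the genetic operators, so the uniform crossover used here, the parameter names in `PARAMS`, and the sample configurations are assumptions chosen only to make the idea concrete.

```python
import random

# Hypothetical hardware parameters of one design configuration.
PARAMS = ("block_size", "simd", "unroll")

def crossover(parent_a, parent_b, rng):
    """Uniform crossover: each parameter is inherited from one of the parents."""
    return {p: rng.choice((parent_a[p], parent_b[p])) for p in PARAMS}

def next_generation(kept, population_size, rng):
    """Build a new configuration set from the configurations kept last round."""
    children = []
    while len(children) < population_size:
        a, b = rng.sample(kept, 2)   # pick two distinct parents
        children.append(crossover(a, b, rng))
    return children

rng = random.Random(0)               # fixed seed for repeatability
kept = [
    {"block_size": 16, "simd": 4, "unroll": 2},
    {"block_size": 32, "simd": 8, "unroll": 4},
    {"block_size": 64, "simd": 2, "unroll": 8},
]
children = next_generation(kept, 6, rng)
```

Every child mixes parameter values that already survived validation, so the search stays close to feasible regions of the design space. A practical explorer would probably add mutation as well to avoid losing diversity, which this sketch omits.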

Performance Modeling

Performance modeling measures the design latency and bandwidth requirement based on the input design configurations. We formulate the design latency by using Equation 5,

(5)

where and are the time of the GTI filtering process and remaining distance computations, respectively. And they can be calculated as Equation 6,

(6)

where and are the number of groups for source and target points, respectively; and are the number of points inside source and target set, respectively; is the data dimensionality; is the number of grouping iteration; is the size of computation kernel block; is the FPGA design clock frequency; is the distance computation unroll factor; is the number of parallel worker threads inside each computation block; is the distance saving ratio through GTI filtering (Equation 7),

(7)

where the remaining parameter is the density of the point distribution. The formula also indicates that increasing the number of iterations and the number of points inside each group improves GTI filtering performance, whereas increasing the point distribution density (i.e., points are closer to each other) decreases it.

To get the required bandwidth of the current design, we leverage Equation 8,

(8)

where the data width can be either 32 bits (int and float) or 64 bits (double).

Resource Modeling

Directly measuring the hardware resource usage of the accelerator design from a high-level algorithm description is challenging because of the hidden transformations and optimizations in the hardware design suite. AccD therefore uses a micro-benchmark-based methodology to estimate hardware resource usage through analytical modeling. The major hardware resource consumption of AccD comes from the distance computation kernel, which depends on several design factors, including the kernel block size, the number of SIMD workers, etc.

In the AccD resource analytical model, the design factors are classified into two categories: dataset-dependent and dataset-independent factors. The main idea behind AccD resource modeling is to obtain exact hardware resource consumption statistics by micro-benchmarking hardware designs with different dataset-independent factors. For example, we can benchmark a single distance computation kernel with different computation block sizes to get its resource statistics, since this factor is dataset-independent and can be decided before knowing the dataset details. To estimate the resource consumption for datasets with different sizes and dimensionalities, AccD uses a formula-based approach to estimate the overall hardware resource consumption (Equation 9), which combines online information (e.g., kernel organization and dataset properties) and offline information (e.g., micro-benchmark statistics).

(9)

where the types of can be on-chip memory, computing units, and logic operation units; is the estimated overall usage of a certain type resource for the overall design; is the usage of a certain type of resource for only one distance computation kernel block.
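The combination of offline micro-benchmark statistics with online information can be illustrated with a small sketch. It does not reproduce Equation 9: the linear scaling by a block count, the table `MICROBENCH`, its resource names, and all numbers are assumptions made purely for illustration.

```python
# Hypothetical offline statistics: resource usage of ONE kernel block,
# measured once per block size (a dataset-independent factor).
MICROBENCH = {
    16: {"m20k": 12, "dsp": 16, "alm": 2100},
    32: {"m20k": 40, "dsp": 32, "alm": 3900},
}

def estimate_resources(block_size, num_blocks):
    """Scale the per-block usage by an online factor (assumed linear here)."""
    per_block = MICROBENCH[block_size]
    return {res: usage * num_blocks for res, usage in per_block.items()}

estimate = estimate_resources(32, 4)
```

The point of the split is that the expensive measurements in `MICROBENCH` are taken once, while the cheap scaling step can be repeated for every candidate configuration during exploration.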

Constraints Validation

Constraints validation is the third phase of the AccD explorer. It checks whether the resource consumption of a given configuration is within the budget of the given hardware platform. The inputs of this phase are the resource estimation results from the resource modeling step. The design constraint inequalities are listed in Equation 10 and cover the size of on-chip memory, the bandwidth of data communication between external memory and on-chip memory, the number of computing units, and the number of logic units:

(10)

The constraints validation phase also discards the configurations that cannot meet the design performance and constraints, and only keeps the "well-performed" configurations for further optimization in the next iteration. It also records the modeling results of the best configuration from the last iteration, which are used to terminate the optimization process once the difference between the modeling results of two consecutive iterations falls below a predefined threshold. This strategy helps to avoid unnecessary time cost. After the AccD explorer terminates, the "best" configuration with maximum design performance under the given constraints is output as the "optimal" solution for the AccD design.
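The validation and termination logic can be sketched as a small loop. The functions `validate` and `explore`, the toy cost model, and the budget values are assumptions for illustration; the real explorer relies on the analytical models of Equations 5 to 10, which are not reproduced here.

```python
def validate(configs, model, budget):
    """Keep configurations whose estimated usage fits every budget entry."""
    feasible = []
    for cfg in configs:
        latency, usage = model(cfg)
        if all(usage[k] <= budget[k] for k in budget):
            feasible.append((latency, cfg))
    feasible.sort(key=lambda item: item[0])   # lowest latency first
    return feasible

def explore(initial, model, budget, evolve, threshold=1e-3, max_iters=50):
    """Iterate until the best latency improves by less than `threshold`."""
    configs, best_latency, best_cfg = initial, float("inf"), None
    for _ in range(max_iters):
        feasible = validate(configs, model, budget)
        if not feasible:
            break
        latency, cfg = feasible[0]
        improvement = best_latency - latency
        if latency < best_latency:
            best_latency, best_cfg = latency, cfg
        if improvement < threshold:           # converged: stop early
            break
        configs = evolve([c for _, c in feasible])
    return best_cfg, best_latency

# Toy model (hypothetical): more SIMD lanes lower latency but cost more DSPs.
toy_model = lambda cfg: (1000.0 / cfg["simd"], {"dsp": 16 * cfg["simd"]})
toy_evolve = lambda kept: [{"simd": c["simd"] * 2} for c in kept]
best_cfg, best_latency = explore(
    [{"simd": 1}, {"simd": 2}], toy_model, {"dsp": 100}, toy_evolve
)
```

In this toy run the search stops once doubling the SIMD factor would exceed the DSP budget and no feasible configuration improves on the recorded best. `toy_evolve` stands in for the genetic step of the configuration generation phase.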

VII Evaluation

In this section, we choose three representative benchmarks (K-means, KNN-join, and N-body Simulation) and evaluate their corresponding AccD designs on the CPU-FPGA platform.

K-means

K-means [LloydKMeans, dataclustering50, efficientKmeans, coates2012learning, ray1999determination] clusters a set of points into several groups in an iterative manner. At each iteration, it first computes the distances between each point and all clusters, and then updates the clusters based on the average position of the points they contain. We choose it as a benchmark since it shows the benefits of the AccD hierarchy (Trace-based + Group-level) bound computation optimization on iterative algorithms with disjoint source and target sets.
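One iteration of the standard (unoptimized) K-means procedure described above can be sketched as follows. This is the textbook assign-then-update step, without any TI or GTI filtering; the function names and sample data are assumptions for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_step(points, centers):
    """One iteration: assign each point to its closest center, then update."""
    assignment = [
        min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
        for p in points
    ]
    new_centers = []
    for c, old in enumerate(centers):
        members = [p for p, a in zip(points, assignment) if a == c]
        if not members:              # keep an empty cluster where it was
            new_centers.append(old)
            continue
        new_centers.append(tuple(
            sum(m[k] for m in members) / len(members) for k in range(len(old))
        ))
    return assignment, new_centers

# Hypothetical sample data: two well-separated clusters.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers = [(1.0, 1.0), (9.0, 9.0)]
assignment, centers = kmeans_step(points, centers)
```

The assignment step evaluates every point against every center, which is the set of distance computations that bound-based filtering aims to reduce.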

KNN-join

KNN-join Search [altman1992introduction, KNNJoinsHybridApproach, KNNJoinsDataStreams] finds the Top-K nearest neighbor points for each point in the source set from the target set. It first computes the distances between each source point and all the target points. Then it ranks the K-smallest distances for each source point and gets its corresponding closest Top-K target points. KNN-join can help to demonstrate the effectiveness of AccD hybrid (Two-landmark + Group-level) bound computation optimization on non-iterative algorithms.
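The brute-force KNN-join described above can be sketched with the standard library. The function name `knn_join` and the sample points are assumptions for illustration; the sketch ranks by squared Euclidean distance, which gives the same ordering as the distance itself.

```python
import heapq

def knn_join(source, target, k):
    """For each source point, return the indices of its k closest target points."""
    result = []
    for s in source:
        # (squared distance, target index) pairs for this source point.
        scored = (
            (sum((a - b) ** 2 for a, b in zip(s, t)), j)
            for j, t in enumerate(target)
        )
        result.append([j for _, j in heapq.nsmallest(k, scored)])
    return result

# Hypothetical sample data.
source = [(0.0, 0.0), (5.0, 5.0)]
target = [(1.0, 0.0), (4.0, 4.0), (0.0, 3.0), (6.0, 5.0)]
neighbors = knn_join(source, target, 2)
```

`heapq.nsmallest` keeps only the k best candidates while scanning, so the ranking step does not need to sort all distances for each source point.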

N-body Simulation

N-body Simulation [nylons2007fast, ida1992n] mimics particle movement within a certain range of 3D space. At each time step, the distances between each particle and its neighbors (within a given radius) are first computed, and then the acceleration and the new position of each particle are updated based on these distances. While N-body simulation is also iterative, it differs from the K-means algorithm in several ways: 1) N-body simulation uses the same dataset (particles) for the source and target sets, whereas K-means operates on different source (point) and target (cluster) sets; 2) all points in the N-body simulation change their positions over time, whereas in K-means only the target set (clusters) changes position during the center update; 3) N-body simulation has source and target sets of the same size, whereas the K-means target set (clusters) is in general much smaller than the source set (points). N-body simulation helps us show the strength of the AccD hybrid bound computation (Two-landmark + Trace-based + Group-level) on iterative algorithms with the same source and target set.

VII-A Experiment Setup

Tools and Metrics

In our evaluation, we use the Intel Stratix 10 DE10-Pro [DE10-Pro] as the FPGA accelerator and run the host-side software on an Intel Xeon Silver 4110 processor [intelXeon] (8 cores, 16 threads, 2.1 GHz base clock frequency, 85 W TDP). The DE10-Pro FPGA has 378,000 logic elements (LEs), 128,160 adaptive logic modules (ALMs), 512,640 ALM registers, 648 DSPs, and 1,537 M20K memory blocks. We implement the AccD design on the DE10-Pro using the Intel Quartus Prime Software Suite [IntelQuartus] with the Intel OpenCL SDK included. To measure the system power consumption (in watts) accurately, we use an off-the-shelf Poniie PN2000 as the external power meter to get the runtime power of the Xeon CPU and the DE10-Pro FPGA.

| Name | Techniques | Description |
|---|---|---|
| Baseline | Standard algorithm without any optimization, CPU. | Naive for-loop based implementation on CPU. |
| TOP | Point-based triangle-inequality optimized algorithms, CPU. | TOP [Topframework] optimized distance-related algorithm running on CPU. |
| CBLAS | CBLAS library accelerated algorithms, CPU. | Standard distance-related algorithm with CBLAS [openblas] acceleration. |
| AccD | Algorithmic-hardware co-design, CPU-FPGA platform. | GTI filtering and FPGA acceleration of distance computations. |

TABLE IV: Implementation Description.
| K-means: Dataset | Size | Dimension | #Cluster | KNN-join: Dataset | Dimension | #Source | N-body Simulation: Dataset | #Particle |
|---|---|---|---|---|---|---|---|---|
| Poker Hand | 25,010 | 11 | 158 | Harddrive1 | 64 | 68,411 | P-1 | 16,384 |
| Smartwatch Sens | 58,371 | 12 | 242 | Kegg Net Directed | 24 | 53,413 | P-2 | 32,768 |
| Healthy Older People | 75,128 | 9 | 274 | 3D Spatial Network | 3 | 434,874 | P-3 | 59,049 |
| KDD Cup 2004 | 285,409 | 74 | 534 | KDD Cup 1998 | 56 | 95,413 | P-4 | 78,125 |
| Kegg Net Undirected | 65,554 | 28 | 256 | Skin NonSkin | 4 | 245,057 | P-5 | 177,147 |
| Ipums | 70,187 | 60 | 265 | Protein | 11 | 26,611 | P-6 | 262,144 |

TABLE V: Datasets for Evaluation.
Fig. 8: Performance Comparison (TOP, CBLAS, AccD): (a) K-means; (b) KNN-Join; (c) N-body Simulation. Note: Speedup is normalized w.r.t. Baseline.

Fig. 9: Energy Efficiency Comparison (TOP, CBLAS, AccD): (a) K-means; (b) KNN-Join; (c) N-body Simulation. Note: Energy Efficiency is normalized w.r.t. Baseline.

Implementations

The CPU-based implementations consist of three types of programs: the naive sequential for-loop implementation without any optimization (selected as our Baseline to normalize speedup and energy efficiency), the algorithm optimized by the TOP framework [Topframework], and the algorithm optimized by the CBLAS computing library [openblas]. Note that a TOP + CBLAS implementation is not included in our evaluation. After TOP's point-based TI filtering is applied, each point in the source set has its own distinct list of target points for distance computation, whereas CBLAS requires uniformity in the distance computations. It is therefore challenging to combine the TOP and CBLAS optimizations.

Dataset

In the evaluation, we use six datasets for each algorithm. The selected datasets cover a wide spectrum of mainstream datasets, including datasets from the UCI Machine Learning Repository [UCI-dataset] and datasets used in previous papers [Topframework, ding2015yinyang, chen2017sweet] in the related domains. Details of these datasets are listed in Table V. Note that the KNN-join algorithm finds the Top-1000 closest neighbors of each query point.

VII-B Comparison with Software Implementation

Performance Comparison

Figure 8 shows the speedups of TOP, CBLAS, and AccD over Baseline across all algorithm and dataset settings. The AccD design consistently maintains the highest speedup among these implementations. This is largely due to AccD's GTI optimization, which reduces distance computations, and its efficient hardware acceleration of the remaining distance computations on the FPGA.

We also observe that the TOP implementation shows its strength on large datasets, for example on 3D Spatial Network (434,874 source points) in KNN-join. The reason is that the fine-grained point-based TI optimization of TOP can remove most (more than 90%) of the unnecessary distance computations, which benefits the overall performance to a great extent. Note that the intrinsic point distribution of the dataset also affects the filtering performance of TOP, but in general a larger dataset lets TOP spot and remove more redundant computations.

We also notice that the CBLAS implementation performs best on datasets with relatively high dimensionality. For example, on KDD Cup 2004 (74 dimensions) in the K-means algorithm, CBLAS achieves a speedup over Baseline that is higher than on the other K-means datasets. This is because, on high-dimensional datasets, the CBLAS implementation benefits more from parallelized computing and more regular memory access, whereas in low-dimensional settings the same optimization yields only a minor speedup.

Our AccD design achieves a considerable speedup on datasets with large size and high dimensionality. For example, on KDD Cup 2004 and Ipums in K-means, AccD achieves speedups over Baseline that are also significantly higher than those of both the TOP and CBLAS implementations. This conclusion also extends to KNN-join, for instance on KDD Cup 1998. The reason is that the AccD design effectively reconciles the benefits of the GTI optimization and the FPGA acceleration: the former reduces distance computations at the algorithm level, and the latter boosts performance through hardware acceleration. More importantly, the AccD design balances these two benefits to maximize overall performance.

Energy Comparison

The energy efficiency of the AccD design is also significant. For example, on the K-means algorithm, AccD designs deliver on average better energy efficiency than Baseline, and significantly higher energy efficiency than the TOP and CBLAS implementations. There are two main reasons behind these results: 1) Much lower power consumption. Across all algorithm and dataset settings, the AccD CPU-FPGA design consumes less power than the Intel Xeon CPU does when running the TOP and CBLAS implementations; 2) Considerable performance. The AccD design achieves a much better speedup than TOP and CBLAS, which contributes to the overall energy efficiency of the design.

Among these implementations, the CBLAS implementation has the lowest energy efficiency, since it relies on the multi-core parallel processing capability of the CPU, which improves performance at the cost of much higher power consumption. TOP only uses the single-core processing capability of the CPU and achieves moderate performance with effective distance computation reduction, which results in lower power consumption and higher energy efficiency compared with Baseline. Unlike the TOP and CBLAS implementations, the AccD design is built upon a low-power platform with considerable performance, which gives a far better energy-performance trade-off.
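The interplay between speedup and power in a normalized energy-efficiency figure can be made concrete with a short sketch. It assumes that energy efficiency means work per unit energy and that energy is average power multiplied by runtime; the source does not state its exact definition. The function name and all numbers are hypothetical and are not measurements from the paper.

```python
def normalized_energy_efficiency(speedup, baseline_watts, design_watts):
    """Energy efficiency relative to the baseline.

    With energy = power * time, the ratio of baseline energy to design
    energy equals speedup * (baseline power / design power).
    """
    return speedup * baseline_watts / design_watts

# Hypothetical example: a design that is 10x faster at half the power.
ratio = normalized_energy_efficiency(speedup=10.0, baseline_watts=80.0, design_watts=40.0)
```

Under this reading, a design gains energy efficiency from two independent factors, a shorter runtime and a lower power draw, which matches the two reasons listed above.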

VII-C Performance Benefits Analysis

To analyze the performance benefits of the AccD CPU-FPGA design in detail, we use K-means as the example algorithm. Specifically, we build four implementations for comparison: 1) TOP K-means on CPU; 2) TOP K-means on the CPU-FPGA platform; 3) AccD K-means on CPU; 4) AccD K-means on the CPU-FPGA platform. Note that TOP K-means is designed for sequential CPUs, and there is no publicly available TOP implementation for CPU-FPGA platforms. For a fair comparison, we implement TOP K-means on the CPU-FPGA platform with memory optimizations (inter-group and intra-group memory optimization) and distance computation kernel optimization (Vector-Matrix multiplication). These optimizations improve data reuse and memory access performance.

Fig. 10: AccD Performance Benefits Breakdown.

We compute the normalized speedup performance of each implementation w.r.t the naive for-loop based K-means implementation on CPU.

As shown in Figure 10, AccD K-means on the CPU-FPGA platform always delivers the best overall speedup among these implementations. We also observe that TOP K-means achieves a speedup on the CPU; however, directly porting this optimization to the CPU-FPGA platform can even lead to inferior performance. Even though we add several possible optimizations, applying the fine-grained TI optimization from TOP still causes a large divergence of computation among points, leading to low data reuse and inefficient memory access.

We also notice that the AccD design on the CPU achieves a lower average speedup than TOP on the CPU, since its coarse-grained GTI optimization identifies fewer unnecessary distance computations. However, when the AccD design is combined with the CPU-FPGA platform, the benefits of the GTI optimization become prominent, since it maintains computation regularity while reducing memory overhead, which facilitates hardware acceleration on the FPGA. In contrast, applying an optimization that maximizes algorithm-level benefits while ignoring hardware-level properties results in poor performance, as the TOP (CPU-FPGA) implementation shows. Moreover, the comparison of AccD (CPU) and AccD (CPU-FPGA) also demonstrates the effectiveness of using the FPGA as the hardware accelerator, which delivers additional speedup compared with the software-only solution.

VIII Conclusion

In this paper, we present the AccD compiler framework for accelerating distance-related algorithms on CPU-FPGA platforms. Specifically, AccD provides a simple but expressive language construct (DDSL) to unify distance-related algorithms, and an optimizing compiler that improves design performance from both the algorithmic and the hardware perspective, systematically and automatically. Experiments on three popular algorithms (K-means, KNN-join, and N-body simulation) demonstrate that AccD is a powerful and comprehensive framework for hardware acceleration of distance-related algorithms on modern CPU-FPGA platforms.

References