I Introduction
Distance-related algorithms (e.g., K-means [LloydKMeans], KNN [altman1992introduction], and N-body simulation [NBodysimulation]) play a vital role in many domains, including machine learning, computational physics, etc. However, these algorithms often come with high computational complexity, leading to poor performance and limited applicability. To improve their performance, FPGA-based acceleration has gained significant interest from both industry and academia, given its high performance and energy efficiency. However, accelerating distance-related algorithms on FPGAs requires non-trivial effort, including hardware expertise as well as time and monetary cost. While existing works try to ease this process, they inevitably fall short in one of the following aspects.
Reliance on problem-specific design and optimization, missing effective generalization. There is no unified abstraction to systematically formalize the definition and optimization of distance-related algorithms. Most previous hardware designs and optimizations [KMeansMicroarray, lin2012k, kdtreeKMeanscolorimage, KNNfpgahls] are heavily coded for a specific algorithm (e.g., K-means) and cannot be shared across different distance-related algorithms. Moreover, these "hard-coded" strategies may also fail to keep up with ever-changing upper-level algorithmic optimizations and underlying hardware settings, resulting in a large redesign and reimplementation cost as the design evolves.
Lack of algorithm-hardware co-design. Previous algorithmic [elkan2003using, ding2015yinyang] and hardware optimizations [lin2012k, kdtreeKMeanscolorimage, KMeansMicroarray, multicoreKMeans, KNNfpgahls] are usually applied separately rather than combined collaboratively. Existing algorithmic optimizations, most of which are based on the Triangle Inequality (TI) [elkan2003using, ding2015yinyang, Topframework, chen2017sweet], are crafted for sequential CPUs. Despite removing a large number of distance computations, they also incur high computation irregularity and memory overhead. Therefore, directly applying these algorithmic optimizations to massively parallel platforms without appropriate hardware-aware adaptation can lead to inferior performance.
Counting on FPGAs as the only source of acceleration. Previous works [ParallelArchitecturesKNN, IPcoresKNN, ParameterizedKMeans, Lavenier00fpgaimplementation, KMeansMicroarray, KNNfpgahls] place the whole algorithm on the FPGA accelerator without considering assistance from the computing resources on the host CPU. As a result, their designs are usually limited by the on-chip memory and computing elements and cannot fully exploit the power of the FPGA. Moreover, they miss the full performance benefits of the heterogeneous computing paradigm, such as using the CPU for complex logic and control operations while offloading compute-intensive tasks to the FPGA.
Lack of a well-structured design workflow. Previous works [ParallelArchitecturesKNN, ParameterizedKMeans, kdtreeKMeanscolorimage, lin2012k, KNNfpgahls] follow the traditional hardware implementation flow and require intensive user involvement in hardware design, implementation, and extra manual tuning, which usually takes long development-to-validation cycles. Also, the problem-specific strategy leads to a case-by-case design process that cannot be widely applied to different problem settings.
To this end, we present a compiler-based optimization framework, AccD, to automatically accelerate distance-related algorithms on CPU-FPGA platforms (shown in Figure 1). First, AccD provides a Distance-related Domain-Specific Language (DDSL) as a problem-independent abstraction to unify the description and optimization of various distance-related algorithms. With the assistance of the DDSL, end users can easily create highly efficient CPU-FPGA designs by focusing only on the high-level problem specification, without touching the algorithmic optimization or hardware implementation.
Second, AccD offers a novel algorithmic-hardware co-optimization scheme to reconcile the acceleration from both sides. At the algorithmic level, AccD incorporates a novel Generalized Triangle Inequality (GTI) optimization to eliminate unnecessary distance computations while largely maintaining computation regularity. At the hardware level, AccD employs a specialized data layout to enforce memory coalescing and an optimized distance computation kernel to accelerate the distance computations on the FPGA.
Third, AccD leverages both the host and accelerator sides of the CPU-FPGA heterogeneous system for acceleration. In particular, AccD assigns the algorithm-level optimizations (e.g., data grouping and distance computation filtering), which consist of complex operations and execution dependencies but lack pipelining and parallelism, to the CPU. On the other hand, AccD maps the hardware-level acceleration (e.g., distance computations), which is composed of simple and vectorizable operations, to the FPGA. Such a mapping capitalizes on the strength of the CPU in managing control-intensive tasks and the advantage of the FPGA in accelerating compute-intensive workloads.
Lastly, the AccD compiler integrates an intelligent Design Space Explorer (DSE) to pinpoint the "optimal" design for different problem settings. In general, there is no "one size fits all" solution: the best configuration for algorithmic and hardware optimization differs across distance-related algorithms and even across different inputs of the same algorithm. To produce a high-quality optimization configuration automatically and efficiently, the DSE combines design modeling (performance and resource) with a Genetic Algorithm to facilitate the design space search.
Overall, our contributions are:

We propose the first optimization framework that can automatically optimize and generate high-performance and power-efficient designs of distance-related algorithms on CPU-FPGA heterogeneous computing platforms.

We develop a domain-specific language, DDSL, to unify different distance-related algorithms in an effective and succinct manner, laying the foundation for general optimizations across different problems.

We build an optimizing compiler for the DDSL, which automatically reconciles the benefits of algorithmic optimization on the CPU and hardware acceleration on the FPGA.

Extensive experiments on several popular algorithms across a wide spectrum of datasets show that AccD-generated CPU-FPGA designs achieve notable speedup and better energy efficiency on average compared with standard CPU-based implementations.
II Related Work
Previous research accelerates distance-related algorithms from two aspects: algorithmic optimization and hardware acceleration, which we discuss in the following subsections.
II-A Algorithmic Optimization
From the algorithmic standpoint, previous research highlights two optimizations. The first is KD-tree based optimization [KDTreeKMeans, efficientKmeans, KNNJoinsDataStreams, 5952342, Zhong:2013:GEI:2505515.2505749], which stores points in special data structures to enable nearest-neighbor search without computing distances to all target points. These methods often deliver performance improvements [KDTreeKMeans, efficientKmeans, KNNJoinsDataStreams, 5952342, Zhong:2013:GEI:2505515.2505749] over unoptimized versions in low-dimensional spaces, but suffer serious performance degradation on large, high-dimensional datasets due to their exponentially increasing memory and computation overhead.
The second is TI-based optimization [elkan2003using, ding2015yinyang, Topframework, chen2017sweet], which replaces computation-expensive distance computations with cheaper bound computations and demonstrates flexibility and scalability. It not only reduces computation complexity at different levels of granularity but is also more adaptive and robust to datasets with a wide range of sizes and dimensions. However, most existing works focus on one specific algorithm (e.g., KNN [chen2017sweet] or K-means [elkan2003using, ding2015yinyang]) and lack extensibility and generality across different distance-related problems. An exception is a recent work, TOP [Topframework], which builds a unified framework to optimize various distance-related problems with pure TI optimization on CPUs. Our work shares a similar high-level motivation but targets a more challenging scenario: algorithmic and hardware co-optimization on CPU-FPGA platforms.
II-B Hardware Acceleration
From the hardware perspective, several FPGA accelerator designs have been proposed, but they still suffer from some major limitations.
First, previous FPGA designs are generally built for a specific distance-related algorithm and hardware target. For example, works in [KMeansMicroarray, kdtreeKMeanscolorimage, lin2012k] target K-means FPGA acceleration, while [KNNfpgahls, IPcoresKNN, ParallelArchitecturesKNN] focus on KNN. Moreover, previous designs [lin2012k, KMeansMicroarray] usually assume that the dataset fits entirely into the FPGA on-chip memory, and they are evaluated only on a limited number of small datasets; for example, in [lin2012k], K-means acceleration is evaluated on a microarray dataset with only 2,905 points. These designs often encounter portability issues when transferred to different settings. Besides, such "hard-coded" designs and optimizations make fair comparisons among different designs difficult, which hampers future studies in this direction.
The second problem with previous works is that they fail to incorporate algorithmic optimizations into the hardware design. For example, works in [KMeansMicroarray, ParallelArchitecturesKNN, kdtreeKMeanscolorimage, KNNfpgahls] directly port the standard K-means and KNN algorithms to the FPGA and apply only hardware-level optimization. One exception is a recent work [KPynq], which combines TI optimization and FPGA acceleration for K-means. It achieves a considerable speedup over state-of-the-art methods, showcasing the great opportunity of algorithm-hardware co-optimization. Nevertheless, this idea is far from well explored, possibly because effectively combining the two requires domain knowledge and expertise in both the algorithm and the hardware.
In addition, previous works largely follow the traditional hardware design flow, which requires a long implementation cycle and huge manual effort. For example, works in [ParameterizedKMeans, KMeansMicroarray, multicoreKMeans, kdtreeKMeanscolorimage, ICSICT2016, Lavenier00fpgaimplementation, adaptiveKNNPartialReconfiguration, adaptiveKNN] build designs with a VHDL/Verilog design flow, which requires hardware expertise and months of arduous development. In contrast, our AccD design flow brings significant advantages in programmability and flexibility thanks to its high-level OpenCL-based programming model, which minimizes user involvement in the tedious hardware design process.
III Distance-related Algorithm Domain-Specific Language (DDSL)
Distance-related algorithms share commonalities across different application domains and scenarios, even though their high-level algorithmic descriptions look different. Therefore, it is possible to generalize these distance-related algorithms. The AccD framework defines a DDSL that provides a high-level programming interface to describe distance-related algorithms in a unified manner. Unlike the API-based programming interface used in the TOP framework [Topframework], the DDSL is built on a C-like language and provides more flexibility in low-level control and performance tuning, which is crucial for FPGA accelerator design.
Specifically, the DDSL uses several constructs to describe the basic components (Definition, Operation, and Control) of distance-related algorithms and to identify potential parallelism and pipelining opportunities at design time. We detail these constructs in the remainder of this section.
III-A Data Construct
The data construct is the basic Definition Construct. It uses the DSet primitive to name a data variable and the DType primitive to declare the type of the defined variable. The data construct serves as the basis for the AccD compiler to understand the input algorithm description, such as the data points used in the distance-related algorithms. An example is given below, where a variable and a dataset are defined using the DDSL data construct.
In most distance-related algorithms, the dataset can be split into a source set and a target set. For example, in K-means, the source set is the set of data points and the target set is the set of clusters. Currently, AccD supports several data types, including int (32-bit), float (32-bit), and double (64-bit), depending on the user's requests and the algorithm's performance and accuracy tradeoffs.
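The data-construct listing itself does not survive in this copy; the sketch below is a hypothetical reconstruction assuming a C-like syntax around the DSet and DType primitives named in the text (everything beyond DSet/DType is our invention, not the published DDSL grammar):

```
// Hypothetical DDSL sketch (only DSet and DType come from the text;
// the surrounding syntax is illustrative).
DSet points;                     // source set: the input data points
DSet centers;                    // target set: the cluster centers
DType(points)  = float;          // 32-bit floating-point coordinates
DType(centers) = float;
```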
III-B Distance Compute Construct
Distance computation is the core Operation Construct of distance-related algorithms; it measures the exact distance between two data points. This construct requires several fields, including the data dimensionality, the distance metric, and a weight matrix (if a weighted distance is specified).
p1, p2: Input data matrices.
disMat: Output distance matrix.
idMat: Output ID matrix.
dim: Dimensionality of each input data point.
mtr: Distance metric (Weighted | Unweighted).
mat: Weight matrix, used for weighted distance.
III-C Distance Selection Construct
The distance selection construct is an Operation Construct for distance value selection; it returns the Top-K smallest or largest distances and the corresponding point ID numbers from the provided distance and ID lists. This construct helps the AccD compiler understand which distances the user is interested in.
TopKMat: Output Top-K ID matrix.
ran: Scalar value of K (e.g., K-means, KNN) or a distance threshold (e.g., N-body simulation).
scp: Selection scope: Top-K (smallest | largest) values.
III-D Data Update Construct
The data update construct is an Operation Construct for updating data points based on the results of prior constructs. For example, K-means updates the cluster centers by averaging the positions of the points assigned to each cluster. This construct requires the variable to be updated and any additional information needed to complete the update, such as the point-to-cluster distances. The status of the data update is returned after all of its internal operations complete; the status variable tells whether the update made a difference or not.
upVar: Input data/dataset to be updated.
p1, ..., pm: Additional information used in the update.
status: Status of the update operation.
III-E Iteration Construct
The iteration construct is a top-level Control Construct. It describes distance-related algorithms that require iteration, such as K-means. The iteration construct requires users to provide either a maximum number of iterations or another exit condition.
III-F Example: K-means
To show the expressiveness of the DDSL, we take K-means as an example. As the code below shows, with no more than 20 lines the DDSL captures the key components of a user-defined K-means algorithm, which is all the AccD compiler needs to generate designs for CPU-FPGA platforms.
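The K-means listing does not survive in this copy; the sketch below is our hypothetical reconstruction from the construct names in Sections III-A through III-E and their documented fields. The concrete keywords and call syntax are illustrative, not the published grammar:

```
// Hypothetical DDSL sketch of K-means (construct and field names follow
// Sections III-A to III-E; the surrounding syntax is illustrative).
DSet points;   DType(points)  = float;
DSet centers;  DType(centers) = float;

Iterate (maxIter = 100, exit = !status) {
    // Distance Compute: exact point-to-center distances
    DistCompute(points, centers, dim, mtr = Unweighted) -> (disMat, idMat);
    // Distance Selection: the single closest center per point (K = 1)
    DistSelect(disMat, idMat, ran = 1, scp = smallest) -> TopKMat;
    // Data Update: move each center to the mean of its assigned points
    status = DataUpdate(centers, TopKMat, points);
}
```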
IV Algorithm Optimization
This section explains a novel TI optimization tailored for CPU-FPGA platforms. TI has been used to optimize distance-related problems, but mostly on sequential processing systems. Our design features an innovative way of applying TI to obtain low-overhead distance bounds that eliminate unnecessary distance computations while maintaining computation regularity, which eases hardware acceleration on FPGAs.
IV-A TI in Distance-related Algorithms
As a simple but powerful mathematical concept, TI has long been used to optimize distance-related algorithms. Figure 2a gives an illustration. TI states that d(A, B) ≤ d(A, C) + d(C, B), where d(X, Y) represents the distance between points X and Y under some metric (e.g., Euclidean distance). The assistant point C is a landmark point used for reference. Directly from the definition, we can compute both a lower bound, |d(A, C) − d(B, C)| ≤ d(A, B), and an upper bound, d(A, B) ≤ d(A, C) + d(C, B), on the distance between two points A and B. This is the standard and most common usage of TI for deriving distance bounds.
In general, bounds can substitute for exact distances in distance-related data analysis. Take N-body simulation as an example: it requires finding the target points that lie within a given radius r of each query point. If the lower bound on d(A, B) already exceeds r, we are 100% confident that source point A is not within r of query point B, so there is no need to compute the exact distance between A and B. Otherwise, the exact distance computation is still carried out for direct comparison. While many previous works [elkan2003using, ding2015yinyang, lin2012k, MakingKMeansFaster, KNNAdaptiveBound] have successfully ported the above point-based TI directly to distance-related algorithms, they usually suffer from memory overhead and computation irregularity, which result in inferior performance.
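To make the pruning rule above concrete, here is a minimal Python sketch (our illustration, not the paper's implementation) of a range search that uses a single landmark: whenever the TI lower bound already exceeds the radius, the exact distance computation is skipped.

```python
import math
import random

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def range_search_with_ti(queries, targets, landmark, radius):
    """For each query, find targets within `radius`, using one-landmark
    TI lower bounds to skip exact distance computations where possible."""
    d_t = [dist(t, landmark) for t in targets]   # precomputed once
    results, skipped = [], 0
    for q in queries:
        d_q = dist(q, landmark)
        hits = []
        for t, dt in zip(targets, d_t):
            lower = abs(d_q - dt)        # TI lower bound on dist(q, t)
            if lower > radius:           # bound alone proves t is out of range
                skipped += 1
                continue
            if dist(q, t) <= radius:     # fall back to the exact distance
                hits.append(t)
        results.append(hits)
    return results, skipped
```

The result is identical to a brute-force search; only the number of exact distance evaluations changes.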
IV-B Generalized Triangle Inequality (GTI)
AccD uses a novel Generalized TI (GTI) to remove redundant distance computations. It generalizes the traditional point-based TI while significantly reducing the overhead of bound computations. Traditional point-based TI pursues tighter bounds (closer to the exact distance) to remove more distance computations, but it induces extra bound computations, which can become the new performance bottleneck even after many distance calculations are removed. In contrast, GTI strikes a good balance between distance computation elimination and bound computation overhead. In particular, AccD highlights GTI from three perspectives: two-landmark bound computation, trace-based bound computation, and group-level bound computation.
Two-landmark Bound Computation
The two-landmark scheme aims at reducing bound computation through effective distance reuse. In this case, the distance bound between two points is measured through two landmarks serving as reference points. As illustrated in Figure 2b, the distance bound between two points can be computed from each point's distance to its own landmark and the distance between the two landmarks through Equation 1.
(1)  
One representative application scenario of two-landmark bound computation is KNN-join, where two disjoint sets of landmarks are selected for the query and the target point sets. In this case, far fewer bound computations are required than in the one-landmark case (shown in Figure 2a). This can be validated through a simple calculation: in KNN-join the numbers of query and target landmarks are generally much smaller than the numbers of query and target points, so the two-landmark case only needs each point's distance to its own landmark plus the landmark-to-landmark distances, which is much smaller than the number of bound computations in the one-landmark case.
Trace-based Bound Computation
Trace-based bound computation finds its strength in iterative distance algorithms with point updates, since it can largely reduce the bound computation overhead across iterations. The key to trace-based bound computation is selecting appropriate landmark points as references. For example, in K-means only the target points (cluster centers) change positions across iterations; therefore, we can choose the cluster positions from the last iteration as the landmarks for bound computation in the current iteration, since these "old" cluster positions are close enough to the current positions to offer tight bounds. This process is illustrated in Figure 2c, where the distance bound is calculated from the old distance and the distance the point has moved, with the old position from the last iteration serving as the landmark for the new position.
In addition, trace-based bound computation can also work collaboratively with the two-landmark case. For example, in N-body simulation, the source and target points are essentially the same dataset and are updated across iterations. We can choose each point's "old" position from the last iteration as its landmark for bound computation in the current iteration, owing to its closeness to the current point position. This case is illustrated in Figure 2d, where the old source and target positions from the last iteration serve as the reference points for the new source and target points: from the old inter-point distance and each point's shift away from its old position, the new distance bound between the two points can easily be derived. The cost of this is as low as O(N), where N is the number of particles, since each point only needs to maintain the shift between its new position and its old position from the last iteration (note: the inter-point distances are computed only at the first iteration). In contrast, applying only the two-landmark scheme without this effective temporal reuse of old point positions results in a complexity of at least O(N·K), where K is the number of landmarks, since the distance between each point and all landmarks must then be computed so that each point can find its new closest landmark before applying the bound computation.
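The drift-based update at the heart of this scheme can be sketched as follows (our Python illustration, not the paper's code): knowing last iteration's exact distance and how far each point has moved since, the new distance is bounded without being recomputed.

```python
import math
import random

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def drifted_bounds(d_old, drift_a, drift_b):
    """d(A', B') lies within d(A, B) plus/minus the distances A and B have
    moved since last iteration (two applications of the triangle inequality)."""
    return max(d_old - drift_a - drift_b, 0.0), d_old + drift_a + drift_b
```

Each point only tracks its own drift, which is where the O(N) per-iteration cost comes from.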
Group-level Bound Computation
Group-level bound computation aims at reducing the bound computation overhead while maintaining computation regularity. It can be combined with the two aforementioned bound schemes to form hybrid bound computations. In combination with the two-landmark case, as shown in Figure 2e, the points in each group share their group's landmark as the reference point. Then, from the landmark-to-landmark distance and each group's radius, we obtain the group-level bound through Equation 2, where each radius is the distance between the farthest point within a group and the group's reference point.
(2)  
In combination with the trace-based case, GTI generates a hierarchical bound as a hybrid solution, which includes point-group and point-point bound computations. As exemplified in Figure 2f, each group regards its old group center as its landmark for reference, and each point relies on its old position as its landmark for reference. Then, from the group and point shifts since the last iteration together with the relevant old distances, which involve the point groups and the closest point of the query point in the last iteration, we can calculate the point-group and point-point bounds through Equation 3,
(3)  
where the point-point bound acts as the pruning threshold. If a group's lower bound exceeds this threshold, it is impossible for any point inside that group to become the closest point of the query point in the current iteration. Therefore, the distance computations between the query point and all points inside such groups can be safely avoided.
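As an illustration of the group-level pruning idea (a Python sketch in our own notation, not the paper's code), a single landmark-to-landmark distance minus the two group radii lower-bounds every cross-group distance at once:

```python
import math
import random

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def group_lower_bound(lm_s, lm_t, pts_s, pts_t):
    """One shared lower bound on dist(s, t) for every s in pts_s and
    t in pts_t: if it exceeds the search threshold, all
    len(pts_s) * len(pts_t) exact distances can be skipped together."""
    r_s = max(dist(p, lm_s) for p in pts_s)   # group radius around landmark
    r_t = max(dist(p, lm_t) for p in pts_t)
    return max(dist(lm_s, lm_t) - r_s - r_t, 0.0)
```

One bound per group pair replaces one bound per point pair, which is also where the regularity and memory savings discussed next come from.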
In addition to saving distance computations, group-level bound computation offers two further benefits that facilitate the underlying hardware acceleration. First, the computation regularity of the remaining distance computations is higher than with point-level bound computation, since points inside each group share commonality in computation, which facilitates parallelization. For example, point-level bound computation usually results in large divergence of distance computations among different points, as shown in Figure 3a, which is a killer of parallelization and pipelining. In group-level bound computation, however, points inside the same source group always share the same set of target groups for distance computation, as shown in Figure 3b.
Second, group-level bound computation reduces memory overhead. Assume n source points, m target points, g_s source groups, and g_t target groups. The memory overhead of maintaining distance bounds is O(n·m) in the point-level case. In the group-level case, however, we only have to maintain distance bounds among groups, for an overhead of O(g_s·g_t), where g_s ≪ n and g_t ≪ m. Therefore, in terms of memory efficiency, group-level bound computation outperforms point-level bound computation to a great extent.
V Hardware Acceleration
The AccD design is built on the CPU-FPGA architecture, which offers significant performance and energy efficiency and has been widely adopted in modern data centers for high-performance computing and acceleration. The host side of an AccD design is responsible for data grouping and distance computation filtering, which consist of complex operations and execution dependencies but lack pipelining and parallelism. On the other hand, the FPGA side of an AccD design accelerates the distance computations, which are composed of simple and vectorizable operations.
While the FPGA accelerator features high computation capability, the memory bandwidth bottleneck constrains the overall design performance. Therefore, optimizing data placement and memory architecture is key to improving memory performance. In addition, the OpenCL-based programming model adds a layer of architectural complexity to kernel design and management, which is also critical to design performance. The AccD framework distinguishes itself by using a novel memory and kernel optimization strategy tailored for TI-optimized distance-related algorithms on CPU-FPGA designs.
V-A Memory Optimization
After applying the GTI optimization to remove redundant distance computations, each source point group is left with a different set of target groups as candidates for distance computation, as shown in Figure 4a, where Source-grp is the ID of the source group and Target-grp is the ID of the target group. However, this raises two performance concerns.
The first issue is inter-group memory irregularity and low data reuse. For example, the target group information required by one source group cannot be reused by the next source group when the latter requires a quite different set of target groups for distance computation; additional costly memory accesses then have to be carried out. To tackle this problem, AccD places source groups in contiguous memory space to maximize memory access efficiency, provided these source groups have the same set of candidate target groups for distance computation. An example is shown in Figure 4b, where two source groups are placed side by side in memory because they share the same list of target groups, taking advantage of memory temporal locality without issuing extra memory accesses.
The second issue is intra-group memory irregularity. For example, points from groups 1, 2, and 3 may occupy memory space at intervals, as shown in Figure 5a. However, a group of points is usually accessed simultaneously due to the GTI optimization. This causes frequent inefficient memory accesses that fetch individual points scattered across discontinuous memory addresses. To solve this issue, AccD offers a second memory optimization that reorganizes the target/source points inside the same target/source group into contiguous memory space within the same memory bank, as illustrated in Figure 5b. This strategy largely benefits memory coalescing and external memory bandwidth while minimizing access contention, since points inside the same bank can be accessed efficiently and points in different banks can be accessed in parallel.
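The intra-group reorganization can be sketched as follows (a simplified Python illustration of the data-layout idea, not AccD's actual memory allocator): points are reordered so that each group occupies one contiguous range, and per-group offsets let a whole group be fetched sequentially.

```python
def regroup_contiguous(points, group_ids):
    """Reorder points so members of the same group occupy a contiguous
    range; returns the reordered list plus (start, end) offsets per group,
    so a whole group can be fetched with one sequential access."""
    order = sorted(range(len(points)), key=lambda i: group_ids[i])
    reordered = [points[i] for i in order]
    offsets = {}
    for idx, i in enumerate(order):
        g = group_ids[i]
        if g not in offsets:
            offsets[g] = [idx, idx + 1]   # first member of this group
        else:
            offsets[g][1] = idx + 1       # extend the group's range
    return reordered, {g: tuple(se) for g, se in offsets.items()}
```

In the real design the contiguous ranges would additionally be distributed across memory banks so different groups can be streamed in parallel.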
V-B Distance Computation Kernel
Distance computation accounts for the major time complexity of distance-related algorithms. In AccD, after TI filtering on the CPU, the remaining distance computations are accelerated on the FPGA. The points involved in the remaining distance computations are organized into two sets, a source set and a target set, stored as two matrices S (n × d) and T (m × d), respectively, where each row represents a point of dimension d. The squared distance between a source point s and a target point t can then be decomposed into three parts, as shown in Equation 4,
(4)  ‖s − t‖² = ‖s‖² − 2 s · tᵀ + ‖t‖²
where the ‖s‖² and ‖t‖² terms over all points take only O(n·d) and O(m·d) time, while the cross term s · tᵀ over all point pairs takes O(n·m·d), which dominates the overall computation complexity. AccD spots an efficient way of accelerating the cross term through highly efficient matrix-matrix multiplication (S · Tᵀ), which benefits the hardware implementation on the FPGA.
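The decomposition in Equation 4 can be checked with a few lines of NumPy (our illustration; AccD implements the cross term as an OpenCL kernel rather than a library call):

```python
import numpy as np

def pairwise_sq_dists(S, T):
    """||s_i - t_j||^2 = ||s_i||^2 - 2 s_i . t_j + ||t_j||^2.
    The row-wise square sums cost only O(n*d) and O(m*d); the cross term
    is a single dense matrix product, the O(n*m*d) part that the text
    maps onto the FPGA kernel."""
    ss = np.sum(S * S, axis=1)        # ||s_i||^2 for every source row
    tt = np.sum(T * T, axis=1)        # ||t_j||^2 for every target row
    return ss[:, None] - 2.0 * (S @ T.T) + tt[None, :]
```

Recasting the dominant term as one matrix product is what makes the blocked, systolic-style kernel organization described next possible.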
The overall computation process is described in Figure 6: the Row-wise Square Sums (RSS) of the source set S and the target set T are precomputed in a fully parallel manner, and the vector multiplication between each source and target point is mapped to an OpenCL kernel thread for fine-grained parallelization. Moreover, a block of threads, highlighted by the red square box in Figure 6, forms a kernel thread workgroup, which shares part of the source and target points to increase on-chip data locality. Based on this kernel organization, the AccD hardware architectural design offers several tunable hyperparameters for performance-resource tradeoffs: the size of the kernel block, the number of parallel pipelines in each kernel block, etc. To efficiently find the "optimal" parameters that maximize overall performance while respecting the constraints, we harness the AccD explorer for efficient design space search, which is detailed in Section VI-B.
VI AccD Compiler
In this section, we detail the AccD compiler in two aspects: design parameters and constraints, and design space exploration.
VI-A Design Parameters and Constraints
AccD uses a parameterized design strategy for better design flexibility and efficiency. It takes the design parameters and constraints of the algorithm and hardware to explore and locate the "optimal" design point for the specific application scenario. At the algorithm level, the number of groups affects the distance computation filtering performance. At the hardware level, there are three parameters: 1) the size of the computation block, which decides the amount of data shared by a group of computing elements; 2) the SIMD factor, which decides the number of computing elements inside each computation block; and 3) the unroll factor, which sets the degree of parallelization within each single distance computation. In addition, there are several hardware constraints, such as the on-chip memory size, the number of logic units, and the number of registers. All of these parameters and constraints are included in our analytical model for design exploration.
VI-B Design Space Exploration
Finding the best combination of design configurations (a set of hyperparameters) under the given constraints requires non-trivial effort in design space search. Therefore, we incorporate an AccD explorer in our compiler framework for efficient design space exploration (Figure 7). The AccD explorer takes a set of raw configurations (hyperparameters) as initial input and produces the optimal configuration as output through several iterations of a design configuration optimization process. In particular, the AccD explorer consists of three major phases: configuration generation and selection, performance and resource modeling, and constraint validation.
Configuration Generation and Selection
The functionality of this phase depends on its input, of which there are two kinds. If the input is the initial set of configurations, this phase directly feeds them to the modeling phase for performance and resource evaluation. If the input is the result of the constraint validation from the last iteration, this phase applies a genetic algorithm to cross over the "premium" configurations kept from the last iteration and generates a new set of configurations for the modeling phase.
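A single generation of such a search might look as follows (a deliberately simplified Python sketch of elitist selection plus single-point crossover; the actual AccD explorer, its fitness models, and its configuration encoding are more involved). Configurations are encoded here as tuples such as (block size, SIMD factor, unroll factor), and the fitness function stands in for the performance/resource models:

```python
import random

def evolve(population, fitness, keep=4, seed=0):
    """One genetic-algorithm generation: keep the `keep` best
    configurations (elitism), then cross over pairs of survivors to
    produce candidates for the next modeling/validation round."""
    rng = random.Random(seed)
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[:keep]
    children = []
    while len(children) < len(population) - keep:
        a, b = rng.sample(parents, 2)
        cut = rng.randrange(1, len(a))
        children.append(a[:cut] + b[cut:])   # single-point crossover
    return parents + children
```

Because the best configurations always survive, the modeled quality of the population is non-decreasing across generations.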
Performance Modeling
Performance modeling measures the design latency and bandwidth requirement based on the input design configurations. We formulate the design latency by using Equation 5,
(5) 
where the two terms are the time of the GTI filtering process and of the remaining distance computations, respectively; they can be calculated as in Equation 6,
(6)  
where the model parameters include: the numbers of groups for the source and target points; the numbers of points in the source and target sets; the data dimensionality; the number of grouping iterations; the size of the computation kernel block; the FPGA design clock frequency; the distance computation unroll factor; the number of parallel worker threads inside each computation block; and the distance saving ratio achieved by GTI filtering (Equation 7),
(7) r = F(I, N_s/G_s, N_t/G_t, ρ)
where ρ is the density of the points distribution. This relationship indicates that increasing the number of grouping iterations or the number of points inside each group improves the effectiveness of GTI filtering. Conversely, an increase in the points distribution density ρ (i.e., points are closer to each other) decreases the GTI filtering performance.
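A latency model of this kind can be sketched as a plain function of the design hyperparameters. The cost expressions below are illustrative assumptions rather than the exact AccD formulas; they only preserve the qualitative trends described above, namely that more filtering (larger r) and more parallelism (larger unroll and worker counts) reduce latency.

```python
def latency_model(Ns, Nt, D, Gs, Gt, I, unroll, workers, freq_hz, r):
    """Illustrative analytical latency model (assumed cost shapes, not
    the exact AccD equations).

    Ns, Nt: points in the source/target set; D: dimensionality;
    Gs, Gt: number of source/target groups; I: grouping iterations;
    r: fraction of distance computations removed by GTI filtering.
    """
    ops_per_cycle = unroll * workers
    # Grouping cost plus group-level bound computation (assumed shape).
    t_gti = I * (Ns * Gs + Nt * Gt) * D / (ops_per_cycle * freq_hz)
    # Remaining point-to-point distances after filtering out a fraction r.
    t_dist = (1.0 - r) * Ns * Nt * D / (ops_per_cycle * freq_hz)
    return t_gti + t_dist
```

Such a function is cheap to evaluate, which is what makes searching thousands of candidate configurations feasible inside the explorer loop.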
To get the required bandwidth of the current design, we leverage Equation 8,
(8) BW_req = w · (N_s + N_t) · D / T_total
where w is the width of a data element, which can be either 32-bit for int and float or 64-bit for double.
Resource Modeling
Directly measuring the hardware resource usage of the accelerator design from a high-level algorithm description is challenging because of the hidden transformations and optimizations in the hardware design suite. Instead, AccD uses a microbenchmark-based methodology to measure the hardware resource usage through analytical modeling. The major hardware resource consumption of AccD comes from the distance computation kernel, which depends on several design factors, including the kernel block size, the number of SIMD workers, etc.
In the AccD resource analytical model, the design factors are classified into two categories: dataset-dependent and dataset-independent factors. The main idea behind AccD resource modeling is to obtain exact hardware resource consumption statistics through microbenchmarks on hardware designs with different dataset-independent factors. For example, we can benchmark a single distance computation kernel with different sizes of the computation block to get its resource statistics, since this factor is dataset-independent and can be decided before knowing the dataset details. To estimate the resource consumption for datasets with different sizes and dimensionalities, AccD leverages a formula-based approach to estimate the overall hardware resource consumption (Equation 9), which combines online information (e.g., kernel organization and dataset properties) and offline information (e.g., microbenchmark statistics),
(9) R_type^total = N_block · R_type^kernel
where type can be on-chip memory, computing units, or logic operation units; R_type^total is the estimated overall usage of a certain type of resource for the whole design; R_type^kernel is the usage of that type of resource for a single distance computation kernel block; and N_block is the number of instantiated kernel blocks.
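A minimal sketch of this microbenchmark-plus-formula approach follows; the per-block numbers are made up and only stand in for real offline microbenchmark statistics.

```python
# Hypothetical offline microbenchmark results: per-kernel-block resource
# usage for each benchmarked block size (numbers are illustrative only).
KERNEL_PROFILE = {
    8:  {"bram_kb": 36, "dsp": 16, "alm": 2100},
    16: {"bram_kb": 68, "dsp": 32, "alm": 4000},
}

def estimate_resources(block_size, num_blocks):
    """Equation-9 style estimate: total usage of each resource type is
    the benchmarked per-block usage scaled by the number of blocks."""
    per_block = KERNEL_PROFILE[block_size]
    return {res: usage * num_blocks for res, usage in per_block.items()}
```

The dataset-dependent part (how many blocks a given dataset size and dimensionality requires) is computed online, while the per-block profile is fixed offline.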
Constraints Validation
Constraints validation is the third phase of the AccD explorer. It checks whether the resource consumption of a given design configuration fits within the budget of the target hardware platform. The input of this phase is the resource estimation produced by the resource modeling step. The design constraint inequalities are listed in Equation 10, covering M_mem (the size of on-chip memory), BW (the bandwidth of data communication between external memory and on-chip memory), N_comp (the number of computing units), and N_logic (the number of logic units):
(10) M_mem^used ≤ M_mem,  BW_req ≤ BW,  N_comp^used ≤ N_comp,  N_logic^used ≤ N_logic
The constraints validation phase discards the configurations that cannot meet the performance and resource constraints, and only keeps the "well-performed" configurations for further optimization in the next iteration. It also records the modeling statistics of the best configuration from the last iteration, which are used to terminate the optimization process once the difference between the modeling results of the best configurations in two consecutive iterations falls below a predefined threshold. This early-termination strategy also avoids unnecessary time cost. After the AccD explorer terminates, the "best" configuration, i.e., the one with the maximum modeled performance under the given constraints, is output as the "optimal" solution for the AccD design.
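The validation and termination logic described above can be sketched as two small predicates. The relative-improvement check is an assumed formulation of the "difference lower than a predefined threshold" criterion.

```python
def validate(estimate, budget):
    """Keep a configuration only if every modeled resource fits the budget.

    estimate/budget: dicts mapping resource type -> amount,
    e.g. {"dsp": 64, "bram_kb": 90}.
    """
    return all(estimate[res] <= budget[res] for res in budget)

def converged(best_prev, best_curr, threshold=0.01):
    """Terminate when the best modeled latency changes by less than
    `threshold` (relative) between two consecutive iterations."""
    if best_prev is None:  # first iteration: nothing to compare against
        return False
    return abs(best_prev - best_curr) / best_prev < threshold
```

Configurations passing `validate` become the "premium" pool fed back to the generation phase; `converged` decides when the explorer stops iterating.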
VII Evaluation
In this section, we choose three representative benchmarks (K-means, KNN-join, and N-body Simulation) and evaluate their corresponding AccD designs on the CPU-FPGA platform.
K-means
K-means [LloydKMeans, dataclustering50, efficientKmeans, coates2012learning, ray1999determination] clusters a set of points into several groups in an iterative manner. At each iteration, it first computes the distances between each point and all cluster centers, and then updates the centers based on the average position of the points assigned to them. We choose it as a benchmark because it shows the benefits of the AccD hierarchical (Trace-based + Group-level) bound computation optimization on iterative algorithms with disjoint source and target sets.
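For reference, one iteration of the standard (unoptimized) K-means loop, i.e., the Baseline-style computation that AccD accelerates, can be written as:

```python
def kmeans_step(points, centers):
    """One K-means iteration: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    assign = []
    for p in points:
        # Squared Euclidean distance from p to every center.
        d = [sum((pi - ci) ** 2 for pi, ci in zip(p, c)) for c in centers]
        assign.append(d.index(min(d)))
    new_centers = []
    for k in range(len(centers)):
        members = [p for p, a in zip(points, assign) if a == k]
        if members:
            new_centers.append([sum(x) / len(members) for x in zip(*members)])
        else:
            new_centers.append(centers[k])  # keep empty clusters in place
    return new_centers, assign
```

The inner distance loop (every point against every center, every iteration) is exactly the computation that GTI filtering prunes and the FPGA kernel accelerates.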
KNN-join
KNN-join search [altman1992introduction, KNNJoinsHybridApproach, KNNJoinsDataStreams] finds the Top-K nearest neighbors for each point in the source set from the target set. It first computes the distances between each source point and all the target points. It then ranks the K smallest distances for each source point to obtain the corresponding Top-K closest target points. KNN-join helps demonstrate the effectiveness of the AccD hybrid (Two-landmark + Group-level) bound computation optimization on non-iterative algorithms.
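The brute-force form of this computation (before any TI filtering) can be sketched as:

```python
import heapq

def knn_join(source, target, k):
    """Brute-force KNN-join: for every source point, return the indices
    of its k nearest target points (by squared Euclidean distance)."""
    result = []
    for s in source:
        dists = [(sum((si - ti) ** 2 for si, ti in zip(s, t)), j)
                 for j, t in enumerate(target)]
        # Rank the k smallest distances and keep the target indices.
        result.append([j for _, j in heapq.nsmallest(k, dists)])
    return result
```

Every source point touches every target point, which is the O(|S|·|T|·D) cost that group-level bounds aim to cut down.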
N-body Simulation
N-body Simulation [nylons2007fast, ida1992n] mimics particle movement within a certain range of 3D space. At each time step, the distances between each particle and its neighbors (within a given radius) are first computed, and then the acceleration and the new position of each particle are updated based on these distances. While N-body simulation is also iterative, it differs from the K-means algorithm in several ways: 1) N-body simulation uses the same dataset (particles) as both source and target set, whereas K-means operates on different source (point) and target (cluster) sets; 2) all points in the N-body simulation change their positions over time, whereas in K-means only the target set (clusters) changes positions during the center update; 3) N-body simulation has source and target sets of the same size, whereas the K-means target set (clusters) is generally much smaller than the source set (points). N-body simulation shows the strength of the AccD hybrid bound computation (Two-landmark + Trace-based + Group-level) on iterative algorithms with identical source and target sets.
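A toy version of one such time step, with a simplified unit-mass force model standing in for the actual simulation kernel, might look like:

```python
def nbody_step(pos, vel, radius, dt=0.01, g=1.0):
    """One toy N-body time step: each particle is pulled toward every
    neighbor within `radius` (simplified unit-mass gravity; the force
    law and integrator here are illustrative assumptions)."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = sum(x * x for x in d)
            if 0 < r2 <= radius * radius:  # only neighbors within the radius
                inv_r3 = g / (r2 ** 1.5)
                for k in range(3):
                    acc[i][k] += d[k] * inv_r3
    new_vel = [[vel[i][k] + acc[i][k] * dt for k in range(3)] for i in range(n)]
    new_pos = [[pos[i][k] + new_vel[i][k] * dt for k in range(3)] for i in range(n)]
    return new_pos, new_vel
```

Note that source and target are the same particle array and both positions change every step, which is exactly the structural difference from K-means described above.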
VII-A Experiment Setup
Tools and Metrics
In our evaluation, we use the Intel Stratix 10 DE10-Pro [DE10Pro] as the FPGA accelerator and run the host-side software program on an Intel Xeon Silver 4110 processor [intelXeon] (8 cores, 16 threads, 2.1 GHz base clock frequency, 85 W TDP). The DE10-Pro FPGA has 378,000 logic elements (LEs), 128,160 adaptive logic modules (ALMs), 512,640 ALM registers, 648 DSPs, and 1,537 M20K memory blocks. We implement the AccD design on the DE10-Pro using the Intel Quartus Prime Software Suite [IntelQuartus] with the Intel OpenCL SDK included. To measure the system power consumption (Watt) accurately, we use the off-the-shelf Poniie PN2000 as an external power meter to obtain the runtime power of the Xeon CPU and the DE10-Pro FPGA.
Name | Techniques | Description
Baseline | Standard algorithm without any optimization, CPU | Naive for-loop based implementation on CPU
TOP | Point-based triangle-inequality optimized algorithms, CPU | TOP [Topframework]-optimized distance-related algorithm running on CPU
CBLAS | CBLAS library accelerated algorithms, CPU | Standard distance-related algorithm with CBLAS [openblas] acceleration
AccD | Algorithmic-hardware co-design, CPU-FPGA platform | GTI filtering and FPGA acceleration of distance computations
K-means:
Dataset | Size | Dimension | #Cluster
Poker Hand | 25,010 | 11 | 158
Smartwatch Sens | 58,371 | 12 | 242
Healthy Older People | 75,128 | 9 | 274
KDD Cup 2004 | 285,409 | 74 | 534
Kegg Net Undirected | 65,554 | 28 | 256
Ipums | 70,187 | 60 | 265

KNN-join:
Dataset | Dimension | #Source
Harddrive1 | 64 | 68,411
Kegg Net Directed | 24 | 53,413
3D Spatial Network | 3 | 434,874
KDD Cup 1998 | 56 | 95,413
Skin NonSkin | 4 | 245,057
Protein | 11 | 26,611

N-body Simulation:
Dataset | #Particle
P1 | 16,384
P2 | 32,768
P3 | 59,049
P4 | 78,125
P5 | 177,147
P6 | 262,144
Implementations
The CPU-based implementations consist of three types of programs: the naive for-loop sequential implementation without any optimization (selected as our Baseline for normalizing speedup and energy efficiency), the algorithm optimized by the TOP [Topframework] framework, and the algorithm optimized by the CBLAS [openblas] computing library. Note that a TOP + CBLAS implementation is not included in our evaluation: after applying TOP's point-based TI filtering, each point in the source set has a distinct list of candidate points from the target set for distance computation, whereas CBLAS requires uniformity in the distance computations. Combining the TOP and CBLAS optimizations is therefore challenging.
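The uniformity requirement comes from how a BLAS library computes all-pairs distances: the squared Euclidean distance expands as ||a − b||^2 = ||a||^2 + ||b||^2 − 2 a·b, so the cross terms for all pairs reduce to one dense matrix multiplication. A per-point candidate list, such as the one produced by TOP filtering, destroys this dense structure. A sketch of the dense trick using NumPy:

```python
import numpy as np

def pairwise_sq_dists(A, B):
    """All-pairs squared Euclidean distances via one matrix multiply:
    ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b.  The dense GEMM (A @ B.T)
    is what a BLAS library accelerates; irregular per-point candidate
    lists would break this formulation."""
    a2 = (A * A).sum(axis=1)[:, None]   # shape (n, 1)
    b2 = (B * B).sum(axis=1)[None, :]   # shape (1, m)
    return a2 + b2 - 2.0 * A @ B.T
```

Every source point must be compared against the same target matrix for the GEMM to apply, which is exactly the regularity that AccD's group-level filtering preserves and TOP's point-level filtering does not.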
Dataset
In the evaluation, we use six datasets for each algorithm. The selected datasets cover a wide spectrum of mainstream datasets, including datasets from the UCI Machine Learning Repository [UCIdataset] and datasets used by previous papers [Topframework, ding2015yinyang, chen2017sweet] in related domains. Details of these datasets are listed in Table V. Note that the KNN-join algorithm finds the Top-1000 closest neighbors of each query point.
VII-B Comparison with Software Implementation
Performance Comparison
As shown in Figure 8, TOP, CBLAS, and AccD achieve average speedups of , , and compared with Baseline across all algorithm and dataset settings, respectively. AccD consistently maintains the highest speedup among these implementations. This is largely due to the AccD GTI optimization, which reduces the number of distance computations, combined with its efficient hardware acceleration of the remaining distance computations on the FPGA.
We also observe that the TOP implementation shows its strength on large datasets. For example, on the 3D Spatial Network dataset () in KNN-join, the TOP implementation achieves speedup. This is because the fine-grained point-based TI optimization of TOP removes most (more than 90%) of the unnecessary distance computations, which benefits the overall performance to a great extent. Note that the intrinsic point distribution of a dataset also affects the filtering performance of TOP, but in general, a larger dataset allows TOP to spot and remove more redundant computations.
We also notice that the CBLAS implementation performs well on datasets with relatively high dimensionality. For example, on the KDD Cup 2004 dataset () in the K-means algorithm, CBLAS achieves speedup over Baseline, which is higher than its performance on the other K-means datasets. This is because, on high-dimension datasets, the CBLAS implementation gains more benefit from parallelized computing and more regular memory access, whereas in low-dimension settings the same optimization yields only minor speedup.
Our AccD design achieves considerable speedup on datasets with large size and high dimensionality. For example, on the KDD Cup 2004 () and Ipums () datasets in K-means, AccD achieves and speedup over Baseline, also significantly higher than both the TOP and CBLAS implementations. This conclusion extends to KNN-join as well, such as the speedup on the KDD Cup 1998 dataset (, ). This is because our AccD design effectively reconciles the benefits of the GTI optimization and the FPGA acceleration: the former provides the opportunity to reduce distance computations at the algorithm level, while the latter boosts performance from the hardware acceleration perspective. More importantly, our AccD design balances these two benefits to maximize the overall performance.
Energy Comparison
The energy efficiency of the AccD design is also significant. For example, on the K-means algorithm, AccD designs deliver on average better energy efficiency than Baseline, which is significantly higher than the TOP and CBLAS implementations. There are two main reasons behind these results: 1) much lower power consumption: the AccD CPU-FPGA design only consumes across all algorithm and dataset settings, whereas the Intel Xeon CPU consumes at least and on the TOP and CBLAS implementations, respectively; 2) considerable performance: the AccD design achieves a much better speedup (more than on average) compared with TOP and CBLAS, which contributes to the overall design energy efficiency.
Among these implementations, the CBLAS implementation has the lowest energy efficiency, since it relies on the multi-core parallel processing capability of the CPU, which improves performance at the cost of much higher power consumption (average ). TOP only leverages the single-core processing capability of the CPU and achieves moderate performance through effective distance computation reduction, which results in lower power consumption (average ) and higher energy efficiency (average ) compared with Baseline. Different from the TOP and CBLAS implementations, the AccD design is built upon a low-power platform while providing considerable performance, which yields a far better energy-performance trade-off.
VII-C Performance Benefits Analysis
To analyze the performance benefits of the AccD CPU-FPGA design in detail, we use K-means as an example algorithm. Specifically, we build four implementations for comparison: 1) TOP K-means on CPU; 2) TOP K-means on the CPU-FPGA platform; 3) AccD K-means on CPU; 4) AccD K-means on the CPU-FPGA platform. Note that TOP K-means is designed for sequential CPUs, and there is no publicly available TOP implementation on CPU-FPGA platforms. For a fair comparison, we implement TOP K-means on the CPU-FPGA platform with memory optimizations (inter-group and intra-group memory optimization) and distance computation kernel optimization (vector-matrix multiplication). These optimizations improve data reuse and memory access performance.
We compute the normalized speedup of each implementation w.r.t. the naive for-loop based K-means implementation on CPU.
As shown in Figure 10, AccD K-means on the CPU-FPGA platform always delivers the best overall speedup among these implementations. We also observe that TOP K-means achieves an average speedup on CPU; however, directly porting this optimization to the CPU-FPGA platform can even lead to inferior performance (average ). Even though we add several possible optimizations, applying such fine-grained TI optimization from TOP still causes a large divergence of computation among points, leading to low data reuse and inefficient memory access.
We also notice that the AccD design on CPU achieves a lower speedup (average ) compared with TOP (average ), since its coarse-grained GTI optimization identifies fewer unnecessary distance computations. However, when combining the AccD design with the CPU-FPGA platform, the benefits of the AccD GTI optimization become prominent (average ), since it maintains computation regularity while reducing memory overhead to facilitate hardware acceleration on the FPGA. In contrast, applying optimizations that maximize the algorithm-level benefits while ignoring hardware-level properties results in poor performance, as in the TOP (CPU-FPGA) implementation. Moreover, the comparison of AccD (CPU) and AccD (CPU-FPGA) demonstrates the effectiveness of using the FPGA as a hardware accelerator, which delivers additional speedup compared with the software-only solution.
VIII Conclusion
In this paper, we present the AccD compiler framework to accelerate distance-related algorithms on the CPU-FPGA platform. Specifically, AccD leverages a simple but expressive language construct (DDSL) to unify distance-related algorithms, and an optimizing compiler to improve design performance from the algorithmic and hardware perspectives systematically and automatically. Rigorous experiments on three popular algorithms (K-means, KNN-join, and N-body simulation) demonstrate that AccD is a powerful and comprehensive framework for hardware acceleration of distance-related algorithms on modern CPU-FPGA platforms.