Field Programmable Gate Arrays (FPGAs) have become increasingly popular since their introduction [trimberger2018three]. Due to their (partial) run-time reconfigurability, short time-to-market, and lower prototype costs, as compared to Application-Specific Integrated Circuits (ASICs), FPGAs are preferred in a wide variety of applications. These range from high-performance computing clusters and server platforms that offer “FPGAs as a Service” to embedded and cyber-physical systems that perform complex computations on the configurable arrays [watanabe2019implementation]. The current generation of FPGAs is equipped with a wide range of capabilities that can be used to design a Programmable System (on a Chip) by including hard IPs (IC realization) of the low-power ARM A9 processor core and other commonly used hardware accelerators, such as video codecs [crockett2014zynq]. However, compared to ASICs, FPGAs deliver lower performance and consume more power, which makes them considerably less energy-efficient.
The Approximate Computing paradigm offers a direction of research, in which the intermediate computational units can be approximated without “significantly” degrading the output quality, to obtain savings in power/energy consumption and latency [chippa2013analysis]. This quality of error-tolerance is exhibited by applications in the fields of recognition, mining, and synthesis, due to the following four factors:
redundancy in the processed data,
algorithms with error attenuating patterns,
non-existence of a unique golden output, and
differences in the output quality that are imperceptible to end-users.
Since its re-emergence, many research works from academia and industry have exploited this phenomenon across the hardware [jiang2015comparative, jiang2016comparative, mittal2016survey, hashemi2015drum, saadat2019approximate, saadat2018minimally, venkataramani2013quality, sampson2011enerj, prabakaran2018demas, echavarria2016fau, ullah2018smapproxlib, ullah2018area] and software [mishra2014iact, baek2010green, khudia2015rumba, yazdanbakhsh2015axilog] layers to obtain power/energy/latency savings.
Most of the current works on approximate circuits (AC) primarily focus on obtaining energy/power/latency savings in ASIC-based systems. Previous studies have illustrated that ASIC-based approximate computing principles and techniques offer asymmetric savings when implemented on FPGAs [prabakaran2018demas] [ullah2018smapproxlib] [ullah2018area]. State-of-the-art ACs for adders and multipliers can offer up to % savings in energy when synthesized for ASICs. When synthesized for FPGAs, however, these designs offer minimal or asymmetric savings, and at times negative savings, i.e., an increase in resources. This is primarily due to the architectural differences between ASICs and FPGAs: the required functionality is realized using logic gates in ASICs, and using Lookup Tables (LUTs) made of SRAM elements in FPGAs. Therefore, an AC that offers significant savings and introduces the least error (pareto-optimal) for ASICs might not necessarily be pareto-optimal for FPGAs. Note that by pareto-optimal approximate circuits we mean the set of all circuits in the library that are not dominated by any other circuit in the library in terms of the evaluation metrics.
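The notion of dominance used in this definition can be made concrete with a small sketch. It assumes that every evaluation metric is to be minimized (e.g., resource usage and error); the circuit names and metric values below are hypothetical and serve only as an illustration.

```python
def dominates(a, b):
    """True if metrics `a` are no worse than `b` everywhere and strictly
    better in at least one metric (all metrics are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_optimal(library):
    """Return the names of all circuits that no other circuit dominates.

    `library` maps a circuit name to a tuple of metrics, e.g. (resources, error).
    """
    return {
        name
        for name, metrics in library.items()
        if not any(
            dominates(other, metrics)
            for other_name, other in library.items()
            if other_name != name
        )
    }


if __name__ == "__main__":
    # Hypothetical (resource count, error) pairs, not measured values.
    library = {
        "mul_a": (100, 0.0),
        "mul_b": (80, 0.5),
        "mul_c": (90, 0.7),   # worse than mul_b in both metrics
        "mul_d": (60, 2.0),
    }
    print(sorted(pareto_optimal(library)))
```

Because the front depends on the metric values, the same library can yield different pareto-optimal sets once the ASIC metrics are replaced by FPGA metrics.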
Furthermore, the works presented in [echavarria2016fau, prabakaran2018demas, ullah2018area, ullah2018smapproxlib] have developed FPGA-based approximate circuits by analyzing the architecture of the target FPGA. These techniques are typically not scalable, due to their manual lookup table optimizations and approximations, and do not offer multiple pareto-optimal design points that trade-off between power consumption and introduced error. To further illustrate these behavioral differences between ASICs and FPGAs, we present a motivational analysis of our work in the next sub-section.
I-A Motivational Analysis
We synthesize and implement a small subset of unsigned approximate multiplier designs from the library of evolutionary approximate arithmetic circuits [mrazek2017evoapprox8b] and the state-of-the-art FPGA-based approximate multiplier designs [ullah2018area]. These circuits were synthesized and implemented for the Xilinx xc7vx485tffg1157-1 FPGA using the Vivado tool-chain, with zero Digital Signal Processing blocks enabled to ensure that the designs are mapped to the reconfigurable logic (see details in Section III). We also evaluate the output quality of these approximate circuits with the help of their behavioral models by computing their Mean Error Distance (MED), which we define as the average of the absolute error difference across all the input combinations relative to the maximum number of outputs [han2013approximate]. Based on the resources required for each of these designs and their MED, we extract the pareto-front of approximate multipliers for the target FPGA and compare them to the pareto-front obtained when the same designs are synthesized for ASICs. The results of these experiments are presented in Fig. 1. From these results, we make the following key observations:
The ACs that are pareto-optimal for ASICs (ASIC-ACs) are not necessarily pareto-optimal for FPGAs (FPGA-ACs). As discussed earlier, this is primarily due to the differences in realizing the logic functions across the ASIC and FPGA platforms.
The time required for synthesizing only % of the approximate multiplier library is ~ days. This huge time requirement can also be attributed to the architectural differences between FPGAs and ASICs. The synthesis and routing algorithms of an FPGA tool-flow need to map the functionality to existing hardware blocks on the target FPGA while optimizing for various factors and constraints to maximize performance.
State-of-the-art FPGA-based approximate multipliers [ullah2018area] are not pareto-optimal when compared to the % subset of approximate multipliers from [mrazek2017evoapprox8b]. Similarly, the other FPGA-based approximate adders and multipliers presented in [prabakaran2018demas] [ullah2018smapproxlib] are neither pareto-optimal nor scalable. Due to their manual optimizations and circuit designs, they are not effective in achieving similar performance/power trade-offs, as illustrated by the evolutionary approximate arithmetic library for larger bit-widths.
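The MED computation used in the analysis above can be sketched from a behavioral model as follows. This is a minimal illustration, not the evaluation code of this work: the truncating multiplier is a hypothetical stand-in for an approximate circuit, and dividing by the largest exact output is an assumed reading of the normalization described in the text.

```python
def mean_error_distance(approx, exact, bits, normalize=True):
    """Average |approx - exact| over all 2^bits x 2^bits operand pairs.

    With normalize=True the average is divided by the largest exact output
    (assumption: this is how the normalization in the text is meant).
    """
    operands = range(1 << bits)
    errors = [abs(approx(a, b) - exact(a, b)) for a in operands for b in operands]
    med = sum(errors) / len(errors)
    if normalize:
        max_output = max(exact(a, b) for a in operands for b in operands)
        if max_output > 0:
            med /= max_output
    return med


def exact_mul(a, b):
    """Exact unsigned multiplication (the reference behavior)."""
    return a * b


def truncated_mul(a, b, dropped_bits=1):
    """Hypothetical approximate multiplier: zero the low bits of each operand."""
    mask = ~((1 << dropped_bits) - 1)
    return (a & mask) * (b & mask)


if __name__ == "__main__":
    # Exhaustive evaluation of an 8-bit multiplier needs 65,536 operand pairs.
    print(mean_error_distance(truncated_mul, exact_mul, bits=8))
```

Exhaustive evaluation is practical for 8-bit operands; for wider operands, sampling the input space would be the usual alternative.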
Based on these observations, we have identified the following research challenges:
Based on the time required for synthesizing and implementing a small subset of the designs for the target FPGA, the time required for exhaustively exploring all designs in the data-set would be in the magnitude of s of hours, or a couple of weeks.
How to efficiently reduce the time required for exploring the design-space of approximate arithmetic units in FPGAs?
Can we explore the concepts of machine learning in order to reduce the exploration time by estimating FPGA parameters? If yes, which machine learning algorithm?
There is a lack of pareto-optimal FPGA-ACs that offer a design-space trade-off between the resources consumed and the error introduced.
How can we determine a set of pareto-optimal FPGA-ACs that can be deployed in error-tolerant applications to obtain power/energy/latency savings?
There is no systematic automation framework that can be used to develop FPGA-ACs for a given error-tolerant application and its quality requirements.
How to systematically deploy the FPGA-ACs in a given error tolerant application to maximize performance or power/energy savings?
To address these research challenges, we propose the following novel contributions:
We propose the ApproxFPGAs methodology that deploys machine learning (ML) models, which can be used to estimate the power and latency of the approximate circuits. These ML models are trained using a small subset of the evolutionary approximate circuits [mrazek2017evoapprox8b].
Based on the estimates, we propose to construct a pseudo-pareto-front, which can be used to determine the set of pseudo-pareto-optimal approximate circuits for varying bit-widths of the approximate arithmetic units. These models can then be subsequently synthesized for the target FPGA to measure the exact power and latency of these FPGA-ACs.
We also perform a case-study by deploying these pareto-optimal FPGA-ACs in a state-of-the-art automation framework that can systematically generate approximate accelerators, which can be deployed in FPGA-based systems to achieve high-performance and/or low power/energy consumption.
II The ApproxFPGAs Methodology
Overview: Fig. 2 presents an overview of the proposed methodology. The complete procedure can be divided into two sub-parts,
the first part deals with the training and testing of the ML models, which can be used to efficiently estimate the hardware resources of a given approximate arithmetic design, while
the second part deals with the construction of the pareto-optimal FPGA-ACs, which can be deployed in error-tolerant applications.
Inputs: We start by compiling the library of approximate arithmetic circuits that need to be analyzed and deployed in the target application. Without loss of generality, in this work, we consider the evolutionary library of approximate adder and multiplier circuits for illustrating the benefits of our methodology [mrazek2017evoapprox8b]. Note, the use of other state-of-the-art designs is orthogonal to our approach and they can be appropriately included, with necessary modifications, in the library of approximate circuits.
Exhaustive Exploration: Due to the large number of designs present in the library, the time required for exploring all the designs, exhaustively, might be quite large, as initially stated in Section I. Fig. 3 presents a brief illustration of the estimated time required for synthesizing all the approximate circuits present in the library for the target FPGA. As can be observed, when the number of ACs in the library increases, the time required for exploring the designs rises and reaches a magnitude of s of hours. Therefore, exhaustive exploration is not a feasible option for identifying the pareto-optimal approximate circuits for FPGAs. Fig. 3 also illustrates the savings in exploration time when the proposed ApproxFPGAs methodology is used for exploration as opposed to exhaustive exploration. The exploration time is reduced by a factor of ~ from days to days, including the time required for synthesizing the data-set, training and evaluating the ML models, and re-synthesizing the pareto-optimal FPGA-ACs.
ML-Model Learning: Due to the infeasible time requirements of exhaustive exploration, we propose to train and evaluate a wide variety of statistical and machine learning (S/ML) models, which can be used to estimate the resource requirements of an approximate circuit, given its hardware description. These S/ML models can be used to estimate FPGA parameters like power consumption (), latency (), and area (#). Training these models requires a labeled data-set, with the FPGA parameters as the output labels and the hardware description of the AC as the input data. We build this data-set by randomly extracting a % subset of the complete library of ACs and synthesizing them for the target FPGA platform. This subset is further partitioned into training (%) and validation (%) data-sets, which are then used to train and evaluate the various machine learning models, respectively. Without loss of generality, in this work, we evaluate the applicability of the most-commonly used light-weight S/ML models (see Table I) to reduce the time required for exploring the library of ACs. We iteratively evaluate the accuracy of the models and modify their parameters based on the correlation obtained on the validation data-set to further improve the model’s accuracy. Instead of synthesizing and implementing each circuit in the library, which might take weeks to months, we can roughly estimate the FPGA parameters of all circuits using these models in the order of seconds. To estimate the accuracy of these ML models, we propose the fidelity metric, which evaluates the relationship between the measured (mes) and estimated (est) FPGA parameters for any two ACs in the library. We compute the fidelity () of a set of ACs, , as:
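$$\mathcal{F}(C) = \frac{1}{|C|^{2}} \sum_{c_i \in C} \sum_{c_j \in C} f(c_i, c_j),$$
where $f(c_i, c_j)$ denotes the correctness of the relationship between the estimated and measured FPGA parameters:
$$f(c_i, c_j) = \begin{cases} 1, & \text{if } \big(est(c_i) \diamond est(c_j)\big) \wedge \big(mes(c_i) \diamond mes(c_j)\big), \\ 0, & \text{otherwise,} \end{cases}$$
where $\diamond \in \{<, =, >\}$ denotes one of the following relations between the FPGA parameters of the ACs. Due to their availability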
|ML1||Regression w.r.t ASIC-AC Power|
|ML2||Regression w.r.t ASIC-AC Latency|
|ML3||Regression w.r.t ASIC-AC Area|
|ML7||Adaptive Boosting (AdaBoost)|
|ML12||Coordinate Descent (Lasso)|
|ML13||Least Angle Regression|
|ML15||Stochastic Gradient Descent|
Multi-Layer Perceptron (MLP)
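The fidelity metric described above can be sketched in a few lines. The sketch assumes that fidelity is the fraction of ordered AC pairs (including each AC paired with itself) for which the estimated values stand in the same relation (<, =, or >) as the measured values; the exact normalization is an assumption, and the function names are illustrative.

```python
from itertools import product


def relation(a, b):
    """Encode the relation between a and b as -1 (a < b), 0 (a == b), or 1 (a > b)."""
    return (a > b) - (a < b)


def fidelity(measured, estimated):
    """Fraction of ordered pairs whose estimated relation matches the measured one.

    `measured` and `estimated` are equally long sequences holding one FPGA
    parameter (e.g., latency) per approximate circuit, in the same order.
    """
    if len(measured) != len(estimated) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(measured)
    agreeing = sum(
        relation(measured[i], measured[j]) == relation(estimated[i], estimated[j])
        for i, j in product(range(n), repeat=2)
    )
    return agreeing / (n * n)
```

A model can have a large absolute estimation error and still reach a fidelity of 1.0, as long as it ranks the circuits in the same order as the measurements. This is the property that matters when the estimates are only used to select candidates for a pareto-front.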
Pareto Construction: Based on the outcome of our experiments (see Section IV), we select the best S/ML models to estimate the FPGA parameters of all ACs in the library. Based on these parameter estimates, we can determine the pareto-optimal FPGA-ACs. However, we have observed that these models have limited fidelity, so a truly pareto-optimal AC can appear to be dominated by an AC whose parameters were estimated incorrectly. Therefore, we propose to construct multiple pseudo-pareto-fronts from the input set (library) of ACs $C$. We determine the first set of pseudo-pareto-optimal ACs ($P_1$) from the initial set of all ACs $C$. Next, we eliminate all these pseudo-pareto-points from the input set to construct the second pseudo-pareto-front, i.e., using $C \setminus P_1$ as the input, we determine $P_2$. Similarly, we construct the third pseudo-pareto-front $P_3$, using the input $C \setminus (P_1 \cup P_2)$, and so on. By constructing multiple pseudo-pareto-fronts, we mitigate the inaccuracies associated with our S/ML models. The ACs lying on these pseudo-pareto-fronts can subsequently be synthesized again using our work-flow to determine the accurate FPGA parameters and the resources required. Hence, constructing multiple pseudo-pareto-fronts requires synthesizing additional ACs.
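The successive construction of pseudo-pareto-fronts can be sketched as a simple peeling loop. The sketch assumes that all metrics are minimized and that the estimates are given as tuples such as (estimated latency, error); the function and variable names are illustrative and not taken from the released code.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pseudo_pareto_fronts(estimates, num_fronts):
    """Peel off up to `num_fronts` successive fronts from the estimated metrics.

    `estimates` maps a circuit name to its estimated metric tuple. Each front
    is computed on the circuits left after removing all earlier fronts.
    """
    remaining = dict(estimates)
    fronts = []
    for _ in range(num_fronts):
        if not remaining:
            break
        front = {
            name
            for name, metrics in remaining.items()
            if not any(
                dominates(other, metrics)
                for other_name, other in remaining.items()
                if other_name != name
            )
        }
        fronts.append(front)
        for name in front:
            del remaining[name]
    return fronts


if __name__ == "__main__":
    # Hypothetical (estimated latency, error) values.
    estimates = {"a": (1, 5), "b": (2, 3), "c": (3, 4), "d": (4, 1), "e": (5, 5)}
    fronts = pseudo_pareto_fronts(estimates, num_fronts=2)
    # Circuits on any of the fronts are the candidates for re-synthesis.
    print(sorted(set().union(*fronts)))
```

Increasing `num_fronts` trades additional synthesis runs for a lower chance of discarding a circuit whose estimate was pessimistic.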
Based on the real FPGA parameter measurements obtained from the synthesis and implementation reports of Vivado, we construct an open-source library of pareto-optimal FPGA-ACs that offers a trade-off between the output quality and the resources consumed. This library can be subsequently utilized by application and system developers, to further maximize performance or power and energy savings obtained while satisfying the quality constraints of the application. The RTL and behavioral models of the FPGA-ACs are open-source and are available online at https://github.com/ehw-fit/approx-fpgas.
AutoAx-FPGA: To incorporate the set of pareto-optimal FPGA-ACs in different error-tolerant applications, we modify the state-of-the-art AutoAx [mrazek2019autoax] framework to include the functionality of designing ACs for a given application that can be deployed in FPGA-based systems. The traditional AutoAx framework searches the design-space of approximate components to select and combine approximation components, in order to generate an approximate hardware accelerator that maximizes the energy savings. Initially, a set of random approximation assignments are evaluated for the target accelerator circuit, to get the quality of results (QoR) and hardware (HW) cost of the accelerator. Based on these values, QoR and HW cost estimators are constructed, which can be used to explore the complete design-space of approximate components for the given accelerator and to determine the set of pareto-optimal circuits for the given application. To generate approximate accelerators for a given application, which can be used in low-power and/or high-performance FPGA-based systems, we propose to include the following functionality in AutoAx:
we replace the library of pareto-optimal ASIC-ACs with the set of pareto-optimal FPGA-ACs obtained from the proposed ApproxFPGAs methodology,
we modify the estimators used in AutoAx to estimate the FPGA parameters of the approximate accelerator instead of their ASIC-based HW costs.
III Experimental Setup
The RTL (in Verilog) and behavioral models (in C) of the evolutionary approximate arithmetic circuits are open-source and readily accessible at https://github.com/ehw-fit/evoapproxlib. These designs are synthesized and implemented (i.e., place & route) using the Vivado Design Suite for the target FPGA xc7vx485tffg1157-1, to extract their area, power, and timing reports. We restrict the placement and routing algorithms of Xilinx Vivado by disabling the use of the FPGA’s DSP logic blocks, which ensures that the designs are mapped onto the configurable logic. These reports are used to extract the FPGA parameters, which are subsequently used for training and evaluating the S/ML models. The S/ML models are implemented, trained, and tested in Python with the help of the scikit-learn library. The RTL designs were synthesized on an Intel Core i CPU with GB of internal memory and a GB Solid-State Drive (SSD). The S/ML models were trained and evaluated on an Intel Xeon CPU E with GB of internal memory. An overview of our work-flow is presented in Fig. 4.
IV Results & Discussion
Fidelity: First, we illustrate the accuracy of the S/ML models that we have evaluated inside our ApproxFPGAs framework. We do this by studying the fidelity of these models with respect to the three important FPGA parameters, namely, latency (), power (), and area (#). The fidelity of these models is evaluated on the validation data-set. The results of these experiments are presented in Fig. 5. From these results, we make the following key observations:
Tree-based methods, like Decision Trees and Random Forest, achieve above-average accuracy in estimating the FPGA parameters and retaining their relationship to the other ACs.
Based on further analysis, we also observed that the models do not generalize well across bit-widths, i.e., estimating the FPGA parameters of higher bit-width (-/-bit) designs (adder or multiplier) using a model learned from lower bit-width (-bit) designs is not very effective. On average, we observed that the fidelity for the higher bit-width designs decreased from % to % when using models trained with lower bit-width designs as opposed to designs of the same bit-width.
Ridge models such as Kernel Ridge and Bayesian Ridge, typically, illustrate the best fidelity.
We also summarize the top- S/ML models for each FPGA parameter, along with the fidelity achieved for each case, in Table II. Likewise, we identify the models that achieve maximum fidelity when obtained using regression analysis on their corresponding ASIC parameters.
|FPGA Latency||FPGA Power||FPGA Area|
Correlation of ML Models: Next, we illustrate the correlation between the estimated FPGA parameters and their measured values when the top- S/ML models are used on the library of approximate multipliers. These results are illustrated in Fig. 6. From these results, we make the following key observations:
The Bayesian Ridge and PLS regression techniques can be used as standalone techniques to estimate all three FPGA parameters, as they are among the top- models for all three parameters.
Statistical regression with respect to the corresponding ASIC parameters is equally useful in estimating the FPGA parameters of the given circuit.
Due to the ~% bias illustrated by the model, latency is not estimated accurately, especially using regression with ASIC parameters and Kernel Ridge. This leads to a scenario where the circuit latency is under-estimated by the model, including certain pareto-optimal designs.
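The regression with respect to ASIC parameters referred to in these observations can be illustrated with a single-variable least-squares fit. This is only a sketch: the text does not state the regression form, so a linear relation between the ASIC parameter and the FPGA parameter is an assumption here, and all numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) samples of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Estimate the FPGA parameter for an ASIC parameter value x."""
    slope, intercept = model
    return slope * x + intercept


if __name__ == "__main__":
    # Hypothetical training pairs: ASIC latency (x) and measured FPGA latency (y)
    # of the circuits in the synthesized subset.
    asic_latency = [1.0, 1.4, 2.1, 2.9]
    fpga_latency = [5.2, 6.1, 7.9, 9.8]
    model = fit_line(asic_latency, fpga_latency)
    print(predict(model, 1.8))
```

A constant offset between the two platforms does not harm such an estimator when it is used for ranking, since ranking only depends on preserving the order of the circuits.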
Construction of the Pareto-fronts: As discussed earlier, we construct multiple pareto-fronts to ensure that truly pareto-optimal designs that fall outside the first estimated front are not missed by our methodology. To this end, we illustrate the benefits of constructing multiple pareto-fronts sequentially for estimating the FPGA latency using the top- S/ML models and Regression with respect to ASIC latency. Fig. 7 illustrates the results of constructing , , and pareto-fronts using the technique discussed in Section II. From these results, we make the following key observations:
Using ML-based techniques for estimating the FPGA parameters reduces the total number of synthesized circuits by a factor of ~ to , including the training and validation data-set and synthesis of pseudo-pareto-optimal points, instead of synthesizing the complete library of approximate multipliers.
The ML models are highly effective in selecting the pseudo-pareto-optimal designs that have to be re-synthesized, as compared to the regression analysis w.r.t ASIC latency, which increases the number of circuits to be explored from in Bayesian Ridge to , effectively doubling the number of new circuits to be synthesized.
The best results are obtained when we effectively combine the pseudo-pareto-optimal points obtained from multiple ML models. Therefore, we need to consider a union of all the pareto-fronts to determine the final set of pareto-optimal FPGA-ACs.
Pareto-Optimal FPGA-ACs: Fig. 8 illustrates the set of FPGA-ACs synthesized to obtain the subset of pareto-optimal FPGA-ACs using our proposed ApproxFPGAs methodology on the library of -, -bit adders and , multipliers. Although we have not exhaustively determined and synthesized all the pareto-optimal designs, we have reduced the exploration time by a factor of ~ to obtain, on average, % coverage of the pareto-optimal designs present in the library of approximate circuits. This is clearly visible in the pareto-fronts of libraries with a higher number of ACs, such as the approximate multipliers, and less so for libraries with a lower number of circuits, like the approximate adders. Similarly, we have generated the pareto-optimal ACs for the -bit approximate adder and approximate multiplier.
AutoAx-FPGA: Finally, we present the results of modifying the AutoAx framework to include the functionality of generating pareto-optimal accelerators for FPGA-based systems. We evaluated the modified AutoAx-FPGA methodology using a Gaussian Filter as a case-study and the input of pareto-optimal approximate multipliers and -bit approximate adders. The QoR of the Gaussian filter’s output is estimated using the structural similarity index (SSIM), for which we build an estimator. First, we generate a training and validation data-set of random approximate circuits for the given Gaussian filter, which were synthesized and implemented using the Vivado work-flow to measure their FPGA parameters such as area, latency, and power consumption. Similar to the AutoAx methodology, we constructed estimators that can determine the FPGA parameters for the other circuits in the library, and constructed different pareto-fronts using the hill-climber algorithm. We thereby reduce the number of accelerator circuits to be explored from to , , and possible designs for each of the FPGA parameter-QoR scenarios, namely, latency-SSIM, power-SSIM, and area-SSIM, respectively. Each of these designs is synthesized in the Vivado work-flow and their behavioral models are deployed in the image processing environment to measure their FPGA parameters and determine their SSIM. These results are illustrated in Fig. 9. We can observe that AutoAx-FPGA achieves better results when compared to a simple random search. Furthermore, we can also observe that optimizing for area or power improves the savings obtained in the other FPGA parameters as well, which is not the case when we optimize for latency. For example, in the case where we optimize for latency in Fig. 9, we would expect the SSIM-Latency pareto-front to encompass the best ACs in terms of latency, but this is not the case as the latency-estimator is not very effective.
However, since the other two pareto-fronts improve the savings for other FPGA parameters as well, they outperform the SSIM-Latency pareto-front ACs, even in terms of latency.
We presented the ApproxFPGAs methodology, for embracing the use of current state-of-the-art ASIC-based approximate circuits for FPGA-based systems. We synthesize a partial subset of the library of arithmetic circuits to establish the training and validation data-set, which can be used to teach and evaluate the models’ applicability. Based on the outcome, we chose the top- models that achieve the best fidelity to estimate the FPGA parameters for all the circuits in the data-set, which is used to subsequently construct multiple pseudo-pareto-fronts. The circuits on these pareto-fronts are re-synthesized to measure the correct FPGA parameters and determine the final set of pareto-optimal FPGA-ACs, which can be used by system developers and application-designers to develop low-power or high-performance FPGA-based accelerators. This set of pareto-optimal arithmetic FPGA-ACs is open-source and available online at https://github.com/ehw-fit/approx-fpgas. Finally, we evaluate the applicability of these pareto-optimal ACs by using a modified version of the state-of-the-art AutoAx framework to illustrate the benefits obtained.
This work was partially supported by Doctoral College Resilient Embedded Systems which is run jointly by TU Wien’s Faculty of Informatics and FH-Technikum Wien, and partially by Czech Science Foundation project 19-10137S.