Multiplierless MP-Kernel Machine for Energy-Efficient Edge Devices

06/03/2021 ∙ by Abhishek Ramdas Nair, et al. ∙ Indian Institute of Science ∙ Washington University in St. Louis

We present a novel framework for designing multiplierless kernel machines for resource-constrained platforms such as intelligent edge devices. The framework uses a piecewise linear (PWL) approximation based on a margin propagation (MP) technique and uses only addition/subtraction, shift, comparison, and register underflow/overflow operations. We propose a hardware-friendly MP-based inference and online training algorithm that has been optimized for a Field Programmable Gate Array (FPGA) platform. Our FPGA implementation eliminates the need for DSP units and reduces the number of LUTs. By reusing the same hardware for inference and training, we show that the platform can overcome classification errors and local minima artifacts that result from the MP approximation. Using the FPGA platform, we also show that the proposed multiplierless MP-kernel machine achieves superior power, performance, and area compared to other comparable implementations.


I Introduction

Edge computing is transforming the way data is handled, processed, and delivered in various applications [satyanarayanan2017emergence][shi2016edge]. At the core of edge computing platforms are embedded machine learning (ML) algorithms that make decisions in real time, which endows these platforms with greater autonomy [wang2018edge][zhu2020toward]. Common edge-ML architectures reported in the literature are based on deep neural networks [li2018learning] or support vector machines [flouri2006training], and one of the active areas of research is improving the energy efficiency of these ML architectures, both for inference and for learning [cui2019stochastic]. To achieve this, the hardware models are first trained offline, and the trained models are then optimized for energy efficiency using pruning or sparsification techniques [han2015deep] before being deployed on the edge platform. Examples of such design flows are binary and quaternary neural networks, which are compressed, energy-efficient variants of deep neural network inference engines [zhao2020review]. However, robust training of quantized deep learning architectures still requires full-precision training, obviating their use on edge devices with low computational resources. Deep neural networks also require a large amount of training data [guo2018survey], which is generally unavailable for many applications. On the other hand, Support Vector Machines (SVMs) can provide good classification results with significantly less training data. The convex nature of SVM optimization makes its training more robust and more interpretable [cortes1995support]. SVMs have also been shown to outperform deep learning systems for detecting rare events [tang2008svms][palaniappan2012abnormal], which is generally the case for many IoT applications. One such IoT-based edge device architecture is depicted in Fig. 1. Data from video surveillance, auditory events, or motion sensors can be analyzed, and the system can be trained on the device to produce robust classification models.

Fig. 1: Edge Device with Online Learning Capability

At a fundamental level, SVMs and other ML architectures make extensive use of Matrix-Vector-Multiplication (MVM) operations. One way to improve overall system energy efficiency is to reduce the complexity of, or minimize the number of, MVM operations. Many approximate and reduced-precision MVM techniques have been proposed in the literature [choi2019approximate] and have improved energy efficiency without significantly sacrificing classification accuracy. Kernel machines have an inference engine similar to that of SVMs and share their execution characteristics [doi:10.1198/jasa.2003.s270]. This paper proposes a kernel machine architecture that eliminates the need for multipliers. Instead, it uses more fundamental but energy-efficient computational primitives like addition/subtraction, shift, and overflow/underflow operations. For example, in a 45 nm CMOS technology, an 8-bit multiplication has been shown to consume 0.2 pJ of energy, whereas an 8-bit addition/subtraction consumes only 0.03 pJ [horowitz20141]. Shift and comparison operations consume even less energy than additions and subtractions, and underflow/overflow operations do not consume any additional energy at all. To achieve this multiplierless mapping, the proposed architecture uses a margin propagation (MP) approximation technique originally proposed for analog computing [chakrabartty2004margin]. In this work, the MP approximation has been optimized for a digital edge computing platform, namely a field-programmable gate array (FPGA). We show that for inference the MP approximation is computed as part of learning, and all computational steps can be pipelined and parallelized across MP approximation operations. In addition to the MP approximation-based inference engine, we also report an online training algorithm that uses gradient descent in conjunction with hyper-parameter annealing. We show that the same hardware can be reused for both training and inference of the proposed MP kernel machine, and as a result the MP approximation errors can be effectively reduced. Since kernel machine inference and SVM inference share similar equations, we compare our system with traditional SVMs and show that MP-based kernel machines can achieve similar classification accuracy to floating-point SVMs without using multipliers or, equivalently, any MVMs.

The main contributions of this paper are as follows:

  • Design and optimization of MP-approximation using iterative binary addition, shift, comparison, and underflow/overflow operations.

  • Implementation of energy-efficient MP-based inference on an FPGA-based edge platform with multiplierless architecture that eliminates the need for DSP units.

  • Online training of MP-based kernel machine that reuses the inference hardware.

The rest of this paper is organized as follows. Section II briefly discusses related work. Section III explains the concept of multiplierless inner-product computation and the related MP-based approximation. Section IV presents the kernel machine formulation based on MP theory. Section V details the online training of the system. Section VI provides the FPGA implementation details and contrasts them with other hardware implementations of kernel machine algorithms. Section VII discusses results obtained on a few classification datasets and compares our multiplierless system with other SVM implementations, and Section VIII concludes the paper and discusses potential use cases.

II Related Work

Energy-efficient SVM implementations have been reported for both digital [genov2003kerneltron] and analog hardware [chakrabartty2003forward], which also exploit the inherent parallelism and regularity in MVM computation. In [kyrkou2011parallel], an SVM architecture has been reported using an array of processing elements in a systolic chain configuration implemented on an FPGA. By exploiting resource sharing capability among the processing units and efficient memory management, such an architecture permitted a higher throughput and a more scalable design.

A digital, optimized FPGA implementation of an SVM using a cascaded architecture has been reported in [kyrkou2013embedded]. A hardware reduction method is used to reduce the overhead of the additional stages in the cascade, leading to significant resource and power savings for embedded applications. The use of cascaded SVMs increases classification speed, but the improved performance comes at the cost of additional hardware resources.

Ideally, a single-layered SVM should be enough for classification at the edge. In [jiang2017fpga], SVM prediction on an FPGA platform has been explored for ultrasonic flaw detection. Since the SVM training phase needs a large amount of computation power and memory space, and retraining is often not required, the SVM training was realized offline. Therefore, the authors chose to accelerate only the classifier's decision function.

Training offline, however, is often not ideal for edge devices, primarily when the device operates in dynamic environments with ever-changing parameters. In such cases, retraining of the ML architecture becomes important. One such system was reported in [dey2018highly], which used an FPGA implementation of a sparse neural network capable of online training and inference. The platform could be reconfigured to trade off resource utilization against training time while keeping the network architecture the same. Even though the same platform can be used for training and inference, the memory management and the resource utilization varying with training time make it less suitable for deployment on an edge device. Reconfiguring the device would also require an additional microcontroller, increasing the system's overall power consumption.

As hardware complexity and power reduction are major concerns in these designs, the authors in [mandal2014implementation] implement a multiplierless SVM kernel for classification in hardware instead of using a conventional vector-product kernel. The data flow amongst processing elements is handled using a parallel, pipelined systolic array. In the multiplierless block, the authors use the Canonic Signed Digit (CSD) representation to reduce the maximum number of adders. CSD is a signed number representation that uses only the symbols -1, 0, and +1, with each non-zero position denoting the addition or subtraction of a power of 2. Despite being multiplierless, the system consumes many Digital Signal Processing (DSP) units because the polynomial kernel requires exponentiation, and the use of DSPs increases the power consumption of the design.
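As a concrete illustration of the CSD idea, the following Python sketch (our own illustration, not the exact encoder of [mandal2014implementation]) converts an integer multiplier constant into CSD digits, so that multiplication by that constant reduces to a small number of shifts and additions/subtractions:

```python
def to_csd(n):
    """Canonic signed-digit (CSD) encoding of an integer constant.

    Returns digits in {-1, 0, +1}, least-significant first, such that
    n == sum(d << i for i, d in enumerate(digits)) and no two adjacent
    digits are non-zero, which minimizes the number of add/subtract terms.
    """
    digits = []
    while n != 0:
        if n % 2 == 0:
            d = 0
        else:
            d = 2 - (n % 4)   # +1 if the next bit is 0, -1 if the next bit is 1
        digits.append(d)
        n = (n - d) // 2
    return digits

# Example: 7 = 111b needs two adders as shift-adds (4x + 2x + x),
# but its CSD form [-1, 0, 0, +1] computes 7x as (x << 3) - x with one subtractor.
print(to_csd(7))   # [-1, 0, 0, 1]
```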

Another system that uses multiplierless online training was reported in [xue2020real]. The authors use a logarithm function based on a look-up table and a float-to-fixed-point transform to simplify the calculations in a Naive Bayes classification algorithm; a novel logarithm look-up-table format and shift operations replace multiplication and division operations. Such systems incur the overhead of generating the logarithmic look-up table for most operations. The authors chose to calculate the logarithmic values offline and store them in memory, which adds memory accesses to every operation.

Fig. 5: (a) Scalar inner product expressed using the quadratic function for three different values of the second operand. (b) Scalar inner product approximation expressed using log-sum-exponential terms for three different values of the second operand. (c) Scatter plot of the inner product versus its log-sum-exponential approximation for 64-dimensional vectors with each input randomly varied between -1 and +1.

There have been various neural network implementations that eliminate the use of multipliers in their algorithms. In [chen2020addernet], Convolutional Neural Networks (CNNs) are optimized using a new similarity measure that takes the l1-norm distance between the filters and the input features as the output response. Even though this eliminates the multipliers, the implementation requires batch normalization after each convolution operation, which again uses multipliers, and it requires a full-precision gradient-descent training scheme to achieve reasonable accuracy.

Similarly, in [elhoushi2019deepshift], convolutional shifts and fully connected shifts are introduced to replace multiplications with bitwise shifts and sign flipping. The authors use powers of 2 to represent weights, biases, and activations and use compression logic to store these values, which adds significant compression and decompression overhead for an online system. In [you2020shiftaddnet], the authors leverage additions and logical bit shifts instead of multiplications to explicitly parameterize deep networks with only bit-shift and additive weight layers; this network has limitations in implementing activation functions with the shift-add technique. In all of these neural network systems, the training algorithms and activation implementations still involve multipliers, so these systems cannot be termed completely multiplierless.

Fig. 9: (a) Scalar inner product approximation expressed in the MP domain for three different values of the second operand. (b) Scatter plot of the inner product versus its MP approximation for 64-dimensional vectors with each input randomly varied between -1 and +1. (c) Variance in the error of the MP output due to the shift approximation becomes zero after 10 iterations.

In this work, we propose to use an MP approximation to implement a multiplierless kernel machine. MP-based approximation was first reported in [chakrabartty2004margin], and in [kucher2007energy] an MP-based SVM was reported for analog hardware. The main objective of this work is to build scalable digital hardware using an optimized MP approximation. Previous work on MP-based SVMs also relied on offline training. Our system is a one-of-a-kind digital hardware system that combines a multiplierless approximation technique with online training and inference on the same platform.

III Multiplierless Inner-Product Computation

In this section, we first describe a framework to approximate inner products using a quadratic function, which is then generalized to the proposed MP-based multiplierless architecture. Consider the following mathematical expression

(1)

where the mapping applied to the operands is a Lipschitz continuous function producing a scalar value, and the two operands are d-dimensional real vectors.

Corollary 1: If we choose the mapping to be a quadratic function with a constant scale factor, we get

(2)

Eq. (2) is an exact inner product between the two input vectors. Fig. 5(a) illustrates this process for two scalar operands using the one-dimensional quadratic function

(3)

However, implementing multiplication and inner products using this approach on digital hardware would require computing a quadratic function, which in turn requires a look-up table or other numerical techniques [sadeghian2016optimized]. Also, this approach does not account for the finite dynamic range when the operands are represented with fixed precision. While the effect of finite precision might not be evident for 16-bit operands, when the precision is reduced to 8 bits or lower the result saturates due to overflow or underflow. Next, we consider a form of the mapping that captures this saturation effect.

Corollary 2: Let the mapping be a log-sum-exponential (LSE) function defined over the elements of its argument; then, according to eq. (1), we get

(4)

The effect of eq. (4) can be visualized in Fig. 5(b) for two scalar operands and the one-dimensional LSE function

(5)

From Fig. 5(b), we see that for smaller values of the operands the LSE form approximates the multiplication operation, whereas for larger values it saturates. This effect also applies to general inner products of multi-dimensional vectors, as described by eq. (4). Fig. 5(c) shows the scatter plot comparing the values computed using eq. (2) with the log-sum-exponential approximation given by eq. (4), for randomly generated 64-dimensional vectors. The plot clearly shows that the two functions approximate each other, particularly for smaller magnitudes of the inner product. Like the quadratic function, implementing the LSE function on digital hardware would also require look-up tables. Note that other choices of the mapping could also lead to similar multiplierless approximations of the inner product. However, we are interested in a form that can be easily implemented using simple digital hardware primitives.
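Because eqs. (1)-(5) did not survive extraction, the following Python sketch is only our own self-contained illustration of the general construction (the decomposition f(x, y) = psi(x + y) - psi(x) - psi(y) + psi(0) and all function names are assumptions, not necessarily the paper's exact formulation): a quadratic choice of the mapping recovers the inner product exactly, while a saturating LSE-style choice approximates it for small operands and clips for large ones.

```python
import numpy as np

def psi_quad(w):
    # Quadratic choice: leads to the exact inner product (cf. Corollary 1).
    return 0.5 * np.sum(w ** 2)

def psi_lse(w, gamma=1.0):
    # A saturating LSE-style choice (cf. Corollary 2): behaves quadratically
    # near zero and linearly (i.e. it saturates) for large-magnitude elements.
    return gamma * np.sum(np.logaddexp(w / gamma, -w / gamma))

def approx_inner_product(x, y, psi):
    # f(x, y) = psi(x + y) - psi(x) - psi(y) + psi(0)
    return psi(x + y) - psi(x) - psi(y) + psi(np.zeros_like(x))

rng = np.random.default_rng(0)
x, y = rng.uniform(-1, 1, 64), rng.uniform(-1, 1, 64)
print(x @ y)                                 # exact inner product
print(approx_inner_product(x, y, psi_quad))  # identical up to rounding
print(approx_inner_product(x, y, psi_lse))   # close for small operands, saturates for large
```

With the quadratic choice the expression is exact because psi(x + y) - psi(x) - psi(y) equals the sum of the element-wise products; with the saturating choice each term is bounded, mimicking the fixed-precision overflow behaviour discussed above.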

Fig. 14: Impact of quantization and approximation on the inner product, normalized between 0 and 1, for 1000 pairs of 64-dimensional vectors randomly sampled between -1 and 1. Panels (a)-(d) correspond to decreasing bit precision.

III-A Margin Propagation Based Multiplierless Approximation

Margin propagation is an approximation technique that can be used to generate different forms of linear and nonlinear functions using thresholding operations [gu2012theory]. Here, we explain the steps used to derive this technique. We first express the equation

(6)

as a constraint

(7)

In the first-order MP approximation, the log-sum-exponential term is approximated using a simple piecewise linear model as

(8)

where a rectifying linear (thresholding) operation replaces the exponential and the result approximates the LSE output. Thus, the constraint in eq. (7) can be written as

(9)

Here, the MP output is computed as the solution of this equation and defines the MP function. In [gu2012theory], eq. (9) was generalized further to include a hyper-parameter γ as

(10)

Thus, the MP output can be viewed as a function of the input vector and γ, which constitutes the MP function. Note that all operations in eq. (10) require only unsigned arithmetic, and the thresholding operation is naturally implemented by register underflow. The MP function can therefore be readily and efficiently mapped onto low-precision digital hardware for computing inner products and multiplications. Based on eqs. (1) and (5), the approximation of a multiplication operation can be visualized for two scalar operands using a one-dimensional MP function as

(11)

Fig. 9(a) shows that the MP function computes a piecewise linear approximation of the LSE function and exhibits saturation due to register overflow/underflow.

When the mapping in eq. (1) is replaced by the MP function, an approximation to the inner product is obtained using only piecewise linear operations. Fig. 9(b) shows the scatter plot comparing the true inner product with the MP-based inner product; the approximation error is similar to that of the log-sum-exponential approximation in Fig. 5(c). Thus, the MP function serves as a good approximation for inner-product computation.

III-B Implementation of the MP Function on Digital Hardware

Given an input vector and a hyper-parameter γ, an algorithm to compute the MP function was presented in [gu2012theory]. The approach is based on iterative identification of the support set (the inputs that exceed the MP output) using the reverse water-filling approach. From eq. (10), we get the expression for the MP output as

(12)

where the denominator is the size of the support set. Since the support set is itself a function of the MP output, the expression in eq. (12) requires dividers.
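For reference, the following Python sketch (our own illustration; variable names are assumptions) computes this reverse-water-filling solution in floating point. It relies on sorting and a true division, which is exactly what the shift-based hardware formulation below avoids:

```python
import numpy as np

def mp_exact(x, gamma):
    """Solve sum_i max(x_i - z, 0) = gamma for z by reverse water-filling.

    The solution is z = (sum of the support-set elements - gamma) / |S|,
    where the support set S contains the elements larger than z (cf. eq. (12)).
    """
    xs = np.sort(np.asarray(x, dtype=float))[::-1]     # descending order
    ks = np.arange(1, xs.size + 1)
    z_cand = (np.cumsum(xs) - gamma) / ks              # candidate z for |S| = 1, 2, ...
    rho = np.nonzero(xs > z_cand)[0][-1]               # largest consistent support size
    return z_cand[rho]

x = np.array([0.9, 0.2, -0.4, 0.7])
z = mp_exact(x, gamma=0.5)
print(z, np.maximum(x - z, 0).sum())                   # the second value equals gamma
```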

We now report an implementation of the MP function that uses only basic digital hardware primitives like addition, subtraction, shift, and comparison operations. The implementation poses the constraint in eq. (10) as a root-finding problem for the MP output. Then, applying the Newton-Raphson recursion, the MP function can be iteratively computed as

(13)

The Newton-Raphson step can be made hardware-friendly by quantizing the size of the support set to its nearest upper bound that is a power of two, or by choosing a fixed power of two, as shown in Algorithm 1. This is implemented in hardware using a priority encoder that detects the highest bit position whose value is 1; the division in eq. (13) can then be replaced by a right-shift operation. The resulting approximation error can be readily tolerated because each Newton-Raphson step corrects any approximation error left by the previous step. Note that the Newton-Raphson iteration converges quickly to the desired solution, and hence only a limited number of iterations are required in practice. Fig. 9(c) shows several examples of the MP function converging to its final value within 10 Newton-Raphson iterations for a 100-dimensional input. Thus, in Algorithm 1, which we use to compute the MP function, we limit the number of Newton-Raphson iterations to 10.

Inner product bit width (64 dimensions) | Input bit width (Quantized) | Input bit width (MP-Quantized)
16-bit | 5-bit | 9-bit
14-bit | 4-bit | 7-bit
12-bit | 3-bit | 5-bit
10-bit | 2-bit | 3-bit
TABLE I: Bit-width comparison for the two types of quantization used in Fig. 14. For the same inner-product output bit width, the MP form admits higher-precision inputs, resulting in better outputs.

In order to make any algorithm hardware-friendly, the classic approach is to express it with the minimum possible bit-width precision while incurring minimal loss in functionality. This helps reduce area and power when implementing digital hardware. Fig. 14 shows the impact of reducing the bit precision on the approximation of the inner product for 64-dimensional input vectors sampled between -1 and 1. The variance of the MP approximation relative to the true inner product, which can be termed the MP approximation error, persists even at higher bit precision when compared against standard quantization of the inner product (Fig. 14(a) and 14(b)). We use the online learning approach detailed in Section V for the MP approximation in our system, with minimal hardware increase, to mitigate this approximation error. However, as we reduce the bit precision further, the quantization error increases and overlaps the MP approximation error (Fig. 14(c) and 14(d)). As Table I shows, for an n-bit inner-product output, the standard quantized version admits only an (n-6)/2-bit input vector, whereas the quantized MP inner product admits an (n-7)-bit input vector, because it accumulates additions instead of multiplications; the 6 bits account for accumulating over the 64-dimensional input vector. This shows that MP retains better input precision for the same output bit width, leading to a better output approximation, and this residual MP approximation error can likewise be mitigated by the same online learning. Thus, despite quantization, the MP approximation performs better than or on par with the standard quantized approach as the output bit precision is reduced, while using less hardware for the same bit widths, since it relies on simple adders instead of multipliers. Hence, MP proves to be a hardware-friendly approximation even when the algorithm is quantized.
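The bit-width relationship behind Table I can be checked with a short script (our own reading of the table; the extra guard bit on the MP path is inferred from the table's pattern rather than stated in the text):

```python
import math

def output_bits_standard(input_bits, dim=64):
    # Accumulating dim products of two input_bits-wide operands:
    # each product needs 2*input_bits bits, the sum adds log2(dim) bits.
    return 2 * input_bits + int(math.log2(dim))

def output_bits_mp(input_bits, dim=64):
    # The MP path only adds operands, so the accumulator grows by
    # log2(dim) bits plus one guard bit (matches the pattern of Table I).
    return input_bits + int(math.log2(dim)) + 1

for out_bits, std_in, mp_in in [(16, 5, 9), (14, 4, 7), (12, 3, 5), (10, 2, 3)]:
    assert output_bits_standard(std_in) == out_bits
    assert output_bits_mp(mp_in) == out_bits
print("Table I bit-width pattern verified")
```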

Input: x (N-element input vector), γ
Output: z
Initialize z = 0 ;
  // Initialize the MP output to zero
unsigned int shift ;
for k = 1 to 10 do
       acc = 0 and count = 0 ;
       for i = 1 to N do
             d = x[i] - z ;
             if d > 0 then
                   acc = acc + d ;
                   count = count + 1 ;
             end if
       end for
       acc = acc - γ ;
       shift = PriorityEncode(count) ;
       z = z + (acc >> shift) ;
end for
Algorithm 1 Newton-Raphson-based MP formulation. 2^shift is the approximated value of the variable count, represented as its nearest upper bound that is a power of 2.
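Because several symbols in Algorithm 1 were lost in extraction, the following Python model (our reconstruction; fixed-point registers are replaced by floats for readability) captures its structure: serial accumulation of the rectified differences, a power-of-two approximation of the support count obtained from a priority encoder, and a right shift in place of the division.

```python
def mp_newton_raphson(x, gamma, n_iter=10):
    """Shift-based Newton-Raphson computation of the MP function.

    The update z <- z + (sum_{i in S}(x_i - z) - gamma) / |S| is made
    multiplierless by rounding |S| up to the nearest power of two and
    replacing the division with a right shift by that exponent.
    """
    z = 0.0                                  # MP output, initialized to zero
    for _ in range(n_iter):                  # 10 iterations suffice in practice (Fig. 9(c))
        acc, count = -gamma, 0
        for xi in x:                         # inputs arrive serially, one per clock cycle
            d = xi - z
            if d > 0:                        # sign bit detects positive terms
                acc += d                     # accumulate the rectified difference
                count += 1                   # count the support-set size
        if count == 0:
            break
        shift = (count - 1).bit_length()     # priority encoder: 2**shift >= count
        z += acc / (1 << shift)              # right shift instead of division
    return z

print(mp_newton_raphson([0.9, 0.2, -0.4, 0.7], gamma=0.5))   # ~0.55, matching mp_exact above
```

In this model the shift-based step is slightly damped compared with a true Newton-Raphson update, which is consistent with the text's observation that each iteration corrects the error left by the previous one.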

Fig. 15: MP kernel machine architecture. Kernel MP is based on eq. (25) and the other MP functions are described from eq. (20) to (22).

III-C Energy Cost of MP-Based Computing

Let E_MVM, E_mult, and E_add denote the total energy cost of an MVM operation, a single multiplication, and a single addition, respectively. Then, the MVM of a vector and a matrix would incur an energy cost of

(14)

For an MP-based approximation of the MVM, the energy cost incurred is

(15)

Here, the sparsity factor, typically less than 1, is determined by γ, and the energy cost of a comparison operation has the same order of complexity as an addition. Also, note that the number of Newton-Raphson iterations is 10 in our case. Thus, as the number of inputs increases, the complexity of the MP approximation reduces further compared to that of the MVM. A multiplication requires an array of full adders whose size grows quadratically with the word length [vittoz1990future], whereas an addition has linear complexity. Hence, the MP approximation technique is well suited to digital systems, as adder logic is less complex and uses fewer resources than the equivalent multiplier.

IV MP Kernel Machine Inference

We now use the MP-based inner-product approximation to design an MP kernel machine. For an input vector, the decision function of a kernel machine [cristianini2000introduction] is given as

(16)

where the kernel is a function of the input vector and the stored vectors, and the trained weight vector and bias scale and offset the kernel outputs, respectively. Since the MP approximation, as shown in eq. (11), operates in a differential format, we express each variable as the difference of two non-negative components.

(17)

Using eq. (2) and applying the MP approximation based on eq. (11), we can express eq. (17) as

(18)

Fig. 15 describes the kernel machine architecture using the MP approximation. The input is provided to a difference unit to generate its differential component vectors. The kernel MP block generates the kernel output, with inputs formed from a combination of these components depending on the kernel used; in our case, we use the kernel described in Section IV-A. The kernel output is likewise split into differential components with the help of the difference unit. The weights and bias, generated as described in Section V, are then combined with the kernel components to generate the MP approximation output as follows. We can express eq. (18) as,

(19)

where,

(20)
(21)

where γ is a hyper-parameter that is learned using the gamma annealing described in Algorithm 2. For better stability of the system, we normalize the two differential outputs using MP,

(22)

Here, a separate hyper-parameter, fixed in this design, is used for the normalization. The output of the system can be expressed in differential form,

(23)

Here, the output is the difference of its two rectified components. Since the normalization term acts as the normalizing factor for both components, we can estimate the output for each class using the reverse water-filling algorithm [gu2012theory] applied to the MP function in eq. (22),

(24)
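Several of eqs. (16)-(24) were lost in extraction, so the following sketch only illustrates the differential, MP-normalized structure described above and in Fig. 15 (the composition of the weight, kernel, and bias terms follows our reading of the Stage 2 description in Section VI; all names are assumptions):

```python
import numpy as np

def mp(x, gamma):
    # Reverse-water-filling reference for z such that sum_i max(x_i - z, 0) = gamma.
    xs = np.sort(np.asarray(x, dtype=float))[::-1]
    z_cand = (np.cumsum(xs) - gamma) / np.arange(1, xs.size + 1)
    return z_cand[np.nonzero(xs > z_cand)[0][-1]]

def mp_inference(K, w_pos, w_neg, b_pos, b_neg, gamma, gamma_norm):
    """Differential MP kernel-machine forward pass (our reading of Fig. 15/16).

    K             : kernel vector from the kernel-MP stage (one value per stored vector)
    w_pos, w_neg  : differential weight vectors;  b_pos, b_neg : differential biases
    """
    z_pos = mp(np.append(w_pos + K, b_pos), gamma)       # MMP0, cf. eq. (20)
    z_neg = mp(np.append(w_neg + K, b_neg), gamma)       # MMP1, cf. eq. (21)
    z_norm = mp(np.array([z_pos, z_neg]), gamma_norm)    # MMP2 normalization, cf. eq. (22)
    y_pos = max(z_pos - z_norm, 0.0)                     # reverse water-filling outputs, eq. (24)
    y_neg = max(z_neg - z_norm, 0.0)
    return y_pos - y_neg                                 # differential decision value, eq. (23)
```

In this reading, the two rectified outputs always sum to the normalization hyper-parameter, so the decision value is bounded by it, which keeps the fixed-point representation of the subsequent absolute-error cost small.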

IV-A MP Kernel Function

We use a similarity measure function between the input vector and each stored vector, with properties similar to a Gaussian function, to define the kernel MP approximation as

(25)

where a separate MP hyper-parameter is used for the kernel. The kernel function is derived in detail in Appendix A.

The complexity of the kernel machine, based on eqs. (14) and (16), can be expressed as

(26)

and, similarly, the complexity of the kernel machine in the MP domain, based on eq. (15), can be expressed as

(27)

The complexity equations show that the complexity of the MP kernel machine is a fraction of that of a traditional SVM. This can be leveraged to reduce power and hardware resources and to increase the speed of operation of the MP kernel machine relative to a traditional SVM.

V Online Training of the MP Kernel Machine

The training of our system requires the cost calculation and parameter updates to be performed over multiple iterations using a gradient-descent approach, which is described below.

Consider a two-class problem; the cost function can be written as

(28)

where the sum runs over the number of input samples, the targets are the class labels, and the corresponding predicted values are those produced by eq. (23). We selected the absolute cost function because it is easier to implement in hardware: it requires fewer bits to represent than the squared-error function.
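In its simplest scalar form (ignoring the differential bookkeeping of eq. (23), and with hypothetical names), the absolute cost and the per-sample error signs that drive the updates below can be sketched as:

```python
def absolute_cost_and_error_signs(y_true, y_pred):
    """Scalar sketch of the absolute cost in eq. (28).

    The derivative of |e| is sign(e), which is why the hardware only needs
    sign bits and indicator functions for eqs. (29)-(32) rather than multipliers.
    """
    errors = [t - p for t, p in zip(y_true, y_pred)]
    cost = sum(abs(e) for e in errors) / len(errors)
    signs = [(e > 0) - (e < 0) for e in errors]   # -1, 0, or +1 per sample
    return cost, signs

print(absolute_cost_and_error_signs([1, -1, 1], [0.4, -0.9, -0.2]))
```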

Fig. 16: High-level block diagram of MP kernel machine. The proposed design shares hardware resources during both the training and inference phase. The red blocks are used only during training.

The weights and biases are updated during each iteration using gradient descent optimization,

(29)
(30)
(31)
(32)

where the leading factor is the learning rate and the remaining terms are intermediate quantities computed from the MP outputs; each gradient term also involves an indicator function. An indicator function defined on a set indicates membership of an element in a subset of that set, taking the value 1 for all elements in the subset and 0 for all elements not in it. The gradient-descent steps are derived in detail in Appendix B.

Input: C(t), C(t-1), γ, empirical constants
Output: γ
Initialize γ ;
  // Initialize to a value based on the input for the MP function
for t = 1 to T do
       if the change in cost C(t) - C(t-1) crosses the empirical threshold then
             anneal γ by the empirical step ;
       end if
end for
Algorithm 2 Gamma annealing. C(t) is the value of the cost function (28) estimated at iteration t. The threshold and the annealing step are empirical values based on the input dataset, and T is the number of iterations of the MP gradient update.

The hyper-parameter γ in eq. (12) impacts the output of the MP function and hence can be used as a hyper-parameter in the learning algorithm for the second layer of Fig. 15. Its value is adjusted in each iteration based on the change in the cost between two consecutive iterations, as described in Algorithm 2.
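A minimal sketch of one plausible reading of this rule (the exact condition and constants in Algorithm 2 were garbled in extraction; per the text they are empirical and dataset-dependent): whenever the cost fails to improve between consecutive iterations, γ is shrunk by a power-of-two fraction so that the adjustment itself remains multiplierless.

```python
def gamma_anneal(gamma, cost_curr, cost_prev, step_shift=3, gamma_min=1e-3):
    # Hypothetical reading of Algorithm 2: shrink gamma by gamma >> step_shift
    # (a power-of-two fraction, i.e. shift-only hardware) when the cost did not
    # decrease; step_shift and gamma_min stand in for the empirical constants.
    if cost_curr >= cost_prev and gamma > gamma_min:
        gamma -= gamma / (1 << step_shift)
    return gamma
```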

Since the gradient is obtained using the MP approximation technique, the training of the system is more tuned to minimizing the error rather than mitigating the approximation itself and hence improves the overall accuracy of the system.

As these equations show, only basic primitives such as comparators, shift operators, counters, and adders are required to implement this training algorithm, making it very hardware-friendly.

VI FPGA Implementation of the MP Kernel Machine

The FPGA implementation of the proposed multiplierless kernel machine is described in this section. It is a straightforward hardware implementation of the MP-based mathematical formulation proposed in Section IV. The salient features of the proposed design are as follows:

  • The proposed architecture is designed to support 32 input features and a total of 256 kernels. It can easily be modified to support any number of features and kernel sizes by increasing or decreasing the MP processing blocks.

  • The design parameters are set to support 32 input features and 256 stored vectors, both represented in differential form.

  • In this design, the width of the data path is set to 12-bits.

  • The proposed design includes both inference and training, does not use any complex arithmetic modules such as multipliers or dividers, and uses only adder, subtractor, and shifter modules.

  • The resource sharing between the training and inference modules saves a significant amount of hardware resources. The weight-vector calculation blocks (also operating in differential form), shown in red in Fig. 16, are the only additional blocks required for training.

Fig. 17: Architecture of an MP processing block.

Fig. 18: Architecture for cost function calculation and gamma annealing.

Fig. 19: Architecture of the weights and bias update module. The final weight and bias values are stored in the corresponding memory blocks (MEM0 and MEM1) and registers, respectively. The parameters shown in red are calculated using the adder and combinational shifter modules, and the values shown in blue are provided by the MP processing block of Fig. 17.

VI-A Description of the Proposed Design

The high-level block diagram of the proposed MP kernel machine, which supports both online training and inference, is shown in Fig. 16. The architecture has four stages. Stage 0 is a memory management unit for storing input samples from the external world into the input memory blocks. The MP-based kernel function computation takes place in Stage 1. Stage 2 provides a forward pass that generates the outputs needed for inference and training based on the kernel function output obtained from Stage 1. Stage 3 is used only for online training and calculates the weight vectors and bias values. All the stages execute in parallel after receiving the required data and control signals.

Stage 0 (Accessing inputs from the external world): The differential input features of a sample are stored in Block RAM (BRAM), either IPMEM0 or IPMEM1. These BRAMs act as a ping-pong buffer: while an incoming sample is being written into IPMEM0, the kernel computation is carried out on the previously acquired sample in IPMEM1, and vice versa. Each input memory block is sized to hold one complete input sample.

Stage 1 (Kernel function computation): The architectural design of an MP processing block, a straightforward implementation of Algorithm 1, is shown in Fig. 17. The inputs appear at the MP unit serially, i.e., one input per clock cycle, and the unit computes the rectified differences. Positive differences are accumulated in one register, while the number of positive terms is counted in another; the sign bit of the difference is used to detect positive terms, and a bit-wise OR of the difference is used to discard terms whose value is zero. After all the inputs have been accessed, the count value is approximated to its nearest upper bound that is a power of two. This is implemented using a simple priority encoder followed by an incrementer, where the priority encoder checks the highest bit location whose bit value is 1. The combinational shifter then right-shifts the accumulated value by the corresponding number of bits, and the MP output is updated with the sum of its previous value and the shifter output. This process iterates 10 times to produce the final MP value. The high-level architecture for computing the MP kernel function represented in eq. (25) is shown in Fig. 16. Here, 64 MP processing blocks (KMP0-63) execute in parallel and are reused 4 times to generate a kernel vector of dimension 256. Each MP processing block is associated with a dual-port BRAM named SVMEM that stores 4 support vectors: SVMEM0 for KMP0 stores the support vectors with indices 0, 64, 128, and 192; SVMEM1 stores the vectors with indices 1, 65, 129, and 193; and the rest of the support vectors are stored in the respective SVMEMs in the same manner. An MP processing block receives its inputs serially, as required by eq. (25), and generates its output after 10 iterations. All 64 MP processing blocks receive inputs and generate outputs simultaneously. The generated kernel vector is stored in either KMEM0 or KMEM1 based on a select signal (the two memories work as a ping-pong buffer).

Hardware Comparison | Kyrkou et al. [kyrkou2011parallel] | Kyrkou et al. [kyrkou2013embedded] | Jiang et al. [jiang2017fpga] | Mandal et al. [mandal2014implementation] | This Work
FPGA | Virtex 5-LX110T | Virtex 5-LX110T | Zynq 7000 | Virtex 7vx485tffg1157 | Artix-7 xc7a25tcpg238-3
Operating Frequency | 100 MHz | 84 MHz | NA+ | NA+ | 62 MHz (Max. 200 MHz)*
DSPs | 40 | 59 | 152 | 515 | 0
FFs | 23220 | 13038 | 21305 | 19023 | 9735 (Max. 9788)*
LUTs | 57296 | 31854 | 14028 | 19023 | 9535 (Max. 9874)*
BRAMs (36 Kb) | 83 | 131 | 106 | NA+ | 35
Kernel Size | 80x4 | NA+ | 1024x1024 | NA+ | 256x32

  • + These works did not report the corresponding values for their designs.

  • * These values correspond to design changes for operating at a frequency of 200 MHz.

TABLE II: Comparison of Architecture and Resource Utilization of Related Work.

Stage 2 (Merging of the kernel function outputs): In this stage, three MP processing blocks (MMP0-2) are arranged in a layered manner. In the first layer, two MP processing blocks, MMP0 and MMP1, execute simultaneously; in the second layer, MMP2 starts processing after receiving the outputs of the first layer. MMP0 implements eq. (20): its inputs, the element-wise sums of the corresponding weight and kernel terms followed by the bias term, arrive serially, and it generates one of the two differential outputs. Similarly, MMP1 takes the complementary weight-plus-kernel sums and bias as inputs and produces the other differential output according to eq. (21). In this stage, the kernel vector is accessed from either KMEM0 or KMEM1 based on the select signal, and the weight vectors are accessed from MEM0 and MEM1, respectively.

The two outputs generated by the first layer are provided as inputs to MMP2, along with the normalization hyper-parameter. This module generates the two rectified outputs according to eq. (24), and the final prediction value is computed based on eq. (23). The corresponding register contents produced in the final round are indicated in Fig. 17.

Stage 3 (Weights and bias update module): This stage executes only during the training cycle. The detailed hardware architecture for updating the weights and bias is shown in Fig. 19. The proposed design performs error-gradient updates followed by weight and bias updates at the end of each iteration according to eqs. (29)-(32). The hardware resources of Stages 0, 1, and 2 are shared by both the training and inference cycles. This stage updates the weight vectors and bias values based on the parameters generated in Stage 2. During the training phase, the outputs generated by MMP0, MMP1, and MMP2 are stored in three separate registers. After that, two intermediate parameters are calculated using the adder and combinational shifter modules; these precomputed values are used to update the weights and bias values according to eqs. (29)-(32). The sign bits originating from MMP0 and MMP1 at the final round are passed through two separate 8-bit serial-in parallel-out (s2p) registers to form 8-bit words, which are stored in two separate BRAMs (MEM2 and MEM3, respectively); the memory contents are accessed through an 8-bit parallel-in serial-out (p2s) register during processing. Similarly, the sign bits from MMP2 are stored in two separate registers. The sign bits are used to implement the indicator functions that appear in eqs. (29)-(32). The architecture that calculates both the cost function and the gamma annealing is shown in Fig. 18. It works in two phases: in the first phase, the cost function is calculated according to eq. (28), and in the second phase, gamma annealing is performed as per Algorithm 2 of Section V. A control signal initializes the gamma register, whose content is updated in each iteration according to Algorithm 2. In the first phase, the module also calculates two intermediate error terms and stores them in two registers, which are used as inputs to the weights and bias update module.

The datapath for updating the error gradients as well as the weights and bias is shown in Fig. 19. The accumulated weight gradients are stored in the MEM4 and MEM5 BRAMs (256 x 9-bit each), and the accumulated bias gradients are stored in registers. At the beginning of each iteration, these are initialized to zero. The error gradients are accumulated for each sample, and the respective storage is updated. The mathematical formulation of the error-gradient update is explained in the Appendix.

In Fig. 19, the MUXes select the appropriate inputs for the adder-subtractor modules, and dedicated control signals select the appropriate operations; one of these signals is generated by delaying another by one clock cycle. When the corresponding select bit is 0, the datapath accesses and updates MEM4 and MEM5 alternately in each clock cycle; when it goes high, the registers are updated instead. While processing the last sample of an iteration, an active-high signal triggers the update of MEM0 and MEM1 with the new weights and of the registers with the new bias values. Here, the learning rate is expressed as a power of 2 and can therefore be applied using a combinational shifter module.

VII Results and Discussion

The proposed design has been implemented on an Artix-7 (xc7a25tcpg238-3) FPGA, a low-power, highly efficient device manufactured in a 28 nm technology node. Artix-7 family devices are used extensively in edge applications such as automotive sensor processing, which makes this device ideal for showcasing our design. Our design can run at a maximum frequency of 200 MHz and supports both inference and on-chip training. It consumes about 9874 LUTs and 9788 FFs, along with 35 BRAMs (36 Kb), and does not utilize any complex arithmetic module such as a multiplier. Table III summarizes the implementation results. For processing one sample, the proposed MP kernel machine consumes 8024 clock cycles to compute the 256-dimensional kernel vector (Stage 1). Stage 2 consumes 5256 and 5710 clock cycles during inference and learning, respectively. Stage 3, which is active only during learning, consumes 524 clock cycles.

MP Kernel Machine Summary (Training and Inference)
FFs | 9788 (of 29200 total)
LUTs | 9874 (of 14600 total)
BRAMs (36 Kb) | 35 (of 45 total)
Maximum Frequency | 200 MHz
TABLE III: FPGA Implementation Summary for MP Kernel Machine Training and Inference.
Hardware Resources | Traditional SVM (Inference) | MP Kernel Machine (Training and Inference) | MP Kernel Machine (Training and Inference)
FPGA | xc7z020-1clg400c | xc7z020-1clg400c | xc7a25tcpg238-3
FFs | 6148 | 9734 | 9735
LUTs | 18141 | 9572 | 9535
BRAMs (36 Kb) | 46 | 35 | 35
DSPs | 192 | 0 | 0
Dynamic Power | 320 mW | 107 mW | 106 mW
TABLE IV: Comparison of resource utilization of traditional SVM inference and the MP kernel machine (training and inference) at an operating frequency of 62 MHz.

We compare our system with similar SVM systems from the literature. From Table II, it is clear that our system requires fewer resources while consuming the least power, 106 mW, among comparable SVM systems, despite including online training capability. A traditional SVM inference algorithm was also implemented on a PYNQ board (xc7z020clg400-1), manufactured in the same 28 nm technology node, to compare resource utilization and power consumption with the proposed MP kernel machine. The hardware design for the traditional SVM algorithm consumed more resources, and as a result we were unable to fit it into the same FPGA part used for the kernel machine design. To obtain a fair comparison, we therefore also implemented our design on the same PYNQ board with identical design parameters. The maximum operating frequency of the traditional SVM design is limited to 62 MHz. Table IV compares the power and resource usage of our system, which includes both training and inference, against this traditional SVM implementation containing only an inference engine. Our MP kernel machine consumes a fraction of the power and about half of the LUTs (even including the training engine) compared with the traditional SVM, and because of the multiplierless design it consumes no DSPs, compared with 192 DSPs in the traditional SVM design. The small footprint and low power required by edge devices are thus provided by our system.

Datasets | Full Precision: Traditional SVM (Train / Test) | Full Precision: MP SVM (Train / Test) | Fixed Point (12-bit): MP SVM (Train / Test)
Occupancy Detection | 99.2 / 94.6 | 98 / 94 | 97.9 / 93.8
FSDD Jackson | 99 / 96.8 | 97.5 / 96 | 97 / 96
AReM Bending | 97 / 95.2 | 96 / 94.1 | 96 / 93.2
AReM Lying | 97.4 / 95.5 | 96 / 94.7 | 95.9 / 94
TABLE V: Accuracy in % for the UCI datasets.

Fig. 20: Bit precision vs. Accuracy for Datasets

We used datasets from different domains to benchmark our system for classification. We used the occupancy detection dataset from the UCI ML repository [Dua:2019][candanedo2016accurate], which detects whether a particular room is occupied based on different sensors such as light, temperature, humidity, and CO2. We also verified our system on the Activity Recognition (AReM) dataset [palumbo2016human], using a one-versus-all approach for binary classification; we chose two of the activities, bending and lying, as the positive cases to verify the classification capability of our system. The Free Spoken Digits Dataset (FSDD) was used to showcase the capability of our system for speech applications [jackson2016free]. Here, we used the dataset for a speaker verification task, identifying a speaker named Jackson among 3 other speakers, with Mel-Frequency Cepstral Coefficients (MFCCs) used as the extracted features for this classification.

The results were simulated in MATLAB. Since our hardware currently supports 256 input samples, we truncated the datasets to 256 inputs: we randomly chose 256 training and test samples from the original datasets and repeated the classification over multiple iterations to obtain the results. A MATLAB model of the proposed architecture was developed to determine the datapath precision. As Fig. 20 shows, the accuracy for each dataset remains more or less constant as we reduce the bit-width precision down to 12 bits; below 12 bits, accuracy starts degrading due to quantization error. Hence, we used 12-bit precision for the datapath. The accuracy degradation between the full-precision MATLAB model and the fixed-point (12-bit) RTL version was minimal, as shown in Table V. We also compared the results of the traditional SVM and the MP-based SVM. Despite being an approximation, our system achieves accuracy comparable to traditional SVM systems. These results demonstrate the system's capability in classification tasks, and with minor hardware changes it can adapt to other applications.

VIII Conclusion

In this paper, we presented a novel algorithm for classification on edge devices using the MP function approximation. The algorithm is hardware-friendly since both training and inference can be implemented using basic primitives like adders, comparators, counters, and threshold operations. The unique training formulation for kernel machines is lightweight and hence enables online training, and the same hardware is reused for training and inference, making the system highly efficient. The system is also highly scalable without requiring significant hardware changes. In comparison to traditional SVMs, we achieve low power and computational resource usage, making the design ideal for edge deployment, and the multiplierless algorithm improves the speed of operation compared with traditional SVMs. Such edge devices can be deployed in remote locations for surveillance and medical applications where human intervention may be minimal. The system can also be fabricated as an Application-Specific Integrated Circuit to make it more power- and area-efficient. Moreover, the algorithm can be extended to more complex ML systems, such as deep learning, to leverage its low hardware footprint.

Appendix

A Kernel Derivation

The similarity measure function used as the kernel can be described as,

(33)

Here, the input and stored vectors are expressed through their differential components. We use these difference vectors to derive the kernel in the MP domain; using them, we can write,

(34)

By imposing non-negativity constraints on the differential components of the input vector, and similarly for the stored vector, we can show,

(35)

Adding eq.(34) and (35), we get,

(36)

Thus, using eq.(33), we can write,

(37)

Using eq.(2), and applying MP approximation based on eq.(11), we can express eq.(37) as,

(38)

where a separate MP hyper-parameter is used for the kernel.

B Partial Derivatives

Taking the partial derivatives of eq. (28) with respect to the weights and the biases, we get,

(39)
(40)
(41)
(42)

Taking the partial derivative of eq. (12) with respect to an input element,

(43)

where the indicator function selects the elements belonging to the support set.
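Eq. (43) itself did not survive extraction. Assuming it is the derivative of the reverse-water-filling expression in eq. (12) with respect to a single input element, and treating the support set as locally constant, a consistent reconstruction is:

```latex
\frac{\partial z}{\partial x_i}
  = \frac{\partial}{\partial x_i}\left(\frac{\sum_{j\in S} x_j - \gamma}{|S|}\right)
  = \frac{\mathbb{1}[\, i \in S \,]}{|S|},
\qquad S = \{\, j : x_j > z \,\}.
```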