I Introduction
Internet-of-Things (IoT) systems such as unattended ground sensors (UGS) [11], intruder detection systems [16][39], wildlife trackers [47], and structural health monitoring systems [4] generally operate in remote locations. They have to be active at all times to ensure that they can detect events of interest. In most cases, the events of interest are infrequent or rare. As a result, most of these IoT systems use an embedded pattern classifier to relax data storage and wireless transmission requirements [5]. An example of such a system is illustrated in Fig. 1. Here the system wirelessly transmits alerts only when it detects signatures (acoustic or visual) of a target (for example, a wildlife species).
Based on this selective transmission, the IoT platform can conserve significant battery power and hence prolong its operational life. However, the key challenge in designing such IoT systems is that the integrated classifier needs to be robust and highly energy-efficient. While deep neural network (DNN) based classification systems can achieve very high accuracy [15][31], they have certain limitations when applied to IoT systems for rare-event detection. First, by the nature of the problem, the training data corresponding to the rare event is sparse and might not be suitable for DNNs. Second, even if it were possible to train a DNN for deployment, a compressed or quantized variant, like a Binary Neural Network (BNN) [41], has to be used to optimize computational resources. Retraining BNNs on the IoT platform to account for data and hardware drifts is challenging due to quantization effects, and full-precision training is not possible due to limited computational resources. Also, even if the parameters of the DNN can be quantized, the input features cannot be significantly quantized without affecting the classification accuracy. K-Nearest Neighbour (KNN) [44][29] and support vector machines (SVMs) [36][25], on the other hand, have been shown to generalize well with sparse training data [26]. Of the two, the SVM is more robust to outliers, and the convexity of SVM training ensures that any recalibration is interpretable and stable. There have been many instances where SVMs have performed well as acoustic classifiers [37][46]. In the literature, several approaches have been proposed to reduce the computational and memory footprint of SVMs [23][7][3][30]. However, in these platforms, feature computation and classification are generally treated independently, both during training and inference.
In this paper, we present an in-filter computing framework that exploits the computing and nonlinear primitives in the feature extraction process to design ultralight IoT acoustic classifiers. The approach is motivated by the fact that acoustic front-ends like the neuromorphic cochlea [43] can be designed to be highly computationally efficient using different degrees of linear and nonlinear transformations. Our goal is to systematically exploit and map these nonlinear transformations into the kernel functions used in SVMs, such that both classification and feature extraction are co-optimized for training and inference. This results in a template-based SVM [20][33] architecture that has an ultra-low computational footprint for inference and training. This feature not only relaxes communication bandwidth requirements on the IoT system but also allows recalibration (retraining) to account for statistical drifts. The main advantage of using a template-based SVM is the ability of the framework to use arbitrary functions without any restriction on their properties, unlike a traditional SVM, which requires positive-definite kernels. This allows us to use hardware-friendly mappings or functions that need not be specified in closed form, such as functions defined by an ordinary differential equation (ODE). This property is especially beneficial for hardware implementation, where the inherent nonlinearity of the device can be used as a kernel rather than engineering a specific nonlinearity. As a proof of concept, we have applied the in-filter computing framework using an acoustic feature extractor based on the Cascade of Asymmetric Resonators with Inner Hair Cells (CAR-IHC) [43][21][22]. The CAR-IHC model exhibits inherent nonlinearity and hence performs well as a kernel for classification. We believe that our proposed framework has the following key advantages:
A template-based SVM architecture that allows an arbitrary function to be used as a kernel, unlike a conventional SVM that requires a positive-definite kernel.

Combining the feature extraction and the SVM kernel into one function makes the system ultralight and computationally efficient.

The memory footprint of the proposed system is user-defined and can be specified based on the IoT hardware constraints.

A novel fast training algorithm with reduced memory and computational complexity.

A time-multiplexing approach that lets the system scale to more complex tasks without significant hardware changes.
As a proof-of-concept IoT implementation, we have implemented this inference framework on a Xilinx Spartan 7 series FPGA [42], a low-cost and low-power device. We have validated our architecture on auditory datasets, namely an environmental sound dataset [28] and a speech-based dataset [17].
The rest of this paper is organized as follows. Section II provides a brief discussion of related work. Section III presents the modified template-based SVM algorithm and explains the uniqueness of the formulation. Section IV explains the novel training algorithm used for our framework. Section V provides the FPGA implementation details. Section VI provides results obtained with audio-based datasets for detection and surveillance applications. Section VII concludes this paper, outlines some useful applications, and discusses possible future work using this framework.
II Related Work
Hardware implementations of SVMs on FPGAs have been reported over the years, targeting high accuracy with the least possible area and power. Binary and even multiclass classification using a Modified One-against-all (MOAA) approach for SVMs has been implemented on FPGAs [27][1]. Since the kernel is one of the most important parts of the SVM algorithm, the kernel function consumes the most resources in an implementation. This is demonstrated in [23] with linear and nonlinear SVM implementations on an FPGA. The authors show that nonlinear kernel implementations use more resources than linear kernels. At the same time, accuracy drops by more than 10% when using a linear kernel compared to a nonlinear kernel. The authors implement a kernel with parallel inputs, enabling a high operating frequency but at the cost of high resource utilization in terms of LUTs and DSPs. This shows that good classification requires a nonlinear kernel, while an ultralight implementation also requires hardware efficiency.
Regarding acoustic feature extraction, acoustic signals require a certain amount of preprocessing to extract the salient features before they are used for classification. One such FPGA-based approach is detailed in [7]. The authors use Discrete Wavelet Transforms (DWTs) for feature extraction from a given audio signal. The DWT features form the input to a standard SVM with a Radial Basis Function (RBF) kernel, which is nonlinear. This classification system is used for phoneme recognition using data from the TIMIT dataset. Due to hardware constraints and the complexity of the DWT algorithm, the authors chose to implement only the SVM classifier on the FPGA. The acoustic signals are preprocessed using a software implementation of the DWT algorithm and are provided as inputs to the SVM hardware. This implementation has the disadvantage of offline software feature extraction, making the hardware incapable of using unprocessed acoustic signals as inputs. At the same time, the SVM hardware implementation consumes a large number of FPGA resources in terms of LUTs and DSPs. Also, the weights and support vectors from the SVM training are stored in external ROMs. This makes the implementation impractical for a small IoT-based edge device.
Furthermore, time-series data need not always be a speech signal, and there may be cases where we need to classify non-auditory time-series signals. In [3], the authors use the Mel-frequency cepstral coefficients (MFCC) technique to extract salient features from pulmonary sounds and detect wheezing using standard SVM classification. Here, both the MFCC extractor and the SVM were implemented on an FPGA. This implementation provided an end-to-end hardware solution that could distinguish a normal from an abnormal pulmonary sound. However, MFCC itself is a resource-heavy algorithm, and additional hardware is required to implement the SVM classifier. Also, ROMs store the support vectors and weights, along with additional registers to store the MFCC coefficients. The MFCC coefficient calculations, which are done in hardware, also contribute to high DSP usage. The authors demonstrated their hardware using only a 6 kHz input sampling frequency, which limits the range of signals that it can process. Hence, such a system cannot be used in an IoT edge device due to its high resource utilization and rigidity.
Another representative example of an IoT system for acoustic classification is a speaker identification system used in security systems. One such system was realized on FPGAs in [30]. Similar to the implementation in [3], the authors implemented an SVM classifier with MFCC as the feature extractor in hardware. The input data was sampled at 8 kHz, which made the design resource-efficient but less flexible in terms of processing signals with a higher sampling frequency. External SRAM was used to store the MFCC coefficients and training parameters. Despite a slight improvement in hardware efficiency compared to the previous implementation, this implementation lacked flexibility and still had significant resource usage, given the hardware constraints applicable to an IoT device.
Our framework addresses the shortcomings of prior work by integrating a neuromorphic cochlea-based CAR-IHC kernel inside a template-based SVM system. This kernel exhibits the nonlinearity needed for better classification and, at the same time, inherently provides robust feature extraction. The kernel has multiple tunable parameters that can be adjusted to get the best feature extraction for the application. The template-based SVM provides the flexibility of choosing the number of templates, which play the role of support vectors, as per the application. This avoids the additional storage requirement for support vectors. Flexibility, scalability, low resource usage, and low power make this framework ultralight and ideal for IoT deployment in many applications.
III Template-based SVM and In-filter Computing Formulation
Rooted in statistical learning theory, an SVM minimizes the structural risk by maximizing a classification margin over a set of training samples [6]. In the case of acoustic classification, where the input is a time-series signal, one can define a data vector (for training and inference) at each time instant, constructed from a window of previous samples of the signal. An SVM-based binary classifier produces a decision label corresponding to the data vector according to
(1)
where the decision function is given by
(2) 
Here, the support vectors are a subset of the training vectors, with a priori known decision labels. The kernel is a positive-definite function that is also chosen a priori and plays an important role in implementing nonlinear decision functions. The coefficients associated with each support vector and the bias are training parameters, determined by solving a standard quadratic-program-based training procedure [6]. Note that the memory required to implement an SVM inference engine in hardware is proportional to the number of support vectors, and hence numerous techniques exist in the literature to reduce this number using heuristic methods [14][40].
For conventional SVM-based acoustic classifiers [13], as shown in Fig. 2, the raw input signal is preprocessed by a feature extraction module or function, which maps it to a feature vector of fixed dimension, before being provided as input to the SVM kernel. Eq. (2) can then be re-expressed as
(3) 
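As an illustration of the conventional arrangement, the following minimal sketch evaluates an SVM decision function in which feature extraction and the kernel are separate stages. It assumes the standard form (a bias plus a sum over support vectors of coefficient times label times kernel value). The RBF kernel, the toy two-element feature extractor, and all function and parameter names are hypothetical choices made for this example, not taken from the paper.

```python
import math

def rbf_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel, one common positive-definite choice (assumed here)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def extract_features(window):
    """Hypothetical stand-in feature extractor: mean and mean absolute value."""
    n = len(window)
    return [sum(window) / n, sum(abs(s) for s in window) / n]

def svm_decision(window, support_features, labels, alphas, bias):
    """Return (label, f) with f = sum_s alpha_s * y_s * K(phi(x_s), phi(x)) + b."""
    phi_x = extract_features(window)
    f = bias
    # One kernel evaluation per stored support vector: memory and compute
    # both grow with the number of support vectors.
    for phi_s, y_s, a_s in zip(support_features, labels, alphas):
        f += a_s * y_s * rbf_kernel(phi_s, phi_x)
    return (1 if f >= 0 else -1), f

# Toy example: one "loud" and one "quiet" support vector.
support = [extract_features([0.9, -1.0, 1.1, -0.8]),
           extract_features([0.1, -0.1, 0.05, -0.1])]
loud_label, _ = svm_decision([1.0, -0.9, 1.0, -1.0], support, [1, -1], [1.0, 1.0], 0.0)
quiet_label, _ = svm_decision([0.1, -0.05, 0.1, -0.1], support, [1, -1], [1.0, 1.0], 0.0)
```

The loop makes the cost structure explicit: every stored support vector adds one kernel evaluation and one feature vector of storage.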
In the literature, the kernel function and the feature extraction function are typically chosen independently. As a result, the memory footprint of the SVM is determined by the complexity of the problem and the discriminative power of the feature extraction. Note that a typical acoustic feature extraction function itself comprises several nonlinear transformations that could directly be used as SVM kernels. However, for the SVM formulation to be valid, the nonlinear transformations must be mapped to a positive-definite kernel. In [20] we reported a mechanism to design SVMs using arbitrary template functions within a fixed memory footprint. The approach expresses the kernel in eq. (2) as an outer product over a set of template functions. The template functions can then represent feature extraction modules. Following the derivations in [20], the SVM function can be rewritten as:
(4) 
Here, the weight attached to each template function can be viewed as a consolidated training parameter that can be estimated using the reduced-complexity training procedure described in Section IV. Note that the memory footprint of the reformulated template-SVM is determined by the number of template functions, and each template function can be chosen arbitrarily: a template can be any function that expresses the input features in a way that supports classification. This makes the framework flexible enough to implement various functions for classification. For example, Fig. 2 illustrates how the cascade of filters in a CAR-IHC feature extraction module could be used to implement the template functions, as described in the following section.
III-A CAR-IHC model as SVM Kernel
The biological cochlea is a nonlinear and causal system. This nonlinearity makes a cochlea model well suited for use as an SVM kernel, since it would give robust classification in a higher-dimensional space [10]. One such auditory filter model is the Cascade of Asymmetric Resonators with Fast-Acting Compression (CAR-FAC) model, which is a digital version of the cascade of pole-zero filters [43][22][38]. It consists of a CAR block, which mimics Basilar Membrane (BM) functionality, Inner Hair Cells (IHC), Ganglion cells, and Outer Hair Cells (OHC). We use the CAR and IHC modules of this model for our kernel.
Consider an input window of samples, as described previously. The system receives one audio sample at each sampling clock, which is fed to the first CAR block. The CAR blocks are arranged in a cascade, as shown in Fig. 3. Eq. (5) gives the two-pole two-zero filter that mimics the BM and is implemented as a CAR filter.
(5) 
The parameters in eq. (5) are the resonator filter coefficients for each filter, the pole-zero radius in the z-plane, and the DC gain factor. The CAR block transfer function is derived in detail in the Appendix.
For the first CAR filter, the input is the audio sample, and due to the cascaded arrangement of the CAR filters, the output of one filter becomes the input of the next stage. The output of each CAR filter, shown in Fig. 3, also feeds the IHC blocks in parallel. We use a simplified model of the IHC implemented using a half-wave rectifier (HWR), as per eq. (6).
(6) 
From Fig. 3, the input to the HWR in eq. (6) is the CAR filter output, which gives
(7) 
The IHC generates its output as per eq. (7). The IHC output is summed over the samples in the window, and this sum forms the input to the standardization (std) blocks in parallel.
(8) 
Here, .
(9) 
where with as the training samples, and
=
(10) 
The output of the IHC is summed over the samples in the window as per eq. (8) for each filter. A standardization technique, commonly used in neural network optimization [32], is then applied across the training input samples as per eq. (9). Note that the mean and standard deviation are calculated only during training, and these vectors are passed as learned parameters to the inference engine. In summary, an input signal sampled at the audio sampling frequency is processed, sample by sample, by a cascade-parallel arrangement of neuromorphic cochlea-based CAR-IHC filters to estimate the kernel function for each filter stage as per eq. (10). The output is a kernel vector, as shown in Fig. 3. The classification output is produced using eq. (4), employing this kernel vector, the output weight vector, and the bias obtained from the training process.
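The processing chain described above (cascaded two-pole two-zero filters, half-wave rectification, summation over the window, standardization, and a weighted sum with a bias) can be sketched as follows. This is a minimal illustration under stated assumptions: the generic biquad coefficients `(b0, b1, b2, a1, a2)` stand in for the CAR coefficients of eq. (5), eq. (4) is assumed to be a weighted sum of the standardized kernel outputs plus a bias, and all names are hypothetical.

```python
def biquad_step(x, coeffs, state):
    """Push one sample through a two-pole two-zero filter (direct form I)."""
    b0, b1, b2, a1, a2 = coeffs
    x1, x2, y1, y2 = state
    y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
    return y, (x, x1, y, y1)

def template_features(signal, filter_coeffs):
    """Cascade the filters, half-wave rectify each tap, and sum over the window."""
    states = [(0.0, 0.0, 0.0, 0.0)] * len(filter_coeffs)
    sums = [0.0] * len(filter_coeffs)
    for sample in signal:
        x = sample
        for p, coeffs in enumerate(filter_coeffs):
            # The output of stage p becomes the input of stage p + 1.
            x, states[p] = biquad_step(x, coeffs, states[p])
            # Simplified IHC: half-wave rectification, accumulated per filter.
            sums[p] += max(0.0, x)
    return sums

def standardize(sums, mean, std):
    """Apply per-filter mean and standard deviation learned during training."""
    return [(s - m) / d for s, m, d in zip(sums, mean, std)]

def classify(signal, filter_coeffs, mean, std, weights, bias):
    """Weighted sum of the standardized kernel vector plus bias; sign gives the label."""
    phi = standardize(template_features(signal, filter_coeffs), mean, std)
    f = sum(w * k for w, k in zip(weights, phi)) + bias
    return (1 if f >= 0 else -1), f

# Pass-through coefficients make the arithmetic easy to follow by hand.
identity = (1.0, 0.0, 0.0, 0.0, 0.0)
features = template_features([1.0, -2.0, 3.0], [identity, identity])
decision = classify([1.0, -2.0, 3.0], [identity, identity],
                    [4.0, 2.0], [1.0, 2.0], [1.0, 2.0], -1.0)
```

Only the per-filter weights, means, standard deviations, and filter coefficients need to be stored, so the memory footprint is set by the number of filters rather than by the training set.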
IV Template-SVM Training
Conventional SVM training involves solving a quadratic optimization problem over the set of training data [6]. The optimization can be expressed as:
(11)  
(12)  
(13) 
Here, the regularization hyperparameter is chosen through cross-validation. Due to the quadratic nature of the optimization problem, the worst-case complexity of SVM training grows polynomially with the number of training samples. In practice, the training complexity is governed by the number of support vectors. However, the number of support vectors is unknown in advance, so any SVM formulation has to accommodate the worst-case scenario.
In the template-based SVM, the kernel is expressed as an outer product over a set of templates. Substituting this form into eq. (11), the template-SVM training reduces to a lower-complexity quadratic optimization problem:
(14)  
(15)  
(16)  
(17) 
Here, the sample index runs over the training samples. Eqs. (15), (16), and (17) are the constraints imposed on eq. (14). Eq. (14) shows that the optimization complexity is reduced, with additional constraints, to a level that is controlled by the number of templates required for the application. This reduced complexity allows the training algorithm to run on IoT devices, making them adaptive and deployable in dynamic environments. For in-filter computing, the features or templates are computed from the input stream, and training the template-SVM entails solving a simplified constrained quadratic problem, so the architecture can be trained in an online manner. In a traditional SVM, by contrast, the features would first need to be accumulated and fed to a kernel module. Note that there are several ways to efficiently solve the constrained optimization in eq. (14), including both batch and online variants. In the accompanying software for the template-based SVM [33], we have used a growth-transform-based approach [9] to solve eq. (14).
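To give a feel for why training over a fixed number of templates is light enough for streaming use, the sketch below fits one weight per template plus a bias. It is not the growth-transform solver used in [33], and it does not reproduce eqs. (14) to (17). As a plainly labeled stand-in it runs subgradient descent on a regularized hinge loss, and every name and hyperparameter value is an assumption made for illustration.

```python
def train_linear_hinge(features, labels, c=1.0, epochs=200, lr=0.01):
    """Stand-in solver: subgradient descent on 0.5*||w||^2 + c * sum(hinge loss)."""
    num_templates = len(features[0])
    w = [0.0] * num_templates   # one weight per template, so memory is O(P)
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        for phi, y in zip(features, labels):   # samples can arrive as a stream
            margin = y * (sum(wi * xi for wi, xi in zip(w, phi)) + b)
            # Regularization shrink, spread evenly over the samples.
            w = [wi * (1.0 - lr / n) for wi in w]
            if margin < 1.0:
                # Margin violated: move the weights toward this sample.
                w = [wi + lr * c * y * xi for wi, xi in zip(w, phi)]
                b += lr * c * y
    return w, b

def predict(w, b, phi):
    """Sign of the weighted template sum plus bias."""
    return 1 if sum(wi * xi for wi, xi in zip(w, phi)) + b >= 0 else -1

# Toy standardized template outputs for two classes.
toy_features = [[2.0, 0.0], [1.5, 0.5], [-2.0, 0.0], [-1.5, -0.5]]
toy_labels = [1, 1, -1, -1]
weights, bias = train_linear_hinge(toy_features, toy_labels)
```

Each update touches only the weight vector and the bias, which is what makes recalibration on the device plausible when the number of templates is small.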
V FPGA Implementation of our SVM classifier
We demonstrate the efficiency of our in-filter computing architecture by implementing it on an FPGA. The system is configurable, as the in-filter parameters and weight vectors are tunable based on the application. The weight vector and bias are trained offline, as mentioned in the previous sections. Initially, we simulated this framework in floating point in MATLAB. To arrive at an appropriate FPGA implementation, we then simulated the model in fixed-point code. The CAR-IHC kernel is implemented in 12-bit fixed point. The trained weights and bias are stored as 8-bit values, and the mean and standard deviation are stored as 12-bit values. In our experiments, we found that using 12 bits for inputs, filter coefficients, and standardization parameters (mean and standard deviation) with 8 bits for weights and bias resulted in minimal accuracy degradation and reduced hardware resource utilization. We use pipelining to speed up the kernel execution on the FPGA. The CAR-IHC kernel filters are executed using a time-multiplexed technique in which each filter uses the same hardware to generate its output, which keeps the design small in area.
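The word lengths above can be explored in software with a small fixed-point helper such as the one sketched below. The text gives only total word lengths (12 bits and 8 bits), so the split between integer and fractional bits used here is an assumption, as are the function names.

```python
def quantize(value, total_bits, frac_bits):
    """Round to a signed fixed-point integer code, saturating at the word limits."""
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    code = int(round(value * scale))
    return max(lowest, min(highest, code))

def dequantize(code, frac_bits):
    """Convert a fixed-point integer code back to a real value."""
    return code / (1 << frac_bits)

# Assumed formats: 12-bit words with 10 fractional bits, 8-bit words with 6.
coeff_code = quantize(0.5, 12, 10)       # fits comfortably
saturated_code = quantize(3.0, 12, 10)   # exceeds the range and saturates
weight_code = quantize(-0.3, 8, 6)
weight_value = dequantize(weight_code, 6)
```

Comparing `weight_value` with the original `-0.3` shows the rounding error introduced by an 8-bit weight under this assumed format.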



| Parameter | Value |
| --- | --- |
| Clock Frequency | 25 MHz |
| Audio Sampling Frequency | 16 kHz |
| Number of Filters | 30 |
| Dynamic Power | 8 mW |
| DSPs | 4 (Total 10) |
| LUTs | 1517 (Total 3750) |
| FFs | 2864 (Total 7500) |
Fig. 4 shows the hardware execution flow, and Fig. 5 shows the FPGA block architecture of the system. The input sample, sampled at the audio sampling frequency, is provided as input to the CAR block together with the filter coefficients, which are stored in the FCMEM memory block. The CAR block performs filtering as per eq. (5), followed by the IHC block, which performs half-wave rectification as per eq. (6). The detailed implementation of the CAR-IHC block is shown in Fig. 6. The time-multiplexed processing of each cascaded filter is determined by a select signal. If this signal is set, then the next sample enters the CAR block. This signal is set only when the processing of all the filters is done. Hence, the same CAR block is used to process all the filters using the stored filter coefficients for each filter. Dedicated select lines control this filter coefficient flow. The input-to-output delay is therefore directly proportional to the number of cascaded blocks, i.e., the number of filters used. The multipliers are designed to operate in a pipelined manner, where multiplication of the coefficients takes multiple clock cycles to produce the output by reusing the hardware. The half-wave rectification operation in the IHC block depends on the most significant bit (MSB) select signal, which carries the sign bit of the CAR output. This results in discarding the values below zero to produce the rectified output.
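The sign-bit based rectification can be mimicked in software on two's-complement words, as in the sketch below. The 12-bit default word length and the helper names are assumptions made for this illustration.

```python
def to_twos_complement(value, bits=12):
    """Encode a signed integer as an unsigned two's-complement word."""
    return value & ((1 << bits) - 1)

def hwr_msb(word, bits=12):
    """Half-wave rectify a two's-complement word by inspecting its MSB."""
    msb = (word >> (bits - 1)) & 1
    # MSB set means the value is negative, so the output is forced to zero.
    return 0 if msb else word & ((1 << bits) - 1)

negative_out = hwr_msb(to_twos_complement(-5))
positive_out = hwr_msb(to_twos_complement(100))
```

No comparator or arithmetic is needed: a single bit selects between the word and zero, which is why the operation is cheap in hardware.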
The output of the IHC block is summed over the entire window of the input data, i.e., over the input samples for each filter. Standardization of the summation output is performed based on eq. (9) in the STD block. The accumulated sum is 26 bits wide due to the 16000 additions, and the standardization parameters, i.e., the mean and standard deviation, are 12-bit values fetched from the SMEM memory block. We quantize the sum to 12 bits and perform a further quantization to 8 bits after the standardization operation. Heuristically, we found that using these multiple quantization levels achieves the lowest hardware resource utilization without impacting the classification accuracy. The output of the standardization block is then used in a multiply-accumulate (MAC) operation with the learned weights, which are also 8-bit values, stored in the WMEM memory block. Finally, the classified output is obtained after the bias, stored in the BMEM memory block, is added to the output of the MAC block, as per eq. (4).
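The accumulator width quoted above follows from a simple worst-case bound: summing a number of values adds the base-2 logarithm of that number, rounded up, to the word length. The sketch below checks this bound, assuming 12-bit rectified samples for the 26-bit case; the function name is hypothetical.

```python
import math

def accumulator_bits(sample_bits, num_additions):
    """Worst-case width needed to sum num_additions values of sample_bits each."""
    return sample_bits + math.ceil(math.log2(num_additions))

width_12bit = accumulator_bits(12, 16000)
width_16bit = accumulator_bits(16, 16000)
```

With 16000 additions the growth is 14 bits, so a 12-bit sample leads to a 26-bit accumulator, and a 16-bit sample would lead to 30 bits.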
Our system uses a 25 MHz system clock and a 16-bit input sampled at a 16 kHz audio sampling rate. We have used 30 filters in all the results reported here. However, this number is parameterizable and can be changed based on the application requirements. Every input data sample takes about 300 clock cycles, i.e., 12 µs, to be executed through the pipeline. An audio input sample is given to the system every 1562 clock cycles, i.e., 62.5 µs. This leaves a buffer of around 1200 clock cycles, i.e., about 48 µs, before the arrival of the next sample, during which the system is idle. This shows that we can increase the sampling frequency to 80 kHz, i.e., sample every 312 clock cycles, without impacting the hardware architecture. Alternatively, at a 16 kHz sampling frequency, we can increase the number of filters up to 120 to use up the extra 48 µs. The number of clock cycles required to process a single audio sample increases linearly with the number of filters. Each additional filter increases the overall hardware area by about 2.5% and the power by 0.23 mW. An increase in the number of filters may be required for complex auditory tasks. We can also reduce the operating frequency to as low as 400 kHz to reduce the power consumption of the system to below 1 mW. We can reduce it further to a few kHz if we reduce the input sampling frequency to a few Hz for other time-series data such as EEG/ECG signals. Hence, our system is highly flexible and scalable to suit applications in the time-series domain.
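The timing figures above can be reproduced with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check using the quoted clock frequency, sampling rate, and per-sample cycle count; the function and key names are hypothetical.

```python
def timing_budget(clock_hz, sample_rate_hz, cycles_per_sample):
    """Relate the processing latency per sample to the sampling period."""
    period_cycles = clock_hz / sample_rate_hz
    return {
        "period_cycles": period_cycles,                       # cycles between samples
        "latency_us": cycles_per_sample / clock_hz * 1e6,     # processing time per sample
        "idle_cycles": period_cycles - cycles_per_sample,     # slack before next sample
        "max_sample_rate_hz": clock_hz / cycles_per_sample,   # rate with zero slack
    }

budget = timing_budget(25e6, 16e3, 300)
```

For these inputs the period is 1562.5 cycles, the latency is 12 µs, and the zero-slack sampling rate is roughly 83 kHz, in line with the 80 kHz quoted in the text.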
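| Classes (Train/Test) | Traditional SVM: support vectors | Traditional SVM, floating pt.: Train (%) | Traditional SVM, floating pt.: Test (%) | In-filter SVM (this work), floating pt.: Train (%) | In-filter SVM (this work), floating pt.: Test (%) | In-filter SVM (this work), fixed pt.: Train (%) | In-filter SVM (this work), fixed pt.: Test (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Dog (129/33) | 51 | 88 | 87 | 89 | 90 | 89 | 87 |
| Rain (119/40) | 60 | 89 | 87 | 89 | 87 | 82 | 82 |
| Sea_Waves (200/50) | 113 | 86 | 82 | 84 | 78 | 80 | 74 |
| Crying Baby (144/49) | 58 | 91 | 83 | 91 | 87 | 93 | 85 |
| Clock Tick (114/50) | 86 | 91 | 86 | 92 | 88 | 92 | 85 |
| Person Sneeze (101/44) | 43 | 83 | 77 | 89 | 82 | 82 | 80 |
| Helicopter (197/50) | 48 | 94 | 88 | 96 | 90 | 95 | 85 |
| Chainsaw (99/34) | 39 | 92 | 85 | 93 | 85 | 93 | 82 |
| Rooster (124/54) | 36 | 92 | 94 | 93 | 96 | 93 | 96 |
| Fire Crackling (152/66) | 46 | 90 | 89 | 90 | 87 | 89 | 86 |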
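| Classes (Train/Test) | Traditional SVM: support vectors | Traditional SVM, floating pt.: Train (%) | Traditional SVM, floating pt.: Test (%) | In-filter SVM (this work), floating pt.: Train (%) | In-filter SVM (this work), floating pt.: Test (%) | In-filter SVM (this work), fixed pt.: Train (%) | In-filter SVM (this work), fixed pt.: Test (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Theo (761/254) | 247 | 92 | 91 | 93 | 91 | 89 | 88 |
| Nicolas (889/297) | 197 | 98 | 98 | 98 | 97 | 97 | 94 |
| Yweweler (749/250) | 196 | 92 | 90 | 94 | 91 | 89 | 88 |
| Jackson (796/200) | 35 | 99 | 99 | 99 | 99 | 99 | 98 |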
We use the Xilinx Spartan series part xc7s6cpga196, a low-power FPGA manufactured on a 28 nm technology node. The Spartan series caters to edge computing and IoT platforms, as its area footprint and power envelope are low. The dynamic power consumption of our system on the FPGA is 8 mW. The logic design, i.e., LUTs and registers, consumes around 2 mW, whereas the control signals take up 4 mW. The DSPs consume 1 mW, and the clocks take up the remaining 1 mW. The weights and bias generated by the training procedure are quantized to 8 bits, and the CAR-IHC model uses 16-bit input samples to generate a 16-bit output. This 16-bit kernel output grows to 30 bits on accumulation over 16k samples and is then reduced to 8 bits after the standardization and quantization operations. For this design, a 16 kHz sampling rate with 30 filters uses 1517 LUTs and 2864 FFs, as summarized in Table I. This shows that the system can be implemented with a small area and low power and is hence suitable for deployable IoT edge devices.
Table II compares the resource utilization of related work with that of our system. Our work has the advantage of lower resource utilization compared to other works. Most of these systems use acoustic signals as input, with MFCC as the feature extractor and SVM as the classification algorithm. MFCC is a widely used feature extractor for acoustic signals since it extracts linearly separable features for most acoustic signals. Our framework uses a neuromorphic cochlea-based kernel that acts as a feature extractor as well as a nonlinear kernel for the SVM algorithm. This avoids the need for a separate feature extractor, unlike other works. Our framework also does not require separate storage for support vectors, and at the same time, we have control over the number of weights that have to be stored for the required application. Another advantage of our system over other systems is that it is highly tunable and scalable to input signals with higher or lower sampling frequencies.
VI Results and Discussion
We use datasets from two domains, namely speech and environmental sounds. The speech dataset demonstrates the usability of our framework in security applications such as voice-based access, where we can identify the speaker and provide biometric access. The environmental sounds showcase the framework's versatility, since it can be deployed with many different sounds as the target for robust classification. We use MATLAB for software simulations and verification of the algorithm. The FPGA design implements the MATLAB code using fixed-point arithmetic.
The Environmental Sounds Classification (ESC-10) dataset [28] consists of sound clips constructed from recordings publicly available through the Freesound project. It consists of 400 environmental recordings in 10 classes, i.e., 40 wav-format clips per class at 5 seconds per clip. These clips contained a lot of silence, so we trimmed the silent parts and split the remaining audio into 1-second clips belonging to the same class, thus increasing the number of samples in the dataset. Table III shows the class labels, which depict the wide variety of data samples used. The classes include dog bark, rain, sea waves, crying baby, clock ticking, person sneezing, helicopter, chainsaw, crowing rooster, and fire crackling. Here, the dataset was used to create balanced classes to identify one class versus the other classes arranged randomly. The train and test accuracy values are shown with the numbers of training and test samples given in brackets. One thing to note from these results is that our framework could classify the sounds even with a small amount of data. We have compared our results with a traditional SVM, which uses inputs preprocessed by the same CAR-IHC filters. For the traditional SVM, we use the built-in MATLAB library with default settings. The number of support vectors for the traditional SVM is significantly higher than the number of filters used in our work, indicating that our framework can reach comparable accuracy with lower hardware resources and can be used in low-powered devices. As the number of samples is low, we see lower accuracy for classes like clock tick and person sneeze. These classes have a lot of overlapping information with other classes, causing confusion.
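The segmentation of a trimmed recording into 1-second clips, as described above, could be sketched as follows. The silence-removal step is not shown, any trailing remainder shorter than one segment is dropped (an assumption, since the text does not say how partial clips were handled), and the function name is hypothetical.

```python
def split_into_segments(samples, sample_rate_hz, segment_s=1.0):
    """Cut a recording into consecutive, non-overlapping fixed-length segments."""
    n = int(sample_rate_hz * segment_s)
    # Stop before any partial segment at the end of the recording.
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]

# Toy example: 10 samples at a 4 Hz "sampling rate" give two 1-second segments.
segments = split_into_segments(list(range(10)), 4)
```

Each segment inherits the class label of the recording it came from, which is how the number of samples per class grows beyond the original 40 clips.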
The Free Spoken Digit Dataset (FSDD) [17] is an open dataset consisting of recordings of spoken digits in wav files at a sampling rate of 8 kHz. The recordings are trimmed so that they have near-minimal silence at the beginning and end. It consists of 4 speakers with 500 recordings per speaker, amounting to 2000 recordings overall. These recordings are English pronunciations of each digit from 0 to 9 by each speaker. We use our framework to identify the speaker based on the recordings. We create sets of recordings of each speaker versus a random pool of the remaining speakers. We can tune our system to each speaker and obtain a classification that identifies whether or not our target speaker is speaking. Similar to the ESC-10 results, the FSDD results in Table IV also show that the traditional SVM requires many more support vectors than the number of filters used in this work. As the number of support vectors is significantly higher for the traditional SVM, we see a slight reduction in accuracy for a few classes with the in-filter SVM. For the proposed in-filter SVM, the number of template vectors is determined by the fixed number of filters. The training algorithm tries to find the best possible solution within this fixed constraint, and adhering to this constraint is one of the reasons for the reduction in accuracy. The other constraint of the proposed in-filter SVM approach is that the final solution is linear in the CAR-IHC (filter) function, so any nonlinear mapping is implemented only by the CAR-IHC function. In a standard SVM formulation that uses the CAR-IHC filter output as features, by contrast, there is additional nonlinearity in the kernel mapping. Thus, the traditional SVM may be able to exploit this cross-filter nonlinearity to achieve better accuracy. FSDD classification showcases the capability of the framework to identify the right person, which can be used in granting access to a secure area or facility.
We see from Tables III and IV that the number of support vectors for the traditional SVM is always greater than the number of templates, i.e., filters, used for the proposed SVM. Traditional SVM inference has a computational complexity that scales with the number of support vectors times the complexity of a linear kernel. In contrast, the complexity of the proposed work scales with the number of filters. Thus, the computational complexity of the traditional SVM increases with the number of support vectors. As a case study, we take the Yweweler class data from the FSDD dataset. The number of MAC operations required to classify this class with the traditional SVM is 5096, compared to 30 MAC operations for the in-filter SVM. MAC operations consume the most resources and, in turn, increase the power consumption in any hardware design. Hence, our framework is efficient in comparison to an equivalent SVM hardware implementation.
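The case-study numbers can be checked with a short calculation. The text reports 5096 MAC operations for 196 support vectors, which corresponds to 26 MACs per kernel evaluation. That per-kernel figure is inferred here from the ratio and is not stated in the text, and the function names are hypothetical.

```python
def svm_macs(num_support_vectors, macs_per_kernel):
    """MAC count when every support vector needs one kernel evaluation."""
    return num_support_vectors * macs_per_kernel

def template_macs(num_templates):
    """MAC count for the final weighted sum: one MAC per template output."""
    return num_templates

traditional = svm_macs(196, 26)   # 26 MACs per kernel is inferred from 5096 / 196
proposed = template_macs(30)
ratio = traditional / proposed
```

Under this assumption the traditional classifier needs roughly 170 times as many MAC operations in the classification stage as the 30-filter template classifier.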
We can tune the number of filters based on the application. The number of filters is determined by the tradeoff between the hardware constraints (memory and speed) and accuracy. Empirically, we were able to determine the optimum number of filters required for most datasets. Reducing the number of filters reduces the discriminatory information encoded by the features, and hence we observe a reduction in accuracy. As seen in Fig. 7, increasing the number of filters beyond a specific value yields only a marginal increase in accuracy. This marginal increase would come at the cost of higher latency and more hardware resources, as explained in Section V. We chose 30 filters to satisfy the constraints of our implemented design, and the same number of filters was used for all the datasets. This shows that we can fix the number of filters based on the constraints and still obtain comparable results.
We performed an experiment to check the classification robustness of our framework. For this purpose, we added white Gaussian noise to the test input signals from the existing dataset and observed the accuracies across different Signal to Noise Ratios (SNRs). We used a MATLAB function to add the white Gaussian noise to the signals. Fig. 8 shows the mean and variance of the test accuracy over 10 iterations of noise addition. Here, we see that our framework is quite robust when the training data also includes the added noise. With no noise in the training data, the test accuracy falls below 80 % as the SNR is reduced below 25 dB.
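The noise-injection step can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the MATLAB routine used in the experiments; the function names, the 440 Hz test tone, and the fixed seed are assumptions made for the example. The noise variance is chosen so that the ratio of measured signal power to noise power matches the requested SNR.

```python
import math
import random


def add_white_gaussian_noise(signal, snr_db, seed=None):
    """Return a copy of `signal` with white Gaussian noise at the given SNR (dB)."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Estimate the SNR (dB) from a clean signal and its noisy version."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(signal_power / noise_power)


# One second of a hypothetical 440 Hz tone sampled at 8 kHz.
fs = 8000
tone = [math.sin(2.0 * math.pi * 440.0 * n / fs) for n in range(fs)]
noisy_tone = add_white_gaussian_noise(tone, snr_db=25.0, seed=0)
snr_estimate = measured_snr_db(tone, noisy_tone)
```

Repeating the call with different seeds and SNR values gives the kind of sweep over noise realizations that a mean and variance plot of test accuracy requires.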
Time-series data  Low frequency (Hz)  High frequency (Hz)
Speech [34]  100  8k
Music [34]  40  18k
Accelerometers [24]  0.5  1.5k
ECG [2]  0.1  1k
EEG [35]  1  100
EMG [19]  24  400
In general, we can see from the dataset results that our work produces comparable results when the number of filters is close to the number of support vectors used in the traditional SVM. This shows that we can choose the number of filters beforehand and arrive at an acceptable accuracy for the required application without relying on the algorithm to decide this hardware parameter. For each type of dataset, we need to tune the filter parameters for efficient classification. We also need to determine the number of filters used, i.e., the template vectors in the SVM formulation, based on multiple runs. This makes our framework highly flexible and tunable as per the application’s needs. In all our experiments, we have used 30 filters. The fixed-point code consists of a 16-bit CAR-IHC kernel output generated using a 16-bit input, and the weight and bias are limited to 8-bit values. From the results across these datasets, we see that our framework is good at identifying a person using speech. The ESC-10 dataset results also exhibit the framework’s capability to classify inanimate sounds, which can be used in systems where such classifications trigger a more fine-tuned corrective action. Hence, by tuning the CAR filters to a certain frequency range, we can classify different time-series data as per Table V. Our framework can be configured for a wide range of frequencies, enabling it to work with various sensors that generate time-series data. This gives the flexibility of programming the framework for a specific application. Also, by determining the number of filters required for each type of classification, we can optimize the classification accuracy for any time-series data.
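To make the fixed-point word lengths concrete, the sketch below quantizes kernel outputs to 16-bit and the weights and bias to 8-bit signed integers before evaluating a linear decision. It is only an illustration: the number of fractional bits (15 for the kernel output, 7 for the weights and bias), the saturating rounding, and the function names are assumptions, since the text specifies only the word lengths.

```python
def quantize(value, total_bits, frac_bits):
    """Round to a signed fixed-point integer and saturate to the word length."""
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    q = int(round(value * (1 << frac_bits)))
    return max(lo, min(hi, q))


def fixed_point_decision(kernel_outputs, weights, bias, frac_k=15, frac_w=7):
    """Sign of sum(w * k) + b using 16-bit kernel outputs and 8-bit weights/bias."""
    k_q = [quantize(k, 16, frac_k) for k in kernel_outputs]
    w_q = [quantize(w, 8, frac_w) for w in weights]
    b_q = quantize(bias, 8, frac_w)
    # Each product carries frac_k + frac_w fractional bits.
    acc = sum(k * w for k, w in zip(k_q, w_q))
    # Align the bias to the accumulator scale before adding it.
    acc += b_q << frac_k
    return 1 if acc >= 0 else -1
```

For example, `fixed_point_decision([0.5, 0.25], [0.5, -0.25], 0.0)` evaluates the sign of a positive sum, while a bias of -0.5 with the same inputs flips the decision.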
VII Conclusion
In this paper, we have demonstrated a novel SVM-based acoustic classifier that uses the cochlea module simultaneously as the kernel and the feature extraction stage. Our algorithm does not require the kernel to be positive definite. This relaxation, compared to the traditional SVM, enabled us to use the cochlea-based CAR-IHC function as a kernel in our framework. Furthermore, the proposed system has the flexibility to handle different kinds of time-series data, as the kernel filter parameters can be tuned to the frequency range of the input signal. This template-based SVM has a fixed number of templates, in contrast to the varying number of support vectors in traditional SVMs. We can control the operating frequency by controlling the number of kernel filters, making it power efficient. This can be fine-tuned by matching the hardware constraints with the required application speed. Also, since the complexity of this SVM is low compared to the traditional SVM, our framework is capable of performing online training. This flexibility and dynamic behavior make the framework ideal for implementation in IoT edge devices. In this paper, we have demonstrated the hardware efficiency of the in-filter computing framework on an FPGA. However, this can be extended to custom hardware and used as a battery-powered edge device. In the future, we plan to deploy this framework in different environments as an edge classification device. The algorithm can also run on an embedded system such as a microcontroller for greater flexibility in programming the device. Leveraging the reprogrammability of our framework, we can build an IoT system that monitors various time-series data using a network of sensors placed at various locations for different applications.
The proposed system has several potential applications, ranging from identifying animal behaviour patterns for ecologists using sensors placed at strategic locations in a forest area to healthcare data analysis using wearable sensors that provide time-series data such as ECG or EEG. Based on bird calls or other animal sounds, we can track the presence of different rare species of wildlife in a particular environment over a period of time. In this case, we can remotely reprogram the hardware to detect different wildlife species, as many of these species might not be active in a specific region during a particular season. Similarly, such systems can be deployed for remote health care applications using signals like ECG/EEG or ultrasound for early disease detection [8] [18] [45], and for automating the industrial maintenance of machinery using time-series data produced by mounted sensors. All these deployments minimize human intervention and reduce errors caused by logistics issues. Since our system can classify rare events with very low power consumption, we can deploy it as an always-on system.
Symbols and Acronyms  Description  

Sign function  
Square root function  
Arithmetic mean function for a series  
Find that minimizes the function  
Such that  
Big O notation for complexity  
FPGA  Field Programmable Gate Array
SVM  Support Vector Machine
CAR-IHC  Cascade of Asymmetric Resonators with Inner Hair Cells
LUT  Look-Up Table
FF  Flip-Flop
UGS  Unattended Ground Sensor
DNN  Deep Neural Network
BNN  Binary Neural Network
KNN  K-Nearest Neighbour
MOAA  Modified One-Against-All
DWT  Discrete Wavelet Transform
RBF  Radial Basis Function
DSP  Digital Signal Processing
ROM  Read Only Memory
MFCC  Mel-frequency cepstral coefficient
SRAM  Static Random Access Memory
ODE  Ordinary Differential Equation
CARFAC  Cascade of Asymmetric Resonators with Fast-Acting Compression
OHC  Outer Hair Cell
BM  Basilar Membrane
HWR  Half Wave Rectifier
MSB  Most Significant Bit
MAC  Multiply-Accumulate
EEG  Electroencephalography
ECG  Electrocardiography
ESC-10  Environmental Sound Clips-10
FSDD  Free Spoken Digit Dataset
EMG  Electromyography
SNR  Signal to Noise Ratio
[CAR-IHC Filter Formulation]
A two-pole two-zero filter forms the asymmetric resonator, whose transfer function is given below:
(18) 
The two pole coupled form has a pair of conjugate poles ( and ):
(19)  
(20) 
where is the pole angle in the z plane. The conjugate zeros ( and ) are:
(21) 
where is the zero angle in the z plane. The zero radius is the same as the pole radius, . The condition for complex zeros becomes relevant for highfrequency channels, where :
(22) 
(23) 
Here, can be used to move the zeros and the poles simultaneously while is fixed. determines the distance between the poles and the zeros. The frequencies of the zeros are kept slightly higher than those of the poles. If we increase the value of , the poles and zeros move further apart, giving a slow roll-off at higher frequencies. On the other hand, if the value of is decreased to a low value, the poles and zeros move closer together, giving rise to a sharp roll-off that makes the response asymmetric. This sharp roll-off is similar to the characteristic exhibited by auditory filtering, and it also enhances frequency selectivity. In order to keep the pole frequency half an octave below the zero frequency, is kept the same as . To get unity gain at DC, we can solve for g as follows:
(24) 
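The structure described above (conjugate poles and zeros sharing one radius, zeros at a slightly higher frequency than the poles, and a gain chosen for unity response at DC) can be sketched as follows. This is a generic illustration and not the parameterization of Eqs. (18) to (24): the pole and zero angles are passed in directly, and the function names and example values are assumptions.

```python
import cmath
import math


def asymmetric_resonator(r, theta_pole, theta_zero):
    """Coefficients (b, a) of a two-pole two-zero filter with unity DC gain.

    Poles sit at r*exp(+/-j*theta_pole) and zeros at r*exp(+/-j*theta_zero).
    """
    a = [1.0, -2.0 * r * math.cos(theta_pole), r * r]
    b_unscaled = [1.0, -2.0 * r * math.cos(theta_zero), r * r]
    # DC corresponds to z = 1, where the response is sum(b) / sum(a).
    g = sum(a) / sum(b_unscaled)
    return [g * c for c in b_unscaled], a


def frequency_response(b, a, omega):
    """Evaluate H(z) at z = exp(j*omega), with omega in radians per sample."""
    z_inv = cmath.exp(-1j * omega)
    num = b[0] + b[1] * z_inv + b[2] * z_inv ** 2
    den = a[0] + a[1] * z_inv + a[2] * z_inv ** 2
    return num / den


# Hypothetical stage: zeros placed slightly above the poles in frequency.
b, a = asymmetric_resonator(r=0.95, theta_pole=0.5, theta_zero=0.6)
dc_gain = abs(frequency_response(b, a, 0.0))
peak = abs(frequency_response(b, a, 0.5))
notch = abs(frequency_response(b, a, 0.6))
```

With these example values the response peaks near the pole angle and drops sharply near the zero angle, which is the asymmetric roll-off discussed in the text.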
The zero-crossing times of the filter’s impulse response do not change, even when we change .
(25) 
where controls the damping factor, is defined in Eq. (26) and is the sampling frequency. keeps the damping away from zero and also makes the damping bounded. Changing means varying the poles and the zeros of the filter. This satisfies the biologically observed condition where variation in stimulus level does not vary the impulse response zero crossings [22]. For each cascade stage, the initial values for zeros and poles are set. The Greenwood map function [12] is used to choose equidistant poles of the two-pole two-zero resonator. These are placed along the normalized length of the cochlea.
(26) 
where, is the frequency of the pole and is the normalized position along the cochlea, varying from 0 at the apex of the BM, to 1 at the basal end.
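The pole placement can be illustrated with a short Python sketch of the Greenwood map. The constants used here (A = 165.4, a = 2.1, k = 0.88) are the commonly quoted human values of Greenwood's function and are an assumption; they may differ from the parameters used in Eq. (26). The function names and the high-to-low ordering of the stages are likewise hypothetical.

```python
def greenwood_frequency(x, A=165.4, a=2.1, k=0.88):
    """Greenwood map: frequency in Hz at normalized cochlear position x,
    where x = 0 is the apex and x = 1 is the basal end."""
    return A * (10.0 ** (a * x) - k)


def pole_frequencies(num_filters, x_low=0.0, x_high=1.0):
    """Pole frequencies for positions equally spaced between x_high and x_low,
    ordered from the basal (high-frequency) end toward the apex."""
    if num_filters == 1:
        return [greenwood_frequency(x_high)]
    step = (x_high - x_low) / (num_filters - 1)
    return [greenwood_frequency(x_high - i * step) for i in range(num_filters)]


# Thirty stages spread over the full normalized length (illustrative only).
freqs = pole_frequencies(30)
```

In practice the position range would be restricted through `x_low` and `x_high` so that all pole frequencies fall inside the band of interest and below half the sampling frequency.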
Acknowledgment
This research was supported in part by the INSPIRE faculty fellowship (DST/INSPIRE/04/2016/000216), the SPARC grant (SPARC/20182019/P606/SL) from the Ministry of Human Resource Development, and the IMPRINT Grant IMP/2018/000550 from the Department of Science and Technology, India. The authors would like to acknowledge the joint Memorandum of Understanding (MoU) between the Indian Institute of Science, Bangalore and Washington University in St. Louis for supporting this research activity.
References
 [1] (2020) FPGA implementations of svm classifiers: a review. SN Computer Science 1, pp. 1–17. Cited by: §II.
 [2] (2018) A survey on ecg analysis. Biomedical Signal Processing and Control 43, pp. 216–235. Cited by: TABLE V.
 [3] (2018) Efficient fpgabased architecture of an automatic wheeze detector using a combination of mfcc and svm algorithms. Journal of Systems Architecture 88, pp. 54–64. Cited by: §I, §II, §II, TABLE II.
 [4] (2000, August 8) Multiuser remote health monitoring system. Google Patents. Note: US Patent 6,101,478 Cited by: §I.
 [5] (2007) Submicrowatt analog vlsi trainable pattern classifier. IEEE Journal of SolidState Circuits 42 (5), pp. 1169–1179. Cited by: §I.
 [6] (1995) Supportvector networks. Machine learning 20 (3), pp. 273–297. Cited by: §III, §IV.
 [7] (2013) Hardwarebased support vector machine for phoneme classification. In Eurocon 2013, pp. 1701–1708. Cited by: §I, §II, TABLE II.
 [8] (2019) Optoacoustic imaging and grayscale us features of breast cancers: correlation with molecular subtypes. Radiology 292 (3), pp. 564–572. Cited by: §VII.
 [9] (2017) Extended polynomial growth transforms for design and training of generalized support vector machines. IEEE transactions on neural networks and learning systems 29 (5), pp. 1961–1974. Cited by: §IV.
 [10] (2002) Mercer kernelbased clustering in feature space. IEEE Transactions on Neural Networks 13 (3), pp. 780–784. Cited by: §IIIA.
 [11] (1999) Detection and classification for unattended ground sensors. In 1999 Information, Decision and Control. Data and Information Fusion Symposium, Signal Processing and Communications Symposium and Decision and Control Symposium. Proceedings (Cat. No. 99EX251), pp. 419–424. Cited by: §I.
 [12] (1990) A cochlear frequency position function for several species 29 years later. The Journal of the Acoustical Society of America 87 (6), pp. 2592–2605. Cited by: §VII.
 [13] (2003) Contentbased audio classification and retrieval by support vector machines. IEEE transactions on Neural Networks 14 (1), pp. 209–215. Cited by: §III.
 [14] (2009) Support vector reduction in svm algorithm for abrupt change detection in remote sensing. IEEE Geoscience and Remote Sensing Letters 6 (3), pp. 606–610. Cited by: §III.
 [15] (2017) CNN architectures for largescale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 131–135. Cited by: §I.
 [16] (1994) Detection of human speech in structured noise. In Proceedings of ICASSP’94. IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 2, pp. II–237. Cited by: §I.
 [17] (2016) Free spoken digits dataset. Note: https://github.com/Jakobovski/freespokendigitdataset Cited by: §I, §VI.
 [18] (2020) Contrastenhanced us with sulfur hexafluoride and perfluorobutane for the diagnosis of hepatocellular carcinoma in individuals with high risk. Radiology 297 (1), pp. 108–116. Cited by: §VII.
 [19] (1979) EMG frequency spectrum, muscle structure, and fatigue during dynamic contractions in man. European journal of applied physiology and occupational physiology 42 (1), pp. 41–50. Cited by: TABLE V.
 [20] (2019) Neuromorphic inmemory computing framework using memtransistor crossbar based support vector machines. In 2019 IEEE 62nd International Midwest Symposium on Circuits and Systems (MWSCAS), pp. 311–314. Cited by: §I, §III.
 [21] (1998) Filter cascades as analogs of the cochlea. In Neuromorphic systems engineering, pp. 3–18. Cited by: §I.
 [22] (2017) Human and machine hearing. Cambridge University Press. Cited by: §I, §IIIA, §VII.
 [23] (2011) FPGA simulation of linear and nonlinear support vector machine. Journal of Software Engineering and Applications 4 (05), pp. 320. Cited by: §I, §II, TABLE II.
 [24] (2017) Accelerometer data collection and processing criteria to assess physical activity and other outcomes: a systematic review and practical considerations. Sports medicine 47 (9), pp. 1821–1845. Cited by: TABLE V.
 [25] (2012) Abnormal human activity recognition using svm based approach. In 2012 International Conference on Recent Trends in Information Technology, pp. 97–102. Cited by: §I.
 [26] (2014) A comparative study of the svm and knn machine learning algorithms for the diagnosis of respiratory pathologies using pulmonary acoustic signals. BMC bioinformatics 15 (1), pp. 1–8. Cited by: §I.
 [27] (2010) A novel fpgabased svm classifier. In 2010 International Conference on FieldProgrammable Technology, pp. 283–286. Cited by: §II.
 [28] (2015) ESC: Dataset for Environmental Sound Classification. Proceedings of the 23rd Annual ACM Conference on Multimedia, pp. 1015–1018. Cited by: §I, §VI.
 [29] (2019) IoT load classification and anomaly warning in elv dc picogrids using hierarchical extended knearest neighbors. IEEE Internet of Things Journal 7 (2), pp. 863–873. Cited by: §I.
 [30] (2009) SVM speaker verification system based on a lowcost fpga. In 2009 International Conference on Field Programmable Logic and Applications, pp. 582–586. Cited by: §I, §II, TABLE II.
 [31] (2017) Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Processing Letters 24 (3), pp. 279–283. Cited by: §I.
 [32] (1996) Effect of data standardization on neural network training. Omega 24 (4), pp. 385–397. Cited by: §IIIA.
 [33] (2018) Note: https://github.com/aimlabwustl/GiniSVMMicro Cited by: §I, §IV.
 [34] (1931) Audible frequency ranges of music, speech and noise. The Bell System Technical Journal 10 (4), pp. 616–627. Cited by: TABLE V.
 [35] (2010) EEG signal analysis: a survey. Journal of medical systems 34 (2), pp. 195–212. Cited by: TABLE V.
 [36] (2008) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39 (1), pp. 281–288. Cited by: §I.
 [37] (2006) Classification of acoustic events using svmbased clustering schemes. Pattern Recognition 39 (4), pp. 682–694. Cited by: §I.
 [38] (2014) FPGA implementation of the car model of the cochlea. In 2014 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1853–1856. Cited by: §IIIA.
 [39] (1995, October 10) Method and apparatus for detecting the presence of human voice signals in audio signals. Google Patents. Note: US Patent 5,457,769 Cited by: §I.
 [40] (2004) A heuristic training for support vector regression. Neurocomputing 61, pp. 259–275. Cited by: §III.
 [41] (2017) Binary deep neural networks for speech recognition. Proc. Interspeech 2017, pp. 533–537. Cited by: §I.
 [42] (2018) Xilinx spartan 7. Note: https://www.xilinx.com/support/ documentation/productbriefs/spartan7productbrief.pdf Cited by: §I.
 [43] (2018) A fpga implementation of the carfac cochlear model. Frontiers in neuroscience 12, pp. 198. Cited by: §I, §IIIA.
 [44] (2020) Secure and efficient knn classification for industrial internet of things. IEEE Internet of Things Journal 7 (11), pp. 10945–10954. Cited by: §I.
 [45] (2020) Labelfree visualization of early cancer hepatic micrometastasis and intraoperative imageguided surgery by photoacoustic imaging. Journal of Nuclear Medicine 61 (7), pp. 1079–1085. Cited by: §VII.
 [46] (2018) Software defined radio and wireless acoustic networking for amateur drone surveillance. IEEE Communications Magazine 56 (4), pp. 90–97. Cited by: §I.
 [47] (2017) Internet of missioncritical things: human and animal classification—a devicefree sensing approach. IEEE Internet of Things Journal 5 (5), pp. 3369–3377. Cited by: §I.