I Introduction
Internet of Things (IoT) applications require a large number of heterogeneous devices to be distributed in a given environment. Each device can potentially generate a large amount of data to be sent via wireless transmission, which affects its energy autonomy and lifetime. In addition, privacy concerns increase. One successful approach is to distribute the computation at the edge, i.e. to perform local preprocessing, but also advanced processing (e.g. machine learning, classification), directly on the wireless node, at "the thing" level [4, 33]. Thanks to recent improvements in embedded technology, computationally powerful microcontrollers with power consumption in the range of mW enable Artificial Intelligence (AI) at the edge. This reduces the amount of data transmitted, avoiding flooding the cloud with an enormous quantity of raw data, and the related power consumption.
Sound Event Detection (SED) is an example of an IoT application where this approach can make a difference. In fact, since we are interested in events and not in raw data, a near-sensor processing approach is appropriate. SED, as well as acoustic scene recognition, can benefit from understanding events locally where they happen, both in terms of privacy [9] and reaction time, which can be kept in the range of . Moreover, device lifetime can be extended up to several years of operation when energy harvesting is applied and transmission is limited to a few bytes. However, SED is a rather challenging task, especially in outdoor contexts. After some pioneering efforts [39, 45, 23], which did not lead to established solutions, recent progress in deep learning and the release of sound event datasets and challenges like UrbanSound8K [35], AudioSet [10], ESC-50 [28] and DCASE [24, 22] have reawakened interest in these applications, considerably improving performance and paving the way to further developments. Nevertheless, advances in accuracy and robustness of current acoustic event detection algorithms are achieved by using large neural networks, which are increasingly demanding in terms of computational power and memory. This prevents the development of applications for distributed monitoring in public spaces, which require a pervasive network of energy-neutral devices composed of cheap, low-power, low-complexity platforms. An attractive way to reduce the network complexity while preserving as much as possible its generalization capabilities is Knowledge Distillation (KD) [14]. Taking advantage of the redundancy that characterizes large networks [18], KD allows training small networks capable of mimicking large ones.
Going from state-of-the-art neural models to an actual implementation on an IoT device involves multiple stages, as depicted in Fig. 1. In our previous publication [2], we presented a KD approach to compress a SED classifier composed of the publicly available VGGish feature extractor [13] and a recurrent classifier.
Unlike common applications of KD, which aim at improving performance or at achieving limited reductions of the model size, we obtained very high compression factors, reducing the network size from approximately 70 million parameters to nearly 20 thousand. These results paved the way to the effective use of deep neural networks on a low-power microcontroller to enable SED.
In this paper, we focus in particular on i) a preliminary analysis of the computational and memory requirements, to understand what kind of models can be afforded by a given class of microcontrollers; ii) the quantization of the network parameters and activations, presenting two different strategies to select the best fixed-point representation for each layer; iii) an implementation of the reduced network on a microcontroller with resources typical of an IoT end-node, building upon the network reduction strategies presented in [2]; and iv) the evaluation of the accuracy of the actual implementation. In addition, we present an improvement of the KD approach presented in [2]. Distillation is performed in two stages, where the adaptation of the pretrained VGGish feature extractor to the in-domain data is separated from the actual parameter distillation, leading to a further improvement of the classification accuracy.
II Related Works
Nowadays, deep neural networks are the state of the art for many classification tasks, like image classification and speech recognition. The trend is to use networks of continuously increasing size and complexity because they generalize better than shallower ones [1]. As a consequence, the most recent and advanced solutions may not be practical if limited computational resources are available, as in the case of IoT application contexts. In fact, IoT nodes can be used as simple sensing devices, transmitting all the information directly to the cloud [15, 29]. Nevertheless, this transmission is usually expensive from an energy point of view; thus, processing data locally on the node can be preferred, especially when the throughput is considerably high (e.g., audio, inertial measurement unit, video). Therefore, in those IoT-related scenarios, reducing the network complexity while preserving as much as possible of its generalization capability is of significant interest. Fortunately, technology comes to help, since the microcontrollers available nowadays are low-cost, energy-efficient processing units with average computation capability that allows non-trivial on-board processing. Nevertheless, these systems still have some severe limitations, for example in terms of memory (limited to up to hundreds of kB).
Enabling advanced machine learning on IoT nodes is, therefore, of great interest and is becoming an attractive research topic for a variety of digital signal processing applications.
One way to achieve edge deep learning is to employ non-commercial platforms optimized for neural networks. As an example, the authors of [26] use a dedicated processing platform with state-of-the-art energy efficiency for an ultra-low-power, deep-learning-powered autonomous nano-drone. In this way, no particular effort is required in the network design, thanks to the capabilities of the device. Processing time and memory footprint are further reduced by properly quantizing the network. Quantization is also relevant in [3] where, in combination with the CMSIS-NN library, an embedded-C framework for neural networks developed for Cortex-M4/M7 based microcontrollers [17], it allows rapid and low-power classification of thermal images. However, the neural architecture is rather small (a three-layer Convolutional Neural Network (CNN)), and the resolution of the thermal images is low (8x8).
In the previous examples, either the device has adequate resources or the feature dimensionality is very small; thus, a deep learning algorithm can run on the embedded platform. The improvement obtained via quantization is somehow "imposed" by the fixed-point representation typical of an efficient microcontroller. However, for other classification tasks, in particular those involving the processing of audio streams, the approaches presented above are not practicable, because state-of-the-art solutions employ large neural networks as well as large feature vectors.
Keyword spotting is another field of interest that requires always-on smart devices close to the speaker and thus calls for energy-efficient embedded systems with on-board recognition capabilities. A common strategy is to design and implement small networks, expressly fitted to the hardware capacity, that can be trained from scratch and implemented on low-power, low-cost microcontrollers. The work in [38] shows the superiority of CNNs over fully connected deep neural networks in terms of performance, number of parameters and operations. Tang et al. implemented a set of residual neural networks with specific compact structures, focused on reducing the overall number of parameters and operations [30]. Zhang et al. implemented a keyword spotter on a commercial microcontroller using fixed-point quantization and a CMSIS-NN implementation [43], obtaining very short inference times.
A completely different strategy is to compress an existing model, generating a new network with a smaller memory footprint that effectively mimics the original one. In the literature, several approaches exist to reduce the number of parameters of a neural network. Network pruning [12] aims at detecting and removing unimportant weights from a trained network until a given stop condition is reached. Matrix decomposition uses a compact format to represent the dense weight matrix of the fully-connected layers using few parameters, preserving the expressive power of the layer [25]. Matrix/tensor factorization [8], which exploits the linear structure of networks [7], and vector quantization of weights [40] are other strategies to reduce the network memory size. These methods reduce the amount of memory needed to accommodate the network (e.g., by sharing the weights). However, they keep the same architecture, therefore requiring the same buffers (RAM) and throughput. Besides, network pruning requires a manual setup of the sensitivity for each layer and fine-tuning of the parameters.
A further way to achieve model compression is to reduce the weight representation to very few bits. For example, BinaryConnect [5] and the related Binary Weight Net [31] represent weights with only 2 bits. If properly trained, the quantized networks can achieve performance close to the original floating-point models also on complex classification tasks [19]. However, the memory and computational cost reductions are not sufficient for implementation on IoT devices where only a few KB are available (going from 32 to 2 bits results in a compression factor of 16 in the memory footprint). In fact, the experiments in [19] do not address the low-cost, low-power devices we are targeting here. Additionally, non-conventional frameworks are needed to train the quantized network.
An attractive approach is to compress networks into a different and simpler architecture via KD [buciluǎ2006model]. This approach is also referred to as Student-Teacher because the smaller network (student) is trained to mimic the output of the larger one (teacher) [36]. The underlying idea is that the output of the neural network (soft labels) is richer in information than the hard labels and makes the training easier [14]. An example of network compression related to SED is [16]. Starting from the L3 network for embedding extraction, trained through self-supervised learning of audio-visual correspondence in videos [6], Kumari et al. compress this network targeting small edge devices, such as "motes", that use microcontrollers and achieve long-life self-powered operation. The work investigates the merging of different compression techniques (pruning, KD) and highlights the increase of performance obtained by fine-tuning after compression.
Our proposed approach differs from those available in the literature in multiple directions, since it attempts to pull together the benefits of the methods reported above. We start from a large model and compress it. However, instead of just focusing on pruning or weight sharing, which provide limited memory reduction without decreasing the processing time, we design a minimal target network and use KD to train it (instead of training from scratch as in [38, 30, 43]). Note that distillation is typically employed to obtain limited network reductions while, in this paper, we target extremely high compression factors. On top of this heavy compression, we apply a stochastic weight and buffer quantization without the need for retraining the network.
The overview provided in this section shows the scientific interest in bringing intelligence to the edge in application domains such as computer vision and audio processing. Our work is positioned in this broad research field, focusing on SED. To the best of our knowledge, this is the first attempt to use a student-teacher approach to perform sound event recognition directly on an IoT end-node in just 34.3 kB of RAM and 5.5 mW of power consumption.
III Knowledge Distillation
In this section, we give a brief overview of the Student-Teacher approach. A more detailed review is available in [2], where we present our proposed distillation strategy based on a compound loss function. In addition, here we introduce a two-stage distillation that provides a small but significant performance improvement.
Considering a generic architecture of a neural network, where the classifier follows a feature extractor, distillation can take place in different parts of the network. Figure 2 graphically shows the idea behind the distillation process. The upper part represents the teacher network, the lower part is the student, and arrows indicate where the loss between the two networks is evaluated to train the student.
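The original approach [14] replaces the hard labels with the teacher output in the soft loss:
$$\mathcal{L}_{soft} = -\frac{1}{N} \sum_{x \in X} \sum_{c=1}^{C} \sigma_c\left(z^T(x)\right) \log \sigma_c\left(z^S(x)\right) \tag{1}$$
where $X$ is the set of input features, $N$ is the number of samples, $C$ is the number of classes, $c$ is a generic class, $\sigma_c(\cdot)$ is the $c$-th softmax output, and $z^T(x)$ and $z^S(x)$ are the logits of teacher and student respectively for input $x$.
A further strategy, in line with [32], is to make the student learn to replicate also the features produced by the teacher. The embedding loss is thus defined as:
$$\mathcal{L}_{emb} = \frac{1}{N} \sum_{x \in X} \left\| e^T(x) - e^S(x) \right\|^2 \tag{2}$$
where $e^T(x)$ and $e^S(x)$ are the feature vectors produced by teacher and student networks respectively.
In [2] we observed that the best solution is to combine different losses via a linear combination:
$$\mathcal{L} = \alpha \mathcal{L}_{hard} + \beta \mathcal{L}_{soft} + \gamma \mathcal{L}_{emb} \tag{3}$$
where $\alpha$, $\beta$ and $\gamma$ are the combination weights and $\mathcal{L}_{hard}$ is the standard cross-entropy using the hard (one-hot) labels $y$:
$$\mathcal{L}_{hard} = -\frac{1}{N} \sum_{x \in X} \sum_{c=1}^{C} y_c(x) \log \sigma_c\left(z^S(x)\right) \tag{4}$$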
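As a rough illustration of how the three terms of Eq. 3 could be combined in practice, the sketch below computes a hard cross-entropy, a soft cross-entropy against the teacher posteriors and an embedding distance, then mixes them linearly. The squared Euclidean distance used for the embedding term, the function names and the weights `w_hard`, `w_soft` and `w_emb` are assumptions made for illustration only; they are not values or code taken from [2].

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of one logit vector."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_norm for z in logits]


def soft_loss(teacher_logits, student_logits):
    """Cross-entropy between teacher and student posteriors for one sample."""
    log_p_t = log_softmax(teacher_logits)
    log_p_s = log_softmax(student_logits)
    return -sum(math.exp(lt) * ls for lt, ls in zip(log_p_t, log_p_s))


def hard_loss(label, student_logits):
    """Standard cross-entropy against an integer (one-hot) class label."""
    return -log_softmax(student_logits)[label]


def embedding_loss(teacher_emb, student_emb):
    """Squared Euclidean distance between embeddings (assumed form)."""
    return sum((t - s) ** 2 for t, s in zip(teacher_emb, student_emb))


def compound_loss(batch, w_hard=1.0, w_soft=1.0, w_emb=1.0):
    """Weighted sum of the three terms, averaged over the batch.

    Each item is (label, teacher_logits, student_logits,
    teacher_emb, student_emb). The weights are hypothetical.
    """
    total = 0.0
    for label, t_logits, s_logits, t_emb, s_emb in batch:
        total += (w_hard * hard_loss(label, s_logits)
                  + w_soft * soft_loss(t_logits, s_logits)
                  + w_emb * embedding_loss(t_emb, s_emb))
    return total / len(batch)


# Toy example with 3 classes and 2-dimensional embeddings.
teacher_logits = [2.0, 0.5, -1.0]
teacher_emb = [0.3, -0.7]
perfect = [(0, teacher_logits, teacher_logits, teacher_emb, teacher_emb)]
poor = [(0, teacher_logits, [-1.0, 0.5, 2.0], teacher_emb, [1.0, 1.0])]
print(compound_loss(perfect), compound_loss(poor))
```

A student that reproduces the teacher exactly still has a non-zero loss, because the soft term reduces to the entropy of the teacher posterior; only the embedding term vanishes.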
III-A Dataset
For SED in outdoor urban environments, four datasets are often used in the literature: UrbanSound8K [35], AudioSet [10], ESC-50 [28] and TUT Sound Events 2017 [24]. The latter is particularly attractive because it features real recordings and several comparative methods are available thanks to the related DCASE challenges [22]. Unfortunately, the task is very hard and the state-of-the-art accuracy is rather low. Moreover, the class distribution is highly unbalanced towards one class. Therefore, the dataset does not allow a fair evaluation of KD methods. ESC-50 is also well aligned with our application scenario, but its size is relatively small (3 minutes of audio per class) and does not allow generalizing the results. Finally, we also discarded AudioSet because its video-based labels, referring to scenes instead of isolated sound events, require considerable additional work to be aligned with the labels required for our analysis. Therefore, we focused on UrbanSound8K. It includes 8732 audio samples related to the city environment, with different sampling rates and numbers of channels, and a maximum length of 4 seconds. Each recording has a unique label among 10 possible classes: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, and street music. All clips are taken from Freesound (http://www.freesound.org), a vast collaborative database of audio samples. Following the recipe reported in [35], we use 10-fold cross-validation and average the scores. However, we set aside one additional fold for validation: 8 folds are used as training data, one as validation and the remaining one for test (the training/validation/test ratio is 0.8/0.1/0.1). The validation fold is the one whose index is one less than that of the test fold (for example, when the test fold is 9, the validation fold is 8). Performance is measured in terms of classification accuracy.
In all experiments, the dataset is augmented through pitch shifting [34], with both positive (up) and negative (down) shifts of -2, -1, +1 and +2 semitones.
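The fold assignment described above can be sketched as follows. The function name is hypothetical, and wrapping the validation fold around to fold 10 when the test fold is 1 is an assumption, since the text only gives the example of test fold 9.

```python
def split_folds(test_fold, n_folds=10):
    """Return (train, validation, test) fold indices for one round.

    Folds are numbered 1..n_folds. The validation fold is the one just
    before the test fold; wrapping from fold 1 to n_folds is an assumption.
    """
    if not 1 <= test_fold <= n_folds:
        raise ValueError("test_fold must be in 1..n_folds")
    val_fold = test_fold - 1 if test_fold > 1 else n_folds
    train = [f for f in range(1, n_folds + 1)
             if f not in (val_fold, test_fold)]
    return train, val_fold, test_fold


# Pitch-shift offsets (in semitones) applied for augmentation.
PITCH_SHIFTS = [-2, -1, 1, 2]

train, val, test = split_folds(9)
print(train, val, test)
```

Each round therefore trains on 8 folds, which matches the 0.8/0.1/0.1 ratio.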
Table I
Layer      VGGish/M  M       M       M      M
-----------------------------------------------------
Conv1      64        64      32      8      4
Pool1      x         x       x       x      x
Conv2      128       128     64      16     8
Pool2      x         x       x       x      x
Conv3      256       256     128     32     16
Conv4      256       -       -       -      -
Pool3      x         x       x       x      x
Conv5      512       256     128     64     16
Conv6      512       -       -       -      -
Pool4      x         x       x       x      x
Conv7      -         -       -       64     32
Pool5      -         -       -       x      x
FC1        4096      2048    512     256    64
FC2        4096      2048    -       -      -
FC3        128       128     128     128    128
BatchNorm  x         x       x       x      x
GRU        20        20      20      20     20
FC4        10        10      10      10     10
#Param     ~72.1M    ~18.0M  ~1.88M  ~202k  ~30.6k
III-B Teacher
State-of-the-art solutions for SED are mostly CNNs fed with mel-spectrograms [27, 44, 42]. However, to fully exploit the potential of the KD approach, rather than training our big CNN from scratch, we employed the publicly available VGGish (https://github.com/tensorflow/models/tree/master/research/audioset) feature extractor [13], followed by a classification stage tailored to UrbanSound8K. VGGish is trained on the YouTube-8M dataset [11] and is expected to generalize well to other application contexts. Note that this introduces a further novelty in our work, since distillation is performed on a dataset different from the one used in the original training. VGGish converts an input audio mel-spectrogram into a 128-dimensional embedding that can be used as input for a further classification model. The classifier can be shallow, as the VGGish embeddings are more semantically representative than raw audio features. The VGGish architecture is described in Table I. For all the convolutional layers the kernel size is 3, the stride is 1 and the activation function is ReLU. Max pooling layers are implemented with a 2x2 kernel and a stride of 2.
The classifier consists of a Gated Recurrent Unit (GRU) followed by a fully connected layer and maps the VGGish embeddings into the 10 classes of UrbanSound8K. A Batch Normalization layer is inserted between the feature extractor and the classifier to accelerate training.
VGGish expects as input 960 ms of audio signal sampled at 16 kHz. Each clip in UrbanSound8K is resampled and padded or cropped to get a length of 960*4 = 3840 ms. The resulting signal is divided into 4 non-overlapping frames. For each frame, the short-time Fourier transform is computed on 25 ms windows every 10 ms. The resulting spectrogram is integrated into 64 mel-spaced frequency bins, covering the range 125-7500 Hz, and log-transformed. This gives 4 patches of 96 x 64 bins that form the input of the VGGish.
III-C Distillation
In this section, we analyze how the compression factor impacts the final classification accuracy, using the standard loss and the compound loss described above. Although our goal is to fit the classifier on an IoT device, we consider four different degrees of compression, to better assess the potential and limits of the proposed approach, as reported in Table I. A heuristic adjustment of the layers and the number of filters is used to design the reduced networks. The subscript of each model M approximately represents the number of parameters of the upstream part. Note that the student networks drastically reduce the feature extractor only. The classifier is retrained, but it keeps the same architecture, i.e. a 20-unit recurrent layer followed by a 10-unit fully connected layer.
Figure 3 reports the results in terms of classification accuracy for the 4 models, both when training from scratch on UrbanSound8K and when distilling from the VGGish teacher. In addition, we consider the model M, which is a replica of the VGGish architecture. All models improve considerably thanks to knowledge distillation with respect to being trained from scratch. Note also how the large models (M, M) perform much worse than VGGish and slightly worse than M when trained from scratch: this is because UrbanSound8K is not large enough for such huge networks. Finally, it is worth noting that the VGGish baseline is also outperformed by all models with more than 2M parameters. This is mostly due to the domain adaptation that we are implicitly applying when distilling knowledge from the teacher. As a matter of fact, the student feature extractor is tailored to the new in-domain data from UrbanSound8K, providing an improvement over the more general-purpose VGGish feature extractor trained on YouTube-8M.
III-D Two-stage distillation
The previous analysis, showing that we are jointly performing domain adaptation and parameter reduction, prompted us to investigate whether a further improvement can be achieved by separating these two processes.
Figure 4 describes our proposed 2-stage distillation, where we train the smallest network using the M network as teacher. Note that M is still trained using the compound loss in Eq. 3. Results are reported in Table II in comparison with training from scratch and with distillation from VGGish. The proposed approach achieves a 72.67% accuracy on the test set, a 3-point improvement over the more traditional distillation strategy (69.7%). Interestingly, the new M model is just 2 points below the VGGish baseline and 4 points below M, in spite of using less than 0.0424% of the parameters and 0.12% of the operations.
Table II
Model  Standard Training  From VGGish  From M
---------------------------------------------
M      61.83              69.72        72.67
IV Hardware resources and network requirements
IV-A Approximate hardware requirements per model
One interesting aspect of tailoring our distilled networks to a resource-constrained platform is to understand the computational and hardware requirements of a given model or, conversely, what accuracy would be achievable with a given platform.
Table III
             VGGish  M       M       M       M
-------------------------------------------------
#Param       ~72.1M  ~18.0M  ~1.88M  ~202k   ~30.6k
#Operations  ~1.72G  ~608M   ~148M   ~13.6M  ~2.11M
#Buffer [B]  ~614k   ~602k   ~301k   ~76.0k  ~34.3k
Table III reports an approximation of the computational requirements of each network: number of parameters to store, number of operations and buffer sizes. For memory requirements, we refer to an implementation using 8-bit quantized weights and buffers, as specified in the CMSIS-NN library.
With 8-bit quantization, each parameter takes one byte; therefore, the non-volatile memory in bytes needed by a model equals the total number of weights and biases (first row of Table III).
Buffers keep the outputs of each layer available during propagation and are stored in the RAM. The size of these buffers depends on the output dimensionality of each layer. However, the total amount of runtime memory required depends heavily on how efficiently buffers are implemented. The third row of Table III reports the RAM requirements of each model given the buffer design described in section V.
The number of operations (both multiplications and sums) depends on the layer type: in convolutional and max-pooling layers, each output pixel comes from a filter application. Given the kernel size $K$ and the number of input channels $C_{in}$, each filter requires the following operations:
$$Ops_{filter} = 2 K^2 C_{in} \tag{5}$$
Therefore, the number of operations for each convolutional and max-pooling layer is:
$$Ops_{layer} = Ops_{filter} \cdot H_{out} W_{out} C_{out} \tag{6}$$
where $H_{out}$, $W_{out}$ and $C_{out}$ are the output height, width and number of channels respectively.
For dense and GRU layers with $N_{in}$ inputs and $N_{out}$ outputs, the number of operations is given by the matrix multiplications:
$$Ops_{dense} = 2 N_{in} N_{out} \tag{7}$$
The gated recurrent unit also requires three element-wise products.
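A minimal sketch of this bookkeeping is given below. It assumes one multiplication and one sum per weight (two operations per multiply-accumulate), stride-1 convolutions that preserve the spatial size, and a 96 x 64 single-channel input patch; pooling layers and the recurrent classifier are left out of the count, and the helper names are hypothetical. Under these assumptions the VGGish feature extractor of Table I comes out at roughly 72.1M parameters and 1.7G operations, close to the figures reported in Table III.

```python
def conv_layer(h, w, c_in, c_out, k=3):
    """Parameters and operations of a stride-1, size-preserving k x k conv."""
    params = k * k * c_in * c_out + c_out      # weights + biases
    ops = 2 * k * k * c_in * h * w * c_out     # one product + one sum per weight
    return params, ops


def dense_layer(n_in, n_out):
    """Parameters and operations of a fully connected layer."""
    return n_in * n_out + n_out, 2 * n_in * n_out


def vggish_cost(h=96, w=64):
    """Accumulate the cost over the VGGish layers listed in Table I."""
    # (output channels, followed by 2x2 pooling?) for Conv1..Conv6
    convs = [(64, True), (128, True), (256, False), (256, True),
             (512, False), (512, True)]
    params = ops = 0
    c_in = 1
    for c_out, pool in convs:
        p, o = conv_layer(h, w, c_in, c_out)
        params, ops, c_in = params + p, ops + o, c_out
        if pool:                               # 2x2 max pooling, stride 2
            h, w = h // 2, w // 2
    n_in = h * w * c_in                        # flattened feature map
    for n_out in (4096, 4096, 128):            # FC1, FC2, FC3
        p, o = dense_layer(n_in, n_out)
        params, ops, n_in = params + p, ops + o, n_out
    return params, ops


params, ops = vggish_cost()
print(f"{params / 1e6:.1f}M parameters, {ops / 1e9:.2f}G operations")
```

The small gap with respect to the operation count in Table III is expected, since the exact counting convention (pooling, biases, recurrent layer) is not reproduced here.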
IV-B Selection of the device class
Table IV shows a non-exhaustive list of processing platforms potentially adequate to be integrated into an end-device, with their power consumption and memory. In this section, we provide a qualitative analysis of which devices would be able to run the distilled models presented in the previous sections. This analysis is inevitably rough, as many figures are approximated and the actual values cannot be derived analytically. Additionally, we are not considering the time and resources required for other processing stages (e.g., mel-bin extraction); thus, the constraints are rather relaxed with respect to the actual application. Nevertheless, the devices in Table IV show substantial differences in Million Instructions Per Second (MIPS) and memory (in the order of powers of 10); therefore, the results of this study are still valid as long as we refer to classes of devices.
Board Name  Flash[KB]  RAM[KB]  Power [mW]  MIPS 

Arduino  32  2  60  20 
ChipKit uc32  512  32  181  124.8 
STM32L476RG  1024  128  26  80 
TI MSP432P4111  2048  256  23  58.56 
BeagleBone Black  Ext  524288  2300  1607 
Raspberry Pi 3 B+  Ext  1048576  5500  2800 
Computational Cost. To ensure real-time classification, the network must process each 960 ms audio frame (~1 second) before the next frame arrives. Thus, classification time must be shorter than 1 second. Unfortunately, converting exactly the Million Operations Per Second (MOPS) required by each model into the MIPS available on a given device, as reported in the datasheet, is not feasible. As a matter of fact, the number of instructions required by an operation depends on many factors, the most important being the actual implementation. Typically, instructions outnumber operations because each operation involves branch, load and store instructions. However, most 32-bit microcontrollers support Single Instruction Multiple Data (SIMD), which allows up to four operations per instruction. In this analysis, we assume that these two effects balance out and that, on average, the number of operations is equal to the number of instructions. In the next sections, we will show that the actual throughput (operations per instruction) is larger than 1, but it gets close enough (1.8) in some situations. Given this assumption, and since other processing stages are not being accounted for, the MIPS available on the device must be larger (with some margin) than the MOPS needed in one forward propagation.
Memory constraints are also relevant: the on-board RAM must contain the intermediate values and the non-volatile memory should contain all network parameters. However, non-volatile memory is not the main limit in our examples: whenever a network does not fit in the flash memory, it also violates one of the other two constraints (RAM or MIPS).
Figure 5 roughly compares the devices in Table IV in terms of RAM and MIPS limitations against the compressed models each device can afford.
None of the proposed models fits in the Arduino platform due to its limited RAM. The last two platforms of Table IV (BeagleBone Black and Raspberry Pi 3 B+) have enough RAM and MIPS to handle all models (although VGGish fits only in the Raspberry Pi 3 B+). However, their power consumption is in the range of Watts, which is not suitable for IoT applications. Therefore, only ChipKit uc32, STM32L476RG, and TI MSP432P4111, which are approximately in the same class, are potentially suitable to host the distilled models. The uc32 is faster than both the STM32 and TI platforms and would comfortably run the two smallest models in real time, but its 32 KB of RAM is not large enough to fit even the ~34.3 kB needed for the buffers of the smallest model M. In conclusion, M and M are the only suitable models for the target application and can be implemented on both the STM32L476RG and TI MSP432P4111 platforms. In the next section, we detail our implementation of M on the STM32L476RG platform.
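The comparison behind this selection can be sketched in a few lines. The snippet below copies the figures of Tables III and IV and applies the three checks discussed above (flash, RAM, MIPS), assuming one instruction per operation, one byte per parameter and roughly one second available per frame. The model labels and function names are hypothetical, chosen only for this illustration, and power consumption is not checked.

```python
# Figures copied from Table III (models) and Table IV (devices).
# Model labels are illustrative: students are tagged by parameter count.
MODELS = {  # label: (parameters, operations per frame, buffer bytes)
    "VGGish": (72.1e6, 1.72e9, 614e3),
    "student-18.0M": (18.0e6, 608e6, 602e3),
    "student-1.88M": (1.88e6, 148e6, 301e3),
    "student-202k": (202e3, 13.6e6, 76.0e3),
    "student-30.6k": (30.6e3, 2.11e6, 34.3e3),
}
DEVICES = {  # name: (flash KB, RAM KB, MIPS); None marks external storage
    "Arduino": (32, 2, 20),
    "ChipKit uc32": (512, 32, 124.8),
    "STM32L476RG": (1024, 128, 80),
    "TI MSP432P4111": (2048, 256, 58.56),
    "BeagleBone Black": (None, 524288, 1607),
    "Raspberry Pi 3 B+": (None, 1048576, 2800),
}


def fits(model, device):
    """True if the model passes the flash, RAM and throughput checks."""
    params, ops, buffers = MODELS[model]
    flash_kb, ram_kb, mips = DEVICES[device]
    flash_ok = flash_kb is None or params <= flash_kb * 1024  # 1 byte/weight
    ram_ok = buffers <= ram_kb * 1024
    speed_ok = ops <= mips * 1e6    # assumes 1 instruction per operation
    return flash_ok and ram_ok and speed_ok


def affordable(device):
    """Models from Table III that the given device can host."""
    return [m for m in MODELS if fits(m, device)]


for name in DEVICES:
    print(f"{name}: {affordable(name)}")
```

With these rough checks, the two microcontrollers with at least 128 KB of RAM can host only the two smallest students, which is consistent with the qualitative conclusion above.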
V Embedded programming
For the implementation of our SED system, we selected a Cortex-M4 architecture operating up to 80 MHz, which is a good compromise between efficiency in processing and flexibility in power management. In particular, we worked on an STM32L476RG Nucleo board as a development platform. Thus, the reference values for current consumption are in standby mode and in full running mode. The Floating Point Unit (FPU) supports single precision, and the core provides a full set of optimized Digital Signal Processing (DSP) instructions for fixed-point operations.
Cortex-M4 provides SIMD instructions that operate on 8- or 16-bit integers. They are powerful for processing data such as video or audio, when full 32-bit precision is not necessarily required. The SIMD instructions allow 2 x 16-bit or 4 x 8-bit operations to be performed in parallel [21].
V-A Quantization
Feed-forward propagation through a neural network requires vector/matrix/tensor multiplication and convolution. Therefore, all the core features employed in signal processing can be used for neural network computations. For this reason, ARM developed the CMSIS-NN framework for neural network propagation on top of DSP libraries [17]. The CMSIS-NN library maximizes performance and energy efficiency of common deep learning kernels on Cortex-M series cores.
As in DSP, truncation of floating-point numbers to 8- or 16-bit fixed-point numbers improves the execution time and reduces the memory footprint. According to [17], 8-bit quantization achieves a 4.6X improvement in runtime/throughput and a 4.9X improvement in energy efficiency in image classification with the CIFAR-10 dataset. On the other hand, since quantization implies a loss of precision, we could expect a direct impact on the final prediction performance. However, the authors of [20] experimented with different kinds of quantization in image classification, achieving a 20% drop in model size without significant loss of accuracy. In the following, we describe the quantization process from a 32-bit floating-point to an 8-bit fixed-point representation. From now on, we will assume that floating-point numbers have infinite precision and we will use the nomenclature typically used for DSP.
Quantization describes a real number with finite resolution. When a uniform quantization is applied, three parameters are used to define the fixed-point representation: bit-width, step-size (resolution) and dynamic range [20]. These parameters are related by the following expression:
$$Range \approx Stepsize \cdot 2^{Bitwidth - 1} \tag{8}$$
where the $-1$ accounts for the bit used to represent the sign. $Stepsize$ is the minimum step between two fixed-point numbers and will always be chosen as a power of two for convenience with the binary representation.
An equivalent formulation of Eq. 8 can be obtained by considering the number of bits used for the decimal and integer parts of numbers:
$$Range = 2^{I}, \qquad Stepsize = 2^{-D}, \qquad Bitwidth = I + D + 1$$
where $I$ is the number of bits used to represent the integer part and $D$ is the number of bits used for the decimal part. The $+1$ in the last equation accounts for the sign bit. Basically, given a fixed number of bits, increasing the range by increasing the number of bits for the integer part degrades the resolution. Conversely, decreasing $Stepsize$, by allocating more bits to the decimal part, leads to a range reduction.
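The trade-off between range and resolution can be made concrete with a small sketch of signed 8-bit fixed-point quantization. The function name and the saturating behaviour for out-of-range values are assumptions made for illustration; the sketch only shows how moving bits between the integer and decimal parts changes the result.

```python
def quantize(x, int_bits, bitwidth=8):
    """Quantize x to signed fixed point with `int_bits` integer bits.

    One bit is reserved for the sign, so the number of decimal bits is
    bitwidth - 1 - int_bits. Values outside the range saturate.
    """
    dec_bits = bitwidth - 1 - int_bits
    step = 2.0 ** -dec_bits                    # resolution (power of two)
    q_min, q_max = -(2 ** (bitwidth - 1)), 2 ** (bitwidth - 1) - 1
    code = max(q_min, min(q_max, round(x / step)))  # saturated integer code
    return code * step


for int_bits in (0, 3, 7):
    print(int_bits, quantize(0.3, int_bits), quantize(5.3, int_bits))
```

With no integer bits the small value is represented finely but 5.3 saturates just below 1; with 7 integer bits 5.3 is in range but 0.3 collapses to 0.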
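The quantization error is the difference between the infinite-precision number and the quantized representation:
$$e = x - Q(x) \tag{9}$$
where $x$ and $Q(x)$ are the input and the quantized output. If we treat the input as a random variable with a probability density function $f(x)$ (see [37]), the mean square quantization error is the combination of two different errors, namely the granular error ($MSE_{gran}$) and the overload error ($MSE_{over}$):
$$MSE_{gran} = \sum_{i=1}^{L} \int_{x_{i-1}}^{x_i} \left(x - Q(x)\right)^2 f(x)\, dx, \qquad MSE_{over} = \int_{-\infty}^{x_0} \left(x - Q(x)\right)^2 f(x)\, dx + \int_{x_L}^{+\infty} \left(x - Q(x)\right)^2 f(x)\, dx$$
where $L$ is the number of quantization levels (in our case $2^8 = 256$) and $x_i$ are the decision levels, i.e. any number between $x_{i-1}$ and $x_i$ is coded with the same fixed-point representation, usually $(x_{i-1} + x_i)/2$. The two mean square errors are related to $Stepsize$ and $Range$ respectively. If $Stepsize$ is reduced then $MSE_{gran}$ decreases; on the other hand, if $Range$ is made wider, $MSE_{over}$ decreases.
The figure of merit linked with the Mean Square Error (MSE) is the Signal to Quantization-Noise Ratio (SQNR), defined as:
$$SQNR = 10 \log_{10} \frac{E\left[x^2\right]}{E\left[\left(x - Q(x)\right)^2\right]} \tag{10}$$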
Therefore, the goal is to define the optimum trade-off between step-size and dynamic range to maximize the SQNR in the target application. In the following, we describe two different strategies to perform quantization using the quantization parameters and the figure of merit described above.
V-B Quantization Design
Applying quantization to neural networks has different requirements with respect to other contexts, such as image or audio quantization. In particular, we are not interested in preserving an accurate representation of all activation outputs or weights for each layer. Rather, we want the final prediction to be as close as possible to the prediction of the network with a 32-bit floating-point representation. In summary, accuracy is the most relevant metric.
An exhaustive search over all the possible integer/decimal combinations, evaluating the final accuracy, is not feasible in practice. In the smallest network, i.e. M, the number of integer/decimal splits to choose would be 28, and each of them can vary between 0 and 7 (setting the number of decimal bits determines the integer part when the bit-width is fixed). Therefore, the number of possible combinations is given by the permutation with repetition: $8^{28} \approx 1.9 \times 10^{25}$.
The two solutions presented in this paper target the maximization of the SQNR or the reduction of the overload error.
The first approach, based on the SQNR, applies quantization to each variable (weights, activations) one at a time. For the weights, we choose the number of decimal bits that maximizes the SQNR. For the intermediate outputs, we compute the SQNR by running a forward propagation in floating-point on the training dataset and testing all the possible integer/decimal values (in the range from 0 to ). Each activation output is analyzed independently, possibly leading to different quantization tradeoffs.
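A minimal sketch of this per-variable search is given below. It assumes an 8-bit signed format with one sign bit and computes the SQNR in dB as the ratio between signal energy and quantization-error energy; `sqnr_db` and `best_int_bits` are hypothetical helper names, not functions of the original script.

```python
import math


def quantize(x, int_bits, total_bits=8):
    """Round to the nearest fixed-point value and saturate (one bit is the sign)."""
    frac_bits = total_bits - 1 - int_bits
    step = 2.0 ** -frac_bits
    lo, hi = -(2.0 ** int_bits), 2.0 ** int_bits - step
    return min(max(round(x / step) * step, lo), hi)


def sqnr_db(values, int_bits, total_bits=8):
    """Signal energy over quantization-error energy, in dB."""
    signal = sum(v * v for v in values)
    noise = sum((v - quantize(v, int_bits, total_bits)) ** 2 for v in values)
    if noise == 0.0:
        return math.inf
    return 10.0 * math.log10(signal / noise)


def best_int_bits(values, total_bits=8):
    """Try every integer/decimal split and keep the one with the highest SQNR."""
    return max(range(total_bits), key=lambda i: sqnr_db(values, i, total_bits))


if __name__ == "__main__":
    activations = [3.7, -2.2, 1.5, 0.4]  # toy data, not taken from the paper
    print(best_int_bits(activations))
```

In practice `values` would be the weights of a layer, or the activations collected during a floating-point forward pass on the training set.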
This approach relies on [20], where the overall SQNR, here defined as , is the harmonic mean of the SQNRs of all preceding quantization steps:
(11)
It turns out that maximizing the single SQNRs maximizes the overall SQNR, regardless of where quantization happens (at the first layers or at the last layers).
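The following sketch illustrates why improving any single SQNR improves the overall one. It assumes that the relation takes the form used in [20], where the reciprocal of the overall SQNR (in linear scale) is the sum of the reciprocals of the per-step SQNRs; this differs from a strict harmonic mean only by a constant factor. The function name is illustrative.

```python
import math


def overall_sqnr_db(step_sqnrs_db):
    """Combine per-step SQNRs (in dB) assuming 1/overall = sum(1/step) in linear scale."""
    inverse_sum = sum(1.0 / (10.0 ** (s / 10.0)) for s in step_sqnrs_db)
    return 10.0 * math.log10(1.0 / inverse_sum)


if __name__ == "__main__":
    print(overall_sqnr_db([30.0, 30.0]))  # two equal steps
    print(overall_sqnr_db([30.0, 40.0]))  # second step improved
    print(overall_sqnr_db([40.0, 30.0]))  # first step improved instead
```

Under this assumption the combination is symmetric in its arguments, so the position of a quantization step in the network does not matter, and the overall value never exceeds the worst single step.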
The second approach is based on the different effects of granular and overload errors. It is reasonable to expect that a single number with a high quantization error due to overload will affect the whole forward pass of the neural network. Conversely, the ensemble of small granular quantization errors may not change the argmax of the softmax layer, which is the value used to determine the prediction and hence the accuracy. Therefore, we select the integer/decimal ratio that reduces the probability of having values out of the quantization range. In particular, we set the number of bits for the integer part such that
(12)
where is the floating-point input and is the probability of overload that one is willing to accept. This approach differs from the previous one for three reasons. (i) Values with a large overload error have the same importance as values with a small overload error, i.e. each overload counts uniformly. (ii) Only the overload error is taken into account, while the granular error is ignored. (iii) This second approach has a hyperparameter, i.e. the probability threshold, which requires fine-tuning on the training set. We heuristically set the threshold to .
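The selection rule can be sketched as follows, assuming that an overload occurs whenever the absolute value of the input reaches or exceeds two to the power of the number of integer bits, and that the probability is estimated as the fraction of observed values that overload. The threshold used in the example is a placeholder, not the value used in this work.

```python
def overload_probability(values, int_bits):
    """Fraction of values whose magnitude falls outside the representable range."""
    limit = 2.0 ** int_bits
    return sum(1 for v in values if abs(v) >= limit) / len(values)


def int_bits_for_threshold(values, threshold, total_bits=8):
    """Smallest number of integer bits whose overload probability is within the threshold."""
    for int_bits in range(total_bits):
        if overload_probability(values, int_bits) <= threshold:
            return int_bits
    return total_bits - 1  # fall back to the widest range available


if __name__ == "__main__":
    activations = [0.5] * 97 + [3.0, 5.0, 9.0]  # toy data, not taken from the paper
    print(int_bits_for_threshold(activations, threshold=0.015))
```

Choosing the smallest admissible number of integer bits leaves as many bits as possible to the decimal part, so the granular error is kept low once the overload constraint is met.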
We estimate the probability density function using the whole training set. The percentage of values out of the range is counted for all the possible decimal values involved in the network feedforward propagation.
The upper part of Figure 5 is the histogram of the absolute value of the activation outputs between two consecutive powers of two. Most of the values (more than ) are between -1 and 1 (or equivalently ). By contrast, just a small set of numbers (around ) are such that ; they are depicted in the fourth bar. The central part of Figure 6 depicts the relative distribution for each class of values, i.e. the estimated probability that a value falls inside a given range delimited by two consecutive powers of 2. The values confirm what the bars show: values in are the most likely and values in appear only with a probability of . Finally, the plot at the bottom depicts the SQNR obtained with the number of decimal bits available for a given range on the x axis (the range determines the bits for the integer part). For example, the leftmost point is the SQNR using 0 bits for the integer part (which allows describing numbers in the range ) and 7 bits for the decimal part. In this particular case, corresponding to one layer of the network, the optimum number of bits for the integer part obtained by maximizing the SQNR (three) does not correspond to the one obtained by setting the probability threshold to (four).
V-C Firmware Programming
To transfer the neural model to CMSIS-NN, we developed an automatic Python script that exports the Keras model to a header file containing the weights to use in the firmware. To do so, we first reordered the weights following the conventions of both the source and the destination framework, then we quantized the weights using both quantization schemes. In this work, we set the total number of bits to 8, since we privilege execution time and power consumption over accuracy. Note that we implemented a slightly different version of the M model, replacing the GRU layer with a vanilla Recurrent Neural Network (RNN). The reason is that implementing a GRU layer in CMSIS-NN requires additional effort, while the accuracy does not change significantly.
The CMSIS-NN functions require several arguments: input dimension, number of channels and so on. These parameters should be extracted from the neural network and loaded in the microcontroller. Some of them require extra processing, like the shifting of weights and biases in internal operations. A detailed description of these parameters follows.
Suppose that we are in an intermediate layer and that the quantization procedure concludes that the number of integer bits and the number of decimal bits for weights, bias, input and output are , , , , , , , . The neural network operations always include a linear combination of weights and input, so that
In multiplications between fixed-point numbers, the number of decimal digits of the result is given by the sum of the numbers of decimal digits of the two operands. To add the bias to this intermediate value, we need to use the same number of decimals. Thus, we shift the bias to make it match. In most cases it is a left shift. Finally, the output must have a certain number of decimals to get back to a fixed bitwidth format, so we apply a further shift. This last shift is opposite to the previous one and in CMSIS-NN is referred to as right shift. These concepts are expressed in formulas, implemented in the header files by means of a macro:
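The two shifts can be sketched in Python as follows, assuming that the bias shift aligns the bias with the accumulator (decimals of input plus decimals of weights) and that the output shift brings the accumulator back to the decimals of the output. The names `dec_in`, `dec_w`, `dec_bias` and `dec_out` are illustrative, and the emulated neuron only approximates the rounding and saturation of an 8-bit kernel.

```python
def cmsis_shifts(dec_in, dec_w, dec_bias, dec_out):
    """Return (bias_shift, out_shift) from the decimal bits of each operand."""
    dec_acc = dec_in + dec_w  # decimals of the product input * weight
    bias_shift = dec_acc - dec_bias  # left shift applied to the bias
    out_shift = dec_acc - dec_out  # right shift applied to the accumulator
    if bias_shift < 0 or out_shift < 0:
        raise ValueError("negative shifts are not handled in this sketch")
    return bias_shift, out_shift


def fixed_point_neuron(x_q, w_q, bias_q, bias_shift, out_shift):
    """Integer-only emulation of one output: MAC, bias add, shift and saturation."""
    acc = bias_q << bias_shift
    if out_shift > 0:
        acc += 1 << (out_shift - 1)  # rounding offset before the right shift
    for x, w in zip(x_q, w_q):
        acc += x * w
    return max(-128, min(127, acc >> out_shift))  # saturate to 8 bits


if __name__ == "__main__":
    bias_shift, out_shift = cmsis_shifts(dec_in=5, dec_w=7, dec_bias=6, dec_out=4)
    # 1.0 * 0.5 + 0.25 = 0.75, i.e. 12 with 4 decimal bits
    print(fixed_point_neuron([32], [64], 16, bias_shift, out_shift))
```

In the firmware these two values are compile-time constants, which is why they can be precomputed by the export script and stored in the header file.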
Another parameter to bear in mind is the size of the buffers. Since the buffers are instantiated statically in the program, their sizes need to be known in advance. During inference, intermediate values are discarded, and only the final prediction and the network state (in the case of a recurrent neural network) are kept. As a consequence, it is possible to use the same buffers in multiple layers.
Layer  I  C1  C2  C3  C4  C5
Size  6144  23312  10440  4368  720  …
Buffer  A  B  A  B  A  B
Table V: Buffer size required for each layer. Reusing buffers saves memory. The two buffers (A, B) must hold the maximum of the sizes at odd and even indexes, respectively.
Each layer needs a pair of buffers able to contain its input and output. To find the minimum size for these two buffers, we create a vector with the number of elements between layers, , where is the input size. For each layer, the two buffers switch their roles: in the first convolutional layer, the input is in A and the output in B. For the next layer, B is the input, A is the output, and so on. Table V shows the sequence of input and output buffers for each layer in the implemented network. It shows that the two buffer sizes should be selected as the maxima over the odd and even indexes of the vector.
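The buffer sizing can be sketched as below, using the sizes listed in Table V up to C4 (the entries elided in the table are not guessed here, so the result holds only if the remaining layers are smaller). The function name is illustrative; note that Python indexes from zero, so buffer A collects the even positions of the list.

```python
def ping_pong_buffer_sizes(sizes):
    """Return the sizes of buffers A and B that alternate as input and output."""
    buf_a = max(sizes[0::2])  # input, C2, C4, ... are stored in A
    buf_b = max(sizes[1::2], default=0)  # C1, C3, ... are stored in B
    return buf_a, buf_b


if __name__ == "__main__":
    sizes = [6144, 23312, 10440, 4368, 720]  # I, C1, C2, C3, C4 from Table V
    buf_a, buf_b = ping_pong_buffer_sizes(sizes)
    print(buf_a, buf_b, buf_a + buf_b)
```

Two shared buffers of 10440 and 23312 elements replace one dedicated buffer per layer, which would require the sum of all the listed sizes.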
Finally, the CMSIS-NN framework also requires an additional small buffer for intermediate calculations, which can be reused across layers as well.
VI Results
We evaluate the porting of our SED model to the microcontroller in terms of power consumption, execution time and recognition accuracy.
VI-A Accuracy
Input data from the test set are sent by UART to the MicroController Unit (MCU) using a Python script. The forward propagation is computed inside the microcontroller, which returns the prediction results on the same bus for accuracy evaluation. Following the recipe in [35], we used 10-fold cross validation and averaged the scores. For each test fold, we load into the microcontroller the model with the quantization parameters computed on the related training set. Figure 7 depicts the accuracy over the 10 folds. The average accuracy decreases with both quantization schemes, but the drop is limited to 2% overall if compared to the floating-point version (pink). Note that this minor performance drop could be reduced by fine-tuning the quantized network [41].
Looking in more detail, this performance deterioration is strongly dependent on the train/test split of each fold. Figure 8 reports the performance for three folds (train/test configurations): fold 7, fold 10 and fold 2. When testing on fold 7, the performance deterioration is in line with the average accuracy decrease of around 2%. On the other hand, when fold 10 is selected as the test set, the accuracy drop is more substantial (8%). Finally, in fold 2 the impact of the quantization error becomes negligible, or quantization even improves the classification results.
This behaviour is related to the robustness of the original floating-point model. As pointed out by Piczak [27], some classes are more difficult to detect by a convolutional neural network because of their short-scale temporal structure (drilling, engine idling, jackhammer). Whenever the floating-point network correctly classifies these classes, the difference between the likelihoods is probably small, and the errors introduced by quantization can lead to a misclassification, causing a large performance degradation. Similarly, the gap between the floating-point and the quantized network decreases or disappears when the reference network already fails to classify the difficult classes correctly, as performance is already poor.
To confirm this, we compared the accuracy gap due to quantization with the F1 score for these problematic classes. In all three cases, we confirmed that the higher the F1 score, the larger the accuracy drop, following an exponential profile as Figure 9 shows. The figure refers to the class "Jackhammer".
VI-B Execution Time and Power Consumption
In Section III-A, we estimated the execution time of several network architectures by assuming that the number of operations is equivalent to the number of instructions, which allows a comparison between different platforms in terms of the MIPS available from the datasheets. In this section, we evaluate the actual execution time and the power consumption of our implementation of the smallest model M. Measurements are done on the real prototype using the same system used to compute the accuracy. We measure the performance layer by layer to understand how each component contributes to the overall processing time. Table VI reports the evaluation results.
Layer  Output Channels  #kop  Exec. Time [ms]  MOPS
Input  1  -  -  -
Conv1  4  419.61  63.93  6.56
Pool1  x  23.31  10.17  2.29
Conv2  8  751.68  20.13  37.34
Pool2  x  24.84  8.32  2.99
Conv3  16  628.99  12.90  48.76
Pool3  x  11.09  3.63  3.03
Conv4  16  207.36  3.88  53.44
Pool4  x  2.16  0.69  3.15
Conv5  32  27.65  0.60  46.23
Pool5  x  0.58  0.17  3.31
FC1  64  8.19  0.22  38.10
FC2  128  16.38  0.42  38.73
RNN  60  22.56  0.55  40.58
FC3  10  1.20  0.04  33.33
Total  -  2145.61  125.6  17.08
Plain C  -  2145.61  291.4  7.36
The CMSIS-NN framework implements a "basic" and a "fast" version of convolutional layers. The latter uses assembly directives to speed up execution, especially by means of Multiply-ACcumulate (MAC) and SIMD instructions. The only constraint for using this faster implementation is that the number of channels must be a multiple of 4 for 8-bit fixed-point quantization. This is why the number of channels of the convolutional layers of all architectures described in Table I is a multiple of 4. Problems arise in the first layer, whose input is the Mel spectrum, which has just one channel. This does not allow us to use the optimized version of the convolutional layer, and for this reason the first convolutional layer takes more than half of the overall execution time, with a throughput of 6.56 MOPS (see line 2 of Table VI). The second convolutional layer uses the fast version and takes just 20.13 ms (Table VI), even though it is more computationally intensive, reaching 37.34 MOPS. This highlights how important parallelization is, and how much a proper parallelization of the first layer could reduce the overall execution time.
To further stress the importance of an efficient framework for deep neural networks, we used a reference implementation in plain C (without explicit SIMD directives) of the convolutional, max-pooling and fully connected layers. The comparison is in the last two lines of Table VI. The fast implementation of CMSIS-NN yields a 2.32x speedup in total execution time with respect to plain C.
In Section IV, we stated that the combination of x4 parallelization and overhead, due to branches and load/store instructions, makes the microcontroller able to perform an 8-bit operation in just one instruction. The selected platform executes 80 MIPS, thus we expect 80 MOPS, but the measured overall throughput is much lower (17 MOPS on average). Even the peak reached in layer Conv4, 53.44 MOPS, is still far from our estimate of 80 MOPS. This means that parallelization does not fully compensate for the overhead due to load/store and branches, and we need approximately two instructions to execute an operation. The average (17.08 MOPS) is far from this peak level, mainly because of the first layer, which dominates the execution time and is not fully parallelized.
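The figures discussed above can be recomputed directly from Table VI. The sketch below uses only values reported in the table and the 80 MIPS rating quoted in the text; the variable names are illustrative.

```python
# (kilo-operations, execution time in ms) for selected rows of Table VI
LAYERS = {
    "Conv1": (419.61, 63.93),
    "Conv2": (751.68, 20.13),
    "Conv4": (207.36, 3.88),
}
TOTAL_KOP = 2145.61
CMSIS_MS = 125.6
PLAIN_C_MS = 291.4
PLATFORM_MIPS = 80.0


def mops(kop, ms):
    """Throughput in MOPS: kop / ms equals 1e3 operations per 1e-3 seconds."""
    return kop / ms


speedup = PLAIN_C_MS / CMSIS_MS
conv1_share = LAYERS["Conv1"][1] / CMSIS_MS
instr_per_op_peak = PLATFORM_MIPS / mops(*LAYERS["Conv4"])
instr_per_op_avg = PLATFORM_MIPS / mops(TOTAL_KOP, CMSIS_MS)

if __name__ == "__main__":
    print(f"Conv1: {mops(*LAYERS['Conv1']):.2f} MOPS, {conv1_share:.0%} of total time")
    print(f"Speedup over plain C: {speedup:.2f}x")
    print(f"Instructions per operation: peak {instr_per_op_peak:.2f}, average {instr_per_op_avg:.2f}")
```

At the Conv4 peak the ratio is about 1.5 instructions per operation, i.e. between one and two, while the average over the whole network is about 4.7 because of the non-parallelized first layer.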
VII Conclusions
In this work, we described the whole process from a state-of-the-art model for sound event detection to its energy-efficient implementation on a microcontroller, targeting IoT applications. First, we demonstrated that knowledge distillation can be effective also for extreme compression rates, achieving models suitable for real-time applications on IoT nodes. Then, we introduced a two-step distillation to further improve the performance of the student network. Furthermore, we described two quantization strategies, concluding that they perform similarly. Maximization of the SQNR is generally preferred over the probabilistic approach, because it does not require any hyperparameter. Both 8-bit quantization schemes were applied to the smallest distilled model, resulting in a drop of 2 percentage points in accuracy in comparison with the original floating-point version. The final implementation on the microcontroller has a propagation time of 125 ms for each 1-second audio clip, using just 5.5 mW average power and 34.3 kB of RAM. We have shown that an efficient framework for neural networks, like CMSIS-NN, significantly speeds up the execution.
One interesting extension of our work would be to combine the distillation and quantization steps, using the soft labels to train the quantized networks. Finally, in future work we will implement the whole chain on an IoT node, including sensor acquisition, feature extraction and transmission of the classification outcome to the cloud.
References
 [1] (2016) An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678.
 [2] (2019) Neural network distillation on IoT platforms for sound event detection. In Interspeech.
 [3] (2019) Convolutional neural network on embedded platform for people presence detection in low resolution thermal images. In ICASSP, pp. 7610–7614.
 [4] (Sep. 2017) An IoT endpoint system-on-chip for secure and energy-efficient near-sensor analytics. CAS 64 (9), pp. 2481–2494. ISSN 1549-8328.
 [5] (2015) BinaryConnect: training deep neural networks with binary weights during propagations. In NIPS, pp. 3123–3131.
 [6] (2019) Look, listen, and learn more: design choices for deep audio embeddings. In ICASSP, pp. 3852–3856.
 [7] (2013) Predicting parameters in deep learning. In International Conference on Neural Information Processing Systems, pp. 2148–2156.
 [8] (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, pp. 1269–1277.
 [9] (Sep. 2015) Edge-centric computing: vision and challenges. SIGCOMM 45 (5), pp. 37–42. ISSN 0146-4833.
 [10] (Mar. 2017) Audio set: an ontology and human-labeled dataset for audio events. In ICASSP, pp. 776–780.
 [11] (2017) Audio set: an ontology and human-labeled dataset for audio events. In ICASSP, pp. 776–780.
 [12] (1993) Optimal brain surgeon and general network pruning. In IJCNN, pp. 293–299.
 [13] (2017) CNN architectures for large-scale audio classification. In ICASSP, pp. 131–135.
 [14] (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
 [15] (2013) Towards the implementation of IoT for environmental condition monitoring in homes. IEEE Sensors Journal 13 (10), pp. 3846–3853.
 [16] (2019) EdgeL3: compressing L3-Net for mote-scale urban noise monitoring. In IEEE International Parallel and Distributed Processing Symposium Workshops.
 [17] (2018) CMSIS-NN: efficient neural network kernels for Arm Cortex-M CPUs. arXiv preprint arXiv:1801.06601.
 [18] (1990) Advances in neural information processing systems 2. D. S. Touretzky (Ed.), pp. 598–605. ISBN 1558601007.
 [19] (2018) Extremely low bit neural network: squeeze the last bit out with ADMM. In Conference on Artificial Intelligence.
 [20] (2016) Fixed point quantization of deep convolutional networks. In ICML, pp. 2849–2858.
 [21] (Nov. 2016) The DSP capabilities of Arm Cortex-M4 and Cortex-M7 processors. Technical report, ARM.
 [22] (2017) DCASE 2017 challenge setup: tasks, datasets and baseline system. In DCASE.
 [23] (2010) Acoustic event detection in real life recordings. In European Signal Processing Conference, pp. 1267–1271.
 [24] (2016) TUT database for acoustic scene classification and sound event detection. In EUSIPCO.
 [25] (2015) Tensorizing neural networks. In NIPS, pp. 442–450.
 [26] (2018) Ultra low power deep-learning-powered autonomous nano drones. In IEEE IROS.
 [27] (2015) Environmental sound classification with convolutional neural networks. In MLSP, pp. 1–6.
 [28] (2015) ESC: dataset for environmental sound classification. In ACM International Conference on Multimedia, pp. 1015–1018.
 [29] (2017) Plug into a plant: using a plant microbial fuel cell and a wake-up radio for an energy neutral sensing system. In IEEE 42nd LCN Workshops, pp. 18–25.
 [30] (2018) Deep residual learning for small-footprint keyword spotting. In ICASSP, pp. 5484–5488.
 [31] (2016) XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, Vol. 9908, pp. 525–542. ISBN 9783319464923.
 [32] (2014) FitNets: hints for thin deep nets. arXiv preprint arXiv:1412.6550.
 [33] (Oct. 2017) A sub-mW IoT-endnode for always-on visual monitoring and smart triggering. IoTJ 4 (5), pp. 1284–1295. ISSN 2327-4662.
 [34] (2017) Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Processing Letters 24 (3), pp. 279–283.
 [35] (2014) A dataset and taxonomy for urban sound research. In MM, pp. 1041–1044.
 [36] (2016) In teacher we trust: learning compressed models for pedestrian detection. arXiv preprint arXiv:1612.00478.
 [37] (2008) Image and video compression for multimedia engineering: fundamentals, algorithms, and standards. CRC Press.
 [38] (2015) Convolutional neural networks for small-footprint keyword spotting. In INTERSPEECH, pp. 1478–1482.
 [39] (2007) CLEAR evaluation of acoustic event detection and classification systems. In CLEAR, pp. 311–322.
 [40] (2016) Quantized convolutional neural networks for mobile devices. In Conference on Computer Vision and Pattern Recognition, pp. 4820–4828.
 [41] (2019) KCNN: kernel-wise quantization to remarkably decrease multiplications in convolutional neural network. In International Joint Conference on Artificial Intelligence, pp. 4234–4242.
 [42] (2017) Dilated convolution neural network with LeakyReLU for environmental sound classification. In DSP, pp. 1–5.
 [43] (2017) Hello edge: keyword spotting on microcontrollers. arXiv preprint arXiv:1711.07128.
 [44] (2018) Deep convolutional neural network with mixup for environmental sound classification. In PRCV, pp. 356–367.
 [45] (2010) Real-world acoustic event detection. Pattern Recognition Letters 31 (12), pp. 1543–1551.