DeepSpectrumLite: A Power-Efficient Transfer Learning Framework for Embedded Speech and Audio Processing from Decentralised Data

by Shahin Amiriparian, et al.

Deep neural speech and audio processing systems have a large number of trainable parameters, a relatively complex architecture, and require a vast amount of training data and computational power. These constraints make it challenging to integrate such systems into embedded devices and to utilise them for real-time, real-world applications. We tackle these limitations by introducing DeepSpectrumLite, an open-source, lightweight transfer learning framework for on-device speech and audio recognition using pre-trained image convolutional neural networks (CNNs). The framework creates and augments Mel-spectrogram plots on-the-fly from raw audio signals, which are then used to fine-tune specific pre-trained CNNs for the target classification task. Subsequently, the whole pipeline can be run in real time, with a mean inference lag of 242.0 ms when a DenseNet121 model is used on a consumer-grade Motorola moto e7 plus smartphone. DeepSpectrumLite operates in a decentralised manner, eliminating the need to upload data for further processing. By obtaining state-of-the-art results on a set of paralinguistics tasks, we demonstrate the suitability of the proposed transfer learning approach for embedded audio signal processing, even when data is scarce. We provide an extensive command-line interface for users and developers, which is comprehensively documented and publicly available at
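The core preprocessing step described above, turning a raw audio signal into a log-Mel-spectrogram image suitable for a pre-trained image CNN, can be sketched as follows. This is a minimal NumPy-only illustration, not the framework's actual implementation (DeepSpectrumLite builds its spectrograms inside a TensorFlow graph); the function names, frame parameters, and 64-band Mel resolution here are illustrative assumptions.

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular Mel filter bank (HTK-style Mel scale), shape (n_mels, n_fft//2 + 1)."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):       # rising slope of the triangle
            if centre > left:
                fb[i - 1, k] = (k - left) / (centre - left)
        for k in range(centre, right):      # falling slope of the triangle
            if right > centre:
                fb[i - 1, k] = (right - k) / (right - centre)
    return fb

def log_mel_spectrogram(audio, sr=16000, n_fft=512, hop=160, n_mels=64):
    """Frame the signal, apply a Hann window, take the power STFT,
    project onto the Mel filter bank, and compress with a log."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel + 1e-10)  # shape: (n_frames, n_mels)

# One second of 16 kHz audio yields a (97, 64) log-Mel matrix, which can be
# rendered as an RGB plot and passed to a pre-trained image CNN such as DenseNet121.
spec = log_mel_spectrogram(np.random.randn(16000))
```

In the paper's pipeline, the resulting spectrogram plot is treated as an ordinary image, so the DenseNet121 weights pre-trained on ImageNet can be fine-tuned directly on it.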



