Understanding Audio Features via Trainable Basis Functions

04/25/2022
by   Kwan Yee Heung, et al.

In this paper we explore the possibility of maximizing the information represented in spectrograms by making the spectrogram basis functions trainable. We experiment with two tasks: keyword spotting (KWS) and automatic speech recognition (ASR). For most neural network models, the architecture and hyperparameters are fine-tuned and optimized in experiments; input features, however, are often treated as fixed. In the case of audio, signals are mainly expressed in one of two ways: as raw waveforms (time domain) or as spectrograms (time-frequency domain). In addition, different spectrogram types are often tailored to fit different applications. In our experiments, we allow this tailoring to happen directly as part of the network. Our experimental results show that using trainable basis functions can boost KWS accuracy by 14.2 percentage points and lower the phone error rate (PER) by 9.5 percentage points. Although models using trainable basis functions become less effective as model complexity increases, the trained filter shapes can still provide insight into which frequency bins are important for a specific task. From our experiments, we conclude that trainable basis functions are a useful tool for boosting performance when model complexity is limited.
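To make the idea concrete, here is a minimal sketch (not the authors' implementation, and all names are illustrative) of a spectrogram front-end whose basis functions are ordinary parameters. The basis matrices are initialized to the Fourier basis, so before any training the layer reproduces a standard magnitude spectrogram; training would then be free to reshape the filters for the downstream task.

```python
import numpy as np

def fourier_basis(n_fft):
    """Real and imaginary DFT basis vectors, each of shape (n_fft//2 + 1, n_fft)."""
    k = np.arange(n_fft // 2 + 1)[:, None]  # frequency bins
    n = np.arange(n_fft)[None, :]           # time samples within a frame
    angle = 2 * np.pi * k * n / n_fft
    # rfft(x)[k] = sum_n x[n] * (cos(angle) - i*sin(angle))
    return np.cos(angle), -np.sin(angle)

class TrainableSpectrogram:
    """Spectrogram layer whose basis matrices would be trained by backprop
    in a real framework; here they are plain NumPy arrays for illustration."""

    def __init__(self, n_fft=512, hop=128):
        self.n_fft, self.hop = n_fft, hop
        # These two matrices are the trainable parameters,
        # initialized to the Fourier basis.
        self.real, self.imag = fourier_basis(n_fft)
        self.window = np.hanning(n_fft)

    def __call__(self, x):
        # Frame the signal, window each frame, and project onto the bases.
        frames = np.lib.stride_tricks.sliding_window_view(x, self.n_fft)[::self.hop]
        frames = frames * self.window
        re = frames @ self.real.T
        im = frames @ self.imag.T
        return np.sqrt(re**2 + im**2)  # magnitude spectrogram, (frames, bins)

# At initialization the output matches the FFT-based magnitude spectrogram:
x = np.random.default_rng(0).standard_normal(2048)
S = TrainableSpectrogram(n_fft=512, hop=128)(x)
frame0 = x[:512] * np.hanning(512)
print(np.allclose(S[0], np.abs(np.fft.rfft(frame0))))
```

Because the projection is just a matrix multiply, swapping the fixed FFT for these parameter matrices adds no architectural complexity; the only change is that gradients flow into the basis, which is what lets the trained filter shapes reveal task-relevant frequency bins.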

