Multi-scale Octave Convolutions for Robust Speech Recognition

10/31/2019
by Joanna Rownicka, et al.

We propose a multi-scale octave convolution layer to learn robust speech representations efficiently. Octave convolutions were introduced by Chen et al. [1] in the computer vision field to reduce the spatial redundancy of feature maps by decomposing the output of a convolutional layer into feature maps at two different spatial resolutions, one octave apart. This approach improved both the efficiency and the accuracy of CNN models. The accuracy gain was attributed to the enlargement of the receptive field in the original input space. We argue that octave convolutions likewise improve the robustness of learned representations, because the average pooling used in the lower-resolution group acts as a low-pass filter. We test this hypothesis by evaluating on two noisy speech corpora: Aurora-4 and AMI. We extend the octave convolution concept to multiple resolution groups and multiple octaves. To evaluate the robustness of the inferred representations, we report the similarity between clean and noisy encodings, using an affine projection loss as a proxy robustness measure. The results show that the proposed method reduces the WER by up to 6.6% while improving the computational efficiency of the CNN acoustic models.
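The two-resolution decomposition described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes 1x1 convolutions (plain channel mixing) in place of full spatial kernels, nearest-neighbour upsampling, and feature maps laid out as (channels, frequency, time). The function names `avg_pool2`, `upsample2`, `pointwise`, and `octave_layer`, and all shapes, are hypothetical and chosen only for illustration.

```python
import numpy as np

def avg_pool2(x):
    """2x2 average pooling with stride 2 on a (channels, freq, time) array.

    Assumes the freq and time dimensions are even.
    """
    c, f, t = x.shape
    return x.reshape(c, f // 2, 2, t // 2, 2).mean(axis=(2, 4))

def upsample2(x):
    """Nearest-neighbour upsampling by a factor of 2 along freq and time."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def pointwise(x, w):
    """1x1 convolution (assumption): w has shape (out_channels, in_channels)."""
    return np.einsum("oc,cft->oft", w, x)

def octave_layer(x_high, x_low, w_hh, w_hl, w_lh, w_ll):
    """One octave layer with a high- and a low-resolution group.

    x_high: (c_high, F, T); x_low: (c_low, F/2, T/2), one octave lower.
    """
    # High-resolution output: high-to-high path plus upsampled low-to-high path.
    y_high = pointwise(x_high, w_hh) + upsample2(pointwise(x_low, w_lh))
    # Low-resolution output: low-to-low path plus pooled high-to-low path.
    y_low = pointwise(x_low, w_ll) + pointwise(avg_pool2(x_high), w_hl)
    return y_high, y_low

rng = np.random.default_rng(0)
x_high = rng.standard_normal((6, 8, 8))   # 6 channels at full resolution
x_low = rng.standard_normal((2, 4, 4))    # 2 channels at half resolution
w_hh = rng.standard_normal((6, 6))
w_hl = rng.standard_normal((2, 6))
w_lh = rng.standard_normal((6, 2))
w_ll = rng.standard_normal((2, 2))

y_high, y_low = octave_layer(x_high, x_low, w_hh, w_hl, w_lh, w_ll)
print(y_high.shape, y_low.shape)

# Low-pass behaviour of average pooling: a checkerboard, the highest
# spatial frequency on the grid, is averaged away entirely.
checker = (np.indices((4, 4)).sum(axis=0) % 2) * 2.0 - 1.0
print(avg_pool2(checker[None]))
```

In this sketch the only path from the high-resolution group into the low-resolution group goes through `avg_pool2`, which is where the low-pass effect argued for in the abstract would enter.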

