A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning

02/17/2022
by   Hengshun Zhou, et al.
2

Audio-only-based wake word spotting (WWS) is challenging under noisy conditions due to environmental interference in signal transmission. In this paper, we investigate on designing a compact audio-visual WWS system by utilizing visual information to alleviate the degradation. Specifically, in order to use visual information, we first encode the detected lips to fixed-size vectors with MobileNet and concatenate them with acoustic features followed by the fusion network for WWS. However, the audio-visual model based on neural networks requires a large footprint and a high computational complexity. To meet the application requirements, we introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF), to the single-modal and multi-modal models, respectively. Tested on our in-house corpus for audio-visual WWS in a home TV scene, the proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions. Moreover, LTH-IF pruning can largely reduce the network parameters and computations with no degradation of WWS performance, leading to a potential product solution for the TV wake-up scenario.

READ FULL TEXT
research
03/11/2023

The Multimodal Information based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition

The Multi-modal Information based Speech Processing (MISP) challenge aim...
research
07/28/2021

Squeeze-Excitation Convolutional Recurrent Neural Networks for Audio-Visual Scene Classification

The use of multiple and semantically correlated sources can provide comp...
research
06/14/2019

Towards Compact and Robust Deep Neural Networks

Deep neural networks have achieved impressive performance in many applic...
research
12/15/2022

MAViL: Masked Audio-Video Learners

We present Masked Audio-Video Learners (MAViL) to train audio-visual rep...
research
09/10/2021

Large-vocabulary Audio-visual Speech Recognition in Noisy Environments

Audio-visual speech recognition (AVSR) can effectively and significantly...
research
09/02/2020

Seeing wake words: Audio-visual Keyword Spotting

The goal of this work is to automatically determine whether and when a w...
research
07/11/2020

Look and Listen: A Multi-modality Late Fusion Approach to Scene Classification for Autonomous Machines

The novelty of this study consists in a multi-modality approach to scene...

Please sign up or login with your details

Forgot password? Click here to reset