WakeUpNet: A Mobile-Transformer based Framework for End-to-End Streaming Voice Trigger

10/06/2022
by   Zixing Zhang, et al.

End-to-end models have gradually become the mainstream technical approach for voice trigger, aiming to achieve the highest possible prediction accuracy within a small footprint. In the present paper, we propose an end-to-end voice trigger framework, namely WakeupNet, which is built on a Transformer encoder. The purpose of this framework is to exploit the context-capturing capability of the Transformer, as sequential information is vital for wakeup-word detection. However, the conventional Transformer encoder is too large to fit our task. To address this issue, we introduce different model compression approaches to shrink the vanilla encoder into a tiny one, called mobile-Transformer. To evaluate the performance of mobile-Transformer, we conduct extensive experiments on HiMia, a large publicly available dataset. The obtained results indicate that the introduced mobile-Transformer significantly outperforms other commonly used models for voice trigger in both clean and noisy scenarios.
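The abstract does not give the mobile-Transformer's exact architecture, but the idea of a down-sized, streaming-friendly Transformer encoder for wakeup-word scoring can be illustrated with a minimal sketch. The block below is a hypothetical single-head encoder layer in NumPy with small dimensions (d_model=32, d_ff=64) and a causal attention mask so no future frames are needed; the classifier head, layer sizes, and pooling are illustrative assumptions, not the authors' WakeupNet.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

class TinyEncoderBlock:
    """Single-head Transformer encoder block, sized down for a small footprint.
    A causal mask keeps attention streaming-friendly (no look-ahead)."""
    def __init__(self, d_model=32, d_ff=64, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(d_model)
        self.Wq = rng.normal(0.0, s, (d_model, d_model))
        self.Wk = rng.normal(0.0, s, (d_model, d_model))
        self.Wv = rng.normal(0.0, s, (d_model, d_model))
        self.Wo = rng.normal(0.0, s, (d_model, d_model))
        self.W1 = rng.normal(0.0, s, (d_model, d_ff))
        self.W2 = rng.normal(0.0, 1.0 / np.sqrt(d_ff), (d_ff, d_model))
        self.d_model = d_model

    def __call__(self, x):
        T = x.shape[0]
        q, k, v = x @ self.Wq, x @ self.Wk, x @ self.Wv
        scores = q @ k.T / np.sqrt(self.d_model)
        mask = np.triu(np.full((T, T), -1e9), k=1)   # causal: no future frames
        attn = softmax(scores + mask) @ v
        x = layer_norm(x + attn @ self.Wo)           # residual + norm
        ff = np.maximum(x @ self.W1, 0.0) @ self.W2  # ReLU feed-forward
        return layer_norm(x + ff)

def trigger_score(frames, block):
    """Mean-pool encoder output and squash to a scalar wakeup probability
    (placeholder linear head; a real system would train this)."""
    h = block(frames).mean(axis=0)
    w = np.ones(block.d_model) / block.d_model
    return 1.0 / (1.0 + np.exp(-(h @ w)))

frames = np.random.default_rng(1).normal(size=(20, 32))  # 20 acoustic feature frames
p = trigger_score(frames, TinyEncoderBlock())
print(0.0 < p < 1.0)  # → True: a valid probability of wakeup-word presence
```

In a streaming deployment the causal mask matters: each new frame can attend only to past frames, so the encoder can run incrementally as audio arrives instead of waiting for the full utterance.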


research
06/15/2022

Streaming non-autoregressive model for any-to-many voice conversion

Voice conversion models have developed for decades, and current mainstre...
research
06/26/2022

On Comparison of Encoders for Attention based End to End Speech Recognition in Standalone and Rescoring Mode

The streaming automatic speech recognition (ASR) models are more popular...
research
10/22/2020

Developing Real-time Streaming Transformer Transducer for Speech Recognition on Large-scale Dataset

Recently, Transformer based end-to-end models have achieved great succes...
research
03/29/2022

Shifted Chunk Encoder for Transformer Based Streaming End-to-End ASR

Currently, there are mainly three Transformer encoder based streaming En...
research
08/06/2021

An Empirical Study on End-to-End Singing Voice Synthesis with Encoder-Decoder Architectures

With the rapid development of neural network architectures and speech pr...
research
08/08/2020

Stacked 1D convolutional networks for end-to-end small footprint voice trigger detection

We propose a stacked 1D convolutional neural network (S1DCNN) for end-to...
research
06/29/2021

N-Singer: A Non-Autoregressive Korean Singing Voice Synthesis System for Pronunciation Enhancement

Recently, end-to-end Korean singing voice systems have been designed to ...
