Hybrid Transformer/CTC Networks for Hardware Efficient Voice Triggering

08/05/2020
by   Saurabh Adya, et al.

We consider the design of two-pass voice trigger detection systems. We focus on the networks in the second pass, which re-score candidate segments obtained from the first pass. Our baseline is an acoustic model (AM) with BiLSTM layers, trained by minimizing the CTC loss. We replace the BiLSTM layers with self-attention layers. Results on internal evaluation sets show that self-attention networks yield better accuracy while requiring fewer parameters. We then add an auto-regressive decoder network on top of the self-attention layers and jointly minimize the CTC loss on the encoder and the cross-entropy loss on the decoder. This design yields further improvements over the baseline. We retrain all of the models above in a multi-task learning (MTL) setting, where one branch of a shared network is trained as an AM while the second branch classifies the whole sequence as a true trigger or not. Results demonstrate that networks with self-attention layers yield a ∼60% relative reduction in false reject rates for a given false-alarm rate, while requiring 10% fewer parameters. When trained in the MTL setup, self-attention networks yield further accuracy improvements. On-device measurements show a 70% relative reduction in inference time. Additionally, the proposed network architectures are ∼5X faster to train.
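
The joint objective described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the interpolation weight `ctc_weight`, and the choice of a simple weighted sum are all assumptions made for illustration, since the abstract does not say how the two losses are combined. The CTC term is computed with the standard forward algorithm over per-frame encoder log-probabilities, and the cross-entropy term is averaged over decoder steps.

```python
import math


def log_sum_exp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf entries."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """CTC loss for one utterance via the forward algorithm.

    log_probs: list of T frames, each a list of V log-probabilities.
    target:    non-empty list of label ids (none equal to `blank`).
    """
    # Interleave blanks: [b, y1, b, y2, ..., b]
    ext = [blank]
    for label in target:
        ext += [label, blank]
    size = len(ext)

    alpha = [-math.inf] * size
    alpha[0] = log_probs[0][ext[0]]
    alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [-math.inf] * size
        for s in range(size):
            candidates = [alpha[s]]
            if s >= 1:
                candidates.append(alpha[s - 1])
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[s - 2])
            new[s] = log_sum_exp(*candidates) + log_probs[t][ext[s]]
        alpha = new

    # Valid paths end on the last label or on the trailing blank.
    return -log_sum_exp(alpha[-1], alpha[-2])


def cross_entropy(log_probs, target):
    """Mean negative log-probability of the target label at each decoder step."""
    return -sum(step[label] for step, label in zip(log_probs, target)) / len(target)


def joint_loss(enc_log_probs, dec_log_probs, target, ctc_weight=0.5):
    """Hypothetical weighted sum of encoder CTC loss and decoder cross-entropy.

    `ctc_weight` is an illustrative assumption, not a value from the paper.
    """
    ctc = ctc_neg_log_likelihood(enc_log_probs, target)
    ce = cross_entropy(dec_log_probs, target)
    return ctc_weight * ctc + (1.0 - ctc_weight) * ce
```

In a real system both terms would be computed by a deep-learning framework and back-propagated through the shared self-attention encoder; the sketch only shows how the two terms relate to the same target sequence.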
