A study of latent monotonic attention variants

03/30/2021
by Albert Zeyer et al.

End-to-end models reach state-of-the-art performance for speech recognition, but global soft attention is not monotonic, which can lead to convergence problems, instability, and poor generalisation; it also cannot be used for online streaming and is computationally inefficient. Monotonicity can potentially fix all of this. There are several ad-hoc solutions and heuristics to introduce monotonicity, but a principled treatment is rarely found in the literature so far. In this paper, we present a mathematically clean way to introduce monotonicity by adding a new latent variable that represents the audio position or segment boundaries. We compare several monotonic latent models against our global soft attention baseline: a hard attention model, a local windowed soft attention model, and a segmental soft attention model. We show that our monotonic models perform as well as the global soft attention model. We perform our experiments on Switchboard 300h, carefully outline the details of our training, and release our code and configs.
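To make the windowed variant concrete, below is a minimal sketch (not the authors' released code) of local windowed soft attention steered by a monotonically advancing latent position over the encoder frames. The function name, the window size, and the greedy forward choice of the position are illustrative assumptions; the paper treats the position as a proper latent variable of the model rather than picking it greedily.

```python
# Minimal sketch of local windowed soft attention with a monotonic latent
# position (illustrative only; not the authors' released implementation).
import numpy as np

def local_windowed_attention(enc, query, prev_pos, window_size=5):
    """One decoder step of windowed attention.

    enc:      (T, D) encoder states
    query:    (D,)   decoder query vector for this output step
    prev_pos: int    latent position chosen at the previous step
    Returns the attention context and the new (monotonic) position.
    """
    T, D = enc.shape
    scores = enc @ query / np.sqrt(D)  # (T,) scaled dot-product scores

    # Monotonicity: the new latent position may only move forward.
    # Chosen greedily here for illustration; the paper instead treats
    # it as a latent variable of the model.
    pos = prev_pos + int(np.argmax(scores[prev_pos:]))

    # Soft attention restricted to a fixed window around the position.
    lo, hi = max(0, pos - window_size), min(T, pos + window_size + 1)
    w = np.exp(scores[lo:hi] - scores[lo:hi].max())
    w /= w.sum()                       # softmax over the window only
    context = w @ enc[lo:hi]           # (D,) weighted sum of encoder states

    return context, pos

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    enc = rng.standard_normal((100, 64))   # 100 frames, dimension 64
    pos = 0
    for _ in range(5):                     # a few decoder steps
        query = rng.standard_normal(64)
        ctx, pos = local_windowed_attention(enc, query, pos)
        print("latent position:", pos)
```

Because the position can only move forward, attention never revisits earlier frames, which is what makes this kind of model usable for online streaming.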

Related research

- 10/26/2022 · Monotonic segmental attention for automatic speech recognition
  "We introduce a novel segmental-attention model for automatic speech reco..."

- 12/14/2017 · Monotonic Chunkwise Attention
  "Sequence-to-sequence models with soft attention have been successfully a..."

- 08/30/2019 · Initial investigation of an encoder-decoder end-to-end TTS framework using marginalization of monotonic hard latent alignments
  "End-to-end text-to-speech (TTS) synthesis is a method that directly conv..."

- 08/29/2018 · Hard Non-Monotonic Attention for Character-Level Transduction
  "Character-level string-to-string transduction is an important component ..."

- 07/25/2023 · On the Learning Dynamics of Attention Networks
  "Attention models are typically learned by optimizing one of three standa..."

- 11/04/2016 · Morphological Inflection Generation with Hard Monotonic Attention
  "We present a neural model for morphological inflection generation which ..."

- 05/21/2020 · Pitchtron: Towards audiobook generation from ordinary people's voices
  "In this paper, we explore prosody transfer for audiobook generation unde..."
