MossFormer: Pushing the Performance Limit of Monaural Speech Separation using Gated Single-Head Transformer with Convolution-Augmented Joint Self-Attentions

02/23/2023
by Shengkui Zhao, et al.

Transformer-based models have provided significant performance improvements in monaural speech separation. However, there is still a performance gap compared to a recently proposed upper bound. The major limitation of current dual-path Transformer models is the inefficient modelling of long-range elemental interactions and local feature patterns. In this work, we achieve the upper bound by proposing a gated single-head transformer architecture with convolution-augmented joint self-attentions, named MossFormer (Monaural speech separation TransFormer). To effectively solve the indirect elemental interactions across chunks in the dual-path architecture, MossFormer employs a joint local and global self-attention architecture that simultaneously performs a full-computation self-attention on local chunks and a linearised low-cost self-attention over the full sequence. The joint attention enables MossFormer to model full-sequence elemental interactions directly. In addition, we employ a powerful attentive gating mechanism with simplified single-head self-attentions. Besides the attentive long-range modelling, we also augment MossFormer with convolutions for position-wise local pattern modelling. As a consequence, MossFormer significantly outperforms previous models and achieves state-of-the-art results on the WSJ0-2/3mix and WHAM!/WHAMR! benchmarks. Our model reaches the SI-SDRi upper bound of 21.2 dB on WSJ0-3mix and is only 0.3 dB below the upper bound of 23.1 dB on WSJ0-2mix.
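To make the joint local/global attention idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: it combines exact softmax attention within fixed-size chunks with a simple ReLU-kernel linearised attention over the full sequence, adds a depthwise convolution for position-wise local patterns, and applies a sigmoid gate before the output projection. The module and parameter names (JointLocalGlobalAttention, chunk_size, to_gate) and the single-head, single-gate structure are illustrative assumptions, not the paper's exact MossFormer block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointLocalGlobalAttention(nn.Module):
    """Illustrative sketch (not the official MossFormer code):
    full self-attention within local chunks + linearised attention
    over the whole sequence, convolution-augmented and gated."""

    def __init__(self, dim, chunk_size=256):
        super().__init__()
        self.chunk_size = chunk_size
        # single-head projections (hypothetical layer names)
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.to_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        # depthwise convolution for position-wise local pattern modelling
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (batch, seq_len, dim)
        b, n, d = x.shape
        # convolution-augmented features
        x = x + self.conv(x.transpose(1, 2)).transpose(1, 2)
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)

        # local attention: exact softmax attention within each chunk
        pad = (-n) % self.chunk_size

        def split(t):
            t = F.pad(t, (0, 0, 0, pad))
            return t.reshape(b, -1, self.chunk_size, d)

        ql, kl, vl = map(split, (q, k, v))
        local = F.softmax(ql @ kl.transpose(-2, -1) / d ** 0.5, dim=-1) @ vl
        local = local.reshape(b, -1, d)[:, :n]

        # global attention: linearised (kernelised) over the full sequence
        qg, kg = F.relu(q), F.relu(k)           # simple positive feature map
        kv = torch.einsum('bnd,bne->bde', kg, v)
        z = 1.0 / (torch.einsum('bnd,bd->bn', qg, kg.sum(dim=1)) + 1e-6)
        global_ = torch.einsum('bnd,bde,bn->bne', qg, kv, z)

        # attentive gating and output projection
        out = self.to_gate(x) * (local + global_)
        return self.to_out(out)


if __name__ == "__main__":
    layer = JointLocalGlobalAttention(dim=64, chunk_size=16)
    y = layer(torch.randn(2, 100, 64))
    print(y.shape)  # torch.Size([2, 100, 64])
```

The key point of the sketch is that, unlike dual-path models where cross-chunk interactions are only reached indirectly through an inter-chunk pass, the linearised global branch lets every position attend over the entire sequence at linear cost, while the chunked branch keeps exact attention for nearby positions.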


