SpecTNT: a Time-Frequency Transformer for Music Audio

10/18/2021
by   Wei-Tsung Lu, et al.

Transformers have drawn attention in the music information retrieval (MIR) field for their remarkable performance in natural language processing and computer vision. However, prior works in the audio processing domain mostly use the Transformer as a temporal feature aggregator that acts similarly to RNNs. In this paper, we propose SpecTNT, a Transformer-based architecture to model both spectral and temporal sequences of an input time-frequency representation. Specifically, we introduce a novel variant of the Transformer-in-Transformer (TNT) architecture. In each SpecTNT block, a spectral Transformer extracts frequency-related features into the frequency class token (FCT) for each frame. The FCTs are then linearly projected and added to the temporal embeddings (TEs), which aggregate useful information from the FCTs. Next, a temporal Transformer processes the TEs to exchange information across the time axis. By stacking the SpecTNT blocks, we build the SpecTNT model to learn the representation for music signals. In experiments, SpecTNT demonstrates state-of-the-art performance in music tagging and vocal melody extraction, and shows competitive performance for chord recognition. The effectiveness of SpecTNT and of other design choices is further examined through ablation studies.
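The data flow inside one SpecTNT block, as described in the abstract, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes single-head self-attention with a residual connection and no layer normalization, feed-forward layers, or positional encoding. The names `self_attention`, `spectnt_block`, and `init_params`, the convention that the FCT sits at index 0 of each frame's token sequence, and all dimensions and random weights are hypothetical choices made for the example.

```python
import numpy as np


def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention with a residual.

    Attention runs over the second-to-last axis of x; any leading axes
    are treated as independent batches. (Simplified: no LayerNorm or FFN.)
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return x + weights @ v


def spectnt_block(spec_tokens, te, params):
    """One simplified SpecTNT-style block (an assumption-laden sketch).

    spec_tokens: (T, F + 1, d_s), per-frame tokens; index 0 is assumed
                 to hold the frequency class token (FCT).
    te:          (T, d_t), temporal embeddings, one per frame.
    """
    # 1. Spectral Transformer: attention along the frequency axis,
    #    applied to every frame independently.
    spec_out = self_attention(spec_tokens, *params["spectral"])
    # 2. Linearly project each frame's FCT and add it to that frame's TE.
    fct = spec_out[:, 0, :]
    te = te + fct @ params["proj"]
    # 3. Temporal Transformer: attention along the time axis over the TEs.
    te = self_attention(te, *params["temporal"])
    return spec_out, te


def init_params(rng, d_s, d_t):
    """Random weights for illustration only (hypothetical initialization)."""
    def mats(d):
        return tuple(rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
    return {
        "spectral": mats(d_s),
        "temporal": mats(d_t),
        "proj": rng.normal(scale=d_s ** -0.5, size=(d_s, d_t)),
    }


# Usage: 4 frames, 8 frequency tokens plus 1 FCT per frame.
T, F, d_s, d_t = 4, 8, 16, 32
rng = np.random.default_rng(0)
params = init_params(rng, d_s, d_t)
spec_tokens = rng.normal(size=(T, F + 1, d_s))
te = rng.normal(size=(T, d_t))
spec_out, te_out = spectnt_block(spec_tokens, te, params)
print(spec_out.shape, te_out.shape)  # (4, 9, 16) (4, 32)
```

In this sketch the spectral stage never mixes information between frames; frames interact only in the temporal stage, through the TEs that have absorbed the projected FCTs. Stacking the block would mean feeding `spec_out` and `te_out` into the next block.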

research
05/29/2022

Modeling Beats and Downbeats with a Time-Frequency Transformer

Transformer is a successful deep neural network (DNN) architecture that ...
research
07/10/2023

Automatic Piano Transcription with Hierarchical Frequency-Time Transformer

Taking long-term spectral and temporal dependencies into account is esse...
research
11/03/2018

Time-Frequency Audio Features for Speech-Music Classification

Distinct striation patterns are observed in the spectrograms of speech a...
research
05/01/2021

Audio Transformers:Transformer Architectures For Large Scale Audio Understanding. Adieu Convolutions

Over the past two decades, CNN architectures have produced compelling mo...
research
09/21/2021

Audiomer: A Convolutional Transformer for Keyword Spotting

Transformers have seen an unprecedented rise in Natural Language Process...
research
03/19/2023

Multiscale Audio Spectrogram Transformer for Efficient Audio Classification

Audio event has a hierarchical architecture in both time and frequency a...
research
02/02/2022

TONet: Tone-Octave Network for Singing Melody Extraction from Polyphonic Music

Singing melody extraction is an important problem in the field of music ...
