Audio-visual representation learning aims to develop systems with human-...
In this paper, we show that representations capturing syllabic units eme...
Speech processing Universal PERformance Benchmark (SUPERB) is a leaderbo...
Automatic speech recognition research focuses on training and evaluating...
End-to-end multilingual ASR has become more appealing because of several...
Self-supervised learning via masked prediction pre-training (MPPT) has s...
Recent self-supervised learning (SSL) models have proven to learn rich r...
We present the SUPERB challenge at SLT 2022, which aims at learning self...
Finding word boundaries in continuous speech is challenging as there is ...
State-of-the-art encoder-decoder models (e.g. for machine translation (M...
Although supervised deep learning has revolutionized speech and audio pr...
This paper investigates self-supervised pre-training for audio-visual sp...
We describe a method to jointly pre-train speech and text in an encoder-...
We consider two federated learning algorithms for training partially per...
We introduce dGSLM, the first "textless" model able to generate audio sa...
Transfer learning has proven to be crucial in advancing the state of spe...
Spoken Question Answering (SQA) is to find the answer from a spoken docu...
Textless spoken language processing research aims to extend the applicab...
Object detection is a challenging and popular computer vision problem. T...
Audio-based automatic speech recognition (ASR) degrades significantly in...
Video recordings of speech contain correlated audio and visual informati...
Speech emotion conversion is the task of modifying the perceived emotion...
With 4.5 million hours of English speech from 10 different sources acros...
Speech pre-training has primarily demonstrated efficacy on classificatio...
In this paper, we introduce the Kaizen framework that uses a continuousl...
Self-supervised approaches for speech representation learning are challe...
Self-supervised learning (SSL) has proven vital for advancing research i...
We propose using self-supervised discrete representations for the task o...
Pseudo-labeling is the most adopted method for pre-training automatic sp...
This paper presents XLSR which learns cross-lingual speech representatio...
We show for the first time that learning powerful representations from s...
Many semi- and weakly-supervised approaches have been investigated for o...
We introduce a new collection of spoken English audio suitable for train...
We present pre-training approaches for self-supervised representation le...
Inspired by modular software design principles of independence, intercha...
We present BART, a denoising autoencoder for pretraining sequence-to-seq...
Supervised ASR models have reached unprecedented levels of accuracy, tha...
We propose and evaluate transformer-based acoustic models (AMs) for hybr...
The recent success of transformer networks for neural machine translatio...
Segmental structure is a common pattern in many types of sequences such ...
Yes, they do. This paper provides the first empirical demonstration that...