Audio-visual representation learning aims to develop systems with human-...
To realize human-robot collaboration, robots need to execute actions for...
We present the first unified study of the efficiency of self-attention-b...
We propose an unsupervised speech-to-speech translation (S2ST) system th...
Recent models such as XLS-R and Whisper have made multilingual speech te...
In this paper, we show that representations capturing syllabic units eme...
We investigate the emergent abilities of the recently proposed web-scale...
Self-supervised learning (SSL) has been able to leverage unlabeled data ...
Automatic speech recognition research focuses on training and evaluating...
We apply transfer learning to the task of phoneme segmentation and demon...
This work investigates the use of large-scale, pre-trained models (CLIP ...
Recent visuolinguistic pre-trained models show promising progress on var...
Multilingual text-video retrieval methods have improved significantly in...
Data-driven speech processing models usually perform well with a large a...
In this paper, we propose a simple yet powerful improvement over the rec...
We present a method for visually-grounded spoken term discovery. After t...
Acoustic sensing has proved effective as a foundation for numerous appli...
In this paper, we describe our submissions to the ZeroSpeech 2021 Challe...
Multi-modal learning from video data has seen increased attention recent...
The task of multimodal learning has seen a growing interest recently as ...
In this paper, we explore self-supervised audio-visual models that learn...
We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-...
Reverberation from audio reflecting off surfaces and objects in the envi...
When people observe events, they are able to abstract key information an...
Multimodal self-supervised learning is getting more and more attention a...
In this paper we present the first model for directly synthesizing fluen...
Current methods for learning visually grounded language from videos ofte...
In this paper, we present a method for learning discrete linguistic unit...
Transfer learning aims to reduce the amount of data required to excel at...
In this paper, we investigate the manner in which interpretable sub-word...
In this paper, we explore the learning of neural network embeddings for ...
In this paper, we explore neural network models that learn to associate ...
In this paper, we explore the unsupervised learning of a semantic embedd...
Given a collection of images and spoken audio captions, we present a met...
In this paper, we present a model which takes as input a corpus of image...