Acoustic word embeddings are typically created by training a pooling fun...
English is the most widely spoken language in the world, used daily by
m...
Given the strong results of self-supervised models on various tasks, the...
Human speech data comprises a rich set of domain factors such as accent,...
Word segmentation, the problem of finding word boundaries in speech, is ...
Speech-based image retrieval has been studied as a proxy for joint
repre...
Visual context has been shown to be useful for automatic speech recognit...
Multimodal automatic speech recognition systems integrate information fr...
Speech is understood better by using visual context; for this reason, th...
In Neural Machine Translation (NMT) the usage of subwords and characters...
Multimodal learning allows us to leverage information from multiple sour...
A vast amount of audio-visual data is available on the Internet thanks t...
Humans are capable of processing speech by making use of multiple sensor...
In this paper, we introduce How2, a multimodal collection of instruction...
In Automatic Speech Recognition, it is still challenging to learn useful...
Transcription or sub-titling of open-domain videos is still a challengin...
Techniques for multi-lingual and cross-lingual speech recognition can he...
This paper proposes a novel approach to create an unit set for CTC based...
Connectionist Temporal Classification has recently attracted a lot of
in...
Speech is one of the most effective ways of communication among humans. ...