SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set

07/26/2017
by William Havard, et al.

This paper presents an augmentation of the MSCOCO dataset in which speech is added to the existing image and text modalities. Spoken captions are generated using text-to-speech (TTS) synthesis, resulting in 616,767 spoken captions (more than 600 hours) paired with images. Disfluencies and speed perturbation are added to the signal to make it sound more natural. Each speech signal (WAV) is paired with a JSON file containing the exact timecode of each word, syllable, and phoneme in the spoken caption. Such a corpus could be used for Language and Vision (LaVi) tasks that take speech, rather than text, as input or output. The corpus also makes it possible to investigate multimodal learning schemes for unsupervised speech pattern discovery, as demonstrated by a preliminary study conducted on a subset of the corpus (10 hours, 10k spoken captions).
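
To illustrate how the per-word timecodes could be used, the following minimal sketch converts word boundaries into sample offsets so that a word can be sliced out of its WAV signal. The abstract does not specify the JSON schema, so the field names (`words`, `value`, `begin_ms`, `end_ms`), the example caption, and the 16 kHz sample rate are all hypothetical; the actual SPEECH-COCO files may be organized differently.

```python
import json

# Hypothetical timecode record: the field names and values below are
# illustrative assumptions, not the actual SPEECH-COCO schema.
EXAMPLE_JSON = """
{
  "wav": "example.wav",
  "words": [
    {"value": "a", "begin_ms": 0, "end_ms": 120},
    {"value": "dog", "begin_ms": 120, "end_ms": 480},
    {"value": "runs", "begin_ms": 520, "end_ms": 900}
  ]
}
"""


def word_sample_spans(timecode_json, sample_rate):
    """Return (word, start_sample, end_sample) for each word in the record."""
    record = json.loads(timecode_json)
    spans = []
    for unit in record["words"]:
        # Convert millisecond timecodes to integer sample indices.
        start = unit["begin_ms"] * sample_rate // 1000
        end = unit["end_ms"] * sample_rate // 1000
        spans.append((unit["value"], start, end))
    return spans


# The 16 kHz sample rate is an assumption made for this example.
spans = word_sample_spans(EXAMPLE_JSON, sample_rate=16000)
for word, start, end in spans:
    print(word, start, end)
```

Given a waveform loaded as an array, `signal[start:end]` would then isolate one word. The same conversion would apply to syllable-level or phoneme-level entries if they follow a comparable layout.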


