From Speech Chain to Multimodal Chain: Leveraging Cross-modal Data Augmentation for Semi-supervised Learning

06/03/2019
by Johanes Effendi, et al.

The most common way for humans to communicate is by speech. But a language system arguably cannot know what it is communicating without a connection to the real world through image perception. In fact, humans perceive these multiple sources of information together to build a general concept. However, constructing a machine that can handle these modalities together in a supervised fashion is difficult, because it requires a parallel dataset covering the speech, image, and text modalities all at once, and such data is often unavailable. A machine speech chain based on sequence-to-sequence deep learning was previously proposed to achieve semi-supervised learning, enabling automatic speech recognition (ASR) and text-to-speech synthesis (TTS) to teach each other when they receive unpaired data. In this research, we take a further step by expanding the speech chain into a multimodal chain, and we design a closely knit chain architecture that connects ASR, TTS, image captioning (IC), and image retrieval (IR) models into a single framework. The ASR, TTS, IC, and IR components can then be trained in a semi-supervised fashion: given incomplete datasets, they assist each other by leveraging cross-modal data augmentation within the chain.
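The way the four components can close loops on unpaired data may be easier to see in a small sketch. The code below is only an illustration built from the input and output modalities named in the abstract; the `COMPONENTS` table and the `augmentation_loops` helper are hypothetical names, not part of the authors' implementation. It also assumes, as in the earlier speech chain, that the first model in a loop produces a pseudo-label and the second is trained to reconstruct the original input.

```python
from typing import Dict, List, Tuple

# Input and output modality of each component, as named in the abstract.
COMPONENTS: Dict[str, Tuple[str, str]] = {
    "ASR": ("speech", "text"),
    "TTS": ("text", "speech"),
    "IC": ("image", "text"),
    "IR": ("text", "image"),
}


def augmentation_loops(available: str) -> List[Tuple[str, str]]:
    """Return two-step loops that start and end in the `available` modality.

    The first component turns the unpaired input into a pseudo-label in
    another modality; the second maps that pseudo-label back, so its output
    can be compared with the original input (an assumption carried over
    from the speech chain, not a detail stated in the abstract).
    """
    loops = []
    for first, (src, mid) in COMPONENTS.items():
        if src != available:
            continue
        for second, (src2, dst) in COMPONENTS.items():
            if src2 == mid and dst == available:
                loops.append((first, second))
    return loops


if __name__ == "__main__":
    for modality in ("speech", "text", "image"):
        print(modality, augmentation_loops(modality))
```

Under these assumptions, unpaired text can drive two loops (through speech and through images), while unpaired speech or images each drive one. How the loops are combined and weighted during training is left to the full paper.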


research
11/04/2020

Augmenting Images for ASR and TTS through Single-loop and Dual-loop Multimodal Chain Framework

Previous research has proposed a machine speech chain to enable automati...
research
05/14/2022

Improved Consistency Training for Semi-Supervised Sequence-to-Sequence ASR via Speech Chain Reconstruction and Self-Transcribing

Consistency regularization has recently been applied to semi-supervised ...
research
01/08/2023

SpeeChain: A Speech Toolkit for Large-Scale Machine Speech Chain

This paper introduces SpeeChain, an open-source PyTorch-based toolkit de...
research
11/04/2020

Cross-Lingual Machine Speech Chain for Javanese, Sundanese, Balinese, and Bataks Speech Recognition and Synthesis

Even though over seven hundred ethnic languages are spoken in Indonesia,...
research
08/03/2020

Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech

In this work, we explore a multimodal semi-supervised learning approach ...
research
07/16/2017

Listening while Speaking: Speech Chain by Deep Learning

Despite the close relationship between speech perception and production,...
research
11/04/2020

Incremental Machine Speech Chain Towards Enabling Listening while Speaking in Real-time

Inspired by a human speech chain mechanism, a machine speech chain frame...