Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech

08/03/2020
by Monica Sunkara, et al.

In this work, we explore a multimodal semi-supervised learning approach for punctuation prediction by learning representations from large amounts of unlabelled audio and text data. Conventional approaches in speech processing typically use forced alignment to encode per-frame acoustic features into word-level features and perform multimodal fusion of the resulting acoustic and lexical representations. As an alternative, we explore attention-based multimodal fusion and compare its performance with forced-alignment-based fusion. Experiments conducted on the Fisher corpus show that our proposed approach achieves a 6-9% improvement over the baseline BLSTM model on reference transcripts and ASR outputs. We further improve the model's robustness to ASR errors by performing data augmentation with N-best lists, which achieves up to an additional 2-6% improvement on ASR outputs. We also demonstrate the effectiveness of the semi-supervised learning approach through an ablation study on various sizes of the corpus. When trained on only 1 hour of speech and text data, the proposed model achieves a 9-18% improvement over the baseline model.
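
The attention-based fusion described above can be pictured as word-level lexical representations attending over the full sequence of acoustic frames, which removes the need for a forced word-to-frame alignment. Below is a minimal, hypothetical PyTorch sketch of such a fusion layer; the layer sizes, the BLSTM lexical encoder, and all names (AttentionFusionPunctuator, acoustic_dim, num_classes, etc.) are illustrative assumptions, not the authors' exact architecture.

```python
# Hypothetical sketch of attention-based acoustic-lexical fusion for
# punctuation prediction. Dimensions and layer choices are assumptions.
import torch
import torch.nn as nn

class AttentionFusionPunctuator(nn.Module):
    def __init__(self, vocab_size, acoustic_dim=80, hidden_dim=256,
                 num_heads=4, num_classes=4):  # e.g. none/comma/period/question
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_dim)
        # BLSTM lexical encoder; output dim stays hidden_dim (2 * hidden_dim // 2).
        self.lexical_enc = nn.LSTM(hidden_dim, hidden_dim // 2,
                                   batch_first=True, bidirectional=True)
        self.acoustic_proj = nn.Linear(acoustic_dim, hidden_dim)
        # Each word-level lexical state attends over all acoustic frames,
        # so no forced word-to-frame alignment is required.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads,
                                                batch_first=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, tokens, frames):
        # tokens: (B, T_words) int ids; frames: (B, T_frames, acoustic_dim)
        lex, _ = self.lexical_enc(self.token_emb(tokens))   # (B, T_words, H)
        aco = self.acoustic_proj(frames)                     # (B, T_frames, H)
        fused, _ = self.cross_attn(query=lex, key=aco, value=aco)
        # Concatenate lexical and attended acoustic context for each word
        # and predict the punctuation mark that follows it.
        return self.classifier(torch.cat([lex, fused], dim=-1))

# Example: 2 utterances, 12 words each, 300 frames of 80-dim features.
model = AttentionFusionPunctuator(vocab_size=10000)
logits = model(torch.randint(0, 10000, (2, 12)), torch.randn(2, 300, 80))
print(logits.shape)  # torch.Size([2, 12, 4])
```

In the semi-supervised setting described in the abstract, the token embeddings and the acoustic front end would typically be initialized from encoders pretrained on unlabelled text and audio before fine-tuning on punctuation labels.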

Related research

11/19/2021 - Semi-supervised transfer learning for language expansion of end-to-end speech recognition models to low-resource languages
In this paper, we propose a three-stage training methodology to improve ...

06/29/2021 - Alzheimer's Dementia Recognition Using Acoustic, Lexical, Disfluency and Speech Pause Features Robust to Noisy Inputs
We present two multimodal fusion-based deep learning models that consume...

03/13/2023 - The System Description of dun_oscar team for The ICPR MSR Challenge
This paper introduces the system submitted by dun_oscar team for the ICP...

06/03/2019 - From Speech Chain to Multimodal Chain: Leveraging Cross-modal Data Augmentation for Semi-supervised Learning
The most common way for humans to communicate is by speech. But perhaps ...

03/15/2023 - Autonomous Soundscape Augmentation with Multimodal Fusion of Visual and Participant-linked Inputs
Autonomous soundscape augmentation systems typically use trained models ...

04/19/2023 - A Comparison of Semi-Supervised Learning Techniques for Streaming ASR at Scale
Unpaired text and audio injection have emerged as dominant methods for i...

11/17/2020 - ABC-Net: Semi-Supervised Multimodal GAN-based Engagement Detection using an Affective, Behavioral and Cognitive Model
We present ABC-Net, a novel semi-supervised multimodal GAN framework to ...
