A Case Study on Combining ASR and Visual Features for Generating Instructional Video Captions

10/07/2019
by Jack Hessel, et al.

Instructional videos get high traffic on video sharing platforms, and prior work suggests that providing time-stamped, subtask annotations (e.g., "heat the oil in the pan") improves user experiences. However, current automatic annotation methods based on visual features alone perform only slightly better than constant prediction. Taking cues from prior work, we show that we can improve performance significantly by considering automatic speech recognition (ASR) tokens as input. Furthermore, jointly modeling ASR tokens and visual features results in higher performance than training individually on either modality. We find that unstated background information is better explained by visual features, whereas fine-grained distinctions (e.g., "add oil" vs. "add olive oil") are disambiguated more easily via ASR tokens.
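The abstract does not describe the joint model's architecture, so the following is only a minimal sketch of one way such joint modeling could look: a simple late-fusion encoder that combines pooled ASR token embeddings with precomputed visual features. All names and dimensions (JointASRVisualEncoder, asr_vocab_size, visual_dim, hidden_dim) and the mean-pooling choices are hypothetical illustrations, not the paper's method.

```python
# Hypothetical late-fusion sketch (not the paper's actual architecture).
import torch
import torch.nn as nn


class JointASRVisualEncoder(nn.Module):
    def __init__(self, asr_vocab_size, asr_dim=256, visual_dim=2048, hidden_dim=512):
        super().__init__()
        # Embed ASR tokens recognized within a video segment.
        self.asr_embedding = nn.Embedding(asr_vocab_size, asr_dim, padding_idx=0)
        # Project each modality into a shared space before fusing.
        self.asr_proj = nn.Linear(asr_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.fusion = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
        )

    def forward(self, asr_tokens, visual_features):
        # asr_tokens: (batch, num_tokens) token ids
        # visual_features: (batch, num_frames, visual_dim) precomputed frame features
        asr_emb = self.asr_embedding(asr_tokens).mean(dim=1)   # mean-pool over tokens
        vis_emb = visual_features.mean(dim=1)                  # mean-pool over frames
        joint = torch.cat([self.asr_proj(asr_emb), self.visual_proj(vis_emb)], dim=-1)
        return self.fusion(joint)                              # fused segment representation


# Toy usage: the fused representation could feed a caption decoder or classifier.
encoder = JointASRVisualEncoder(asr_vocab_size=10000)
tokens = torch.randint(1, 10000, (2, 30))   # dummy ASR token ids
frames = torch.randn(2, 16, 2048)           # dummy visual features
print(encoder(tokens, frames).shape)        # torch.Size([2, 512])
```

In a sketch like this, the single-modality baselines mentioned in the abstract would correspond to feeding the downstream annotation model only the ASR branch or only the visual branch instead of the concatenated representation.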

