OxfordVGG Submission to the EGO4D AV Transcription Challenge

07/18/2023
by Jaesung Huh, et al.

This report presents the technical details of the OxfordVGG team's submission to the EGO4D Audio-Visual (AV) Automatic Speech Recognition Challenge 2023. We present WhisperX, a system for efficient speech transcription of long-form audio with word-level time alignment, together with two publicly available text normalisers. Our final submission obtained a word error rate (WER) of 56.0% on the challenge leaderboard. All code and models are available at https://github.com/m-bain/whisperX.
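The abstract does not spell out what the two text normalisers do; as an illustrative sketch only (not the authors' actual normalisers, which live in the linked repository), the snippet below shows the kind of English text normalisation typically applied to both reference and hypothesis transcripts before computing WER: lowercasing, punctuation stripping, and whitespace collapsing.

```python
import re

def normalise(text: str) -> str:
    """Minimal illustrative text normaliser for WER scoring.

    This is an assumption-based sketch, not the report's actual
    normalisers: lowercase, drop punctuation (keeping intra-word
    apostrophes), and collapse runs of whitespace.
    """
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)      # punctuation -> space
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text

if __name__ == "__main__":
    print(normalise("Hello, World!!  It's  me."))  # hello world it's me
```

Applying the same normalisation to both sides of the comparison keeps WER from penalising cosmetic differences such as casing or punctuation.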


