AVATAR submission to the Ego4D AV Transcription Challenge

11/18/2022
by Paul Hongsuck Seo, et al.

In this report, we describe our submission to the Ego4D AudioVisual (AV) Speech Transcription Challenge 2022. Our pipeline is based on AVATAR, a state-of-the-art encoder-decoder model for AV-ASR that performs early fusion of spectrograms and RGB images. We describe the datasets, experimental settings, and ablations. Our final method achieves a WER of 68.40 on the challenge test set, outperforming the baseline by 43.7%.
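To make the "early fusion" idea concrete, here is a minimal sketch of how spectrogram frames and RGB frame features could be projected into a shared space and concatenated along the time axis before a single encoder-decoder, so cross-modal attention happens in every encoder layer. This is a hypothetical illustration using a generic Transformer backbone and assumed feature dimensions (80-dim log-mel frames, 512-dim visual features), not the authors' AVATAR implementation.

```python
# Sketch of audio-visual early fusion in an encoder-decoder model.
# Module names, dimensions, and layer counts are illustrative assumptions.
import torch
import torch.nn as nn

class EarlyFusionAVEncoderDecoder(nn.Module):
    def __init__(self, d_model=256, vocab_size=1000):
        super().__init__()
        # Project each modality's features into a shared embedding space.
        self.audio_proj = nn.Linear(80, d_model)    # 80-dim log-mel spectrogram frames (assumed)
        self.video_proj = nn.Linear(512, d_model)   # 512-dim per-frame RGB features (assumed)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, audio, video, tokens):
        # Early fusion: concatenate audio and video tokens along the time axis
        # before the encoder, so the encoder attends across both modalities.
        fused = torch.cat([self.audio_proj(audio), self.video_proj(video)], dim=1)
        dec_in = self.text_embed(tokens)
        hidden = self.transformer(fused, dec_in)
        return self.out(hidden)  # per-token logits over the vocabulary

# Toy usage: batch of 2, 100 audio frames, 16 video frames, 10 target tokens.
model = EarlyFusionAVEncoderDecoder()
logits = model(
    torch.randn(2, 100, 80),
    torch.randn(2, 16, 512),
    torch.randint(0, 1000, (2, 10)),
)
print(logits.shape)  # torch.Size([2, 10, 1000])
```

The design point the abstract highlights is that fusion happens before encoding rather than by combining separately encoded streams, which lets audio and visual tokens interact at every layer of the encoder.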
