A Textless Metric for Speech-to-Speech Comparison
This paper proposes a textless speech-to-speech comparison metric that compares a speech hypothesis with a speech reference without falling back to their text transcripts. We leverage recently proposed speech2unit encoders (such as HuBERT) to pseudo-transcribe speech utterances into discrete acoustic units, and we propose a simple neural architecture that learns a speech-based metric that correlates well with its text-based counterpart. Such a textless metric could ultimately be useful for evaluating speech-to-speech translation, particularly for oral languages or languages with no reliable ASR system available.
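To make the idea of comparing pseudo-transcripts concrete, here is a minimal sketch in Python that scores two sequences of discrete acoustic units with a normalized edit distance. It is only an illustrative, non-learned baseline and not the neural architecture the paper proposes. The unit IDs are made up for the example; in practice they would come from a speech2unit encoder such as HuBERT. The function names, and the choice to merge consecutive repeated units, are assumptions made here.

```python
from itertools import groupby


def collapse_repeats(units):
    """Merge runs of identical consecutive unit IDs (e.g. 5 5 5 9 -> 5 9)."""
    return [unit for unit, _ in groupby(units)]


def unit_edit_distance(hyp, ref):
    """Levenshtein distance between two sequences of unit IDs."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def unit_similarity(hyp_units, ref_units):
    """Similarity in [0, 1]; 1.0 means identical collapsed unit sequences."""
    hyp = collapse_repeats(hyp_units)
    ref = collapse_repeats(ref_units)
    if not hyp and not ref:
        return 1.0
    return 1.0 - unit_edit_distance(hyp, ref) / max(len(hyp), len(ref))


# Hypothetical unit sequences standing in for encoder output.
hypothesis = [12, 12, 7, 7, 7, 33, 5]
reference = [12, 7, 33, 33, 5, 5]
score = unit_similarity(hypothesis, reference)
```

A surface-level score like this only rewards matching unit sequences, so it would likely penalize the same sentence spoken by a different voice or at a different pace. That limitation is one reason a learned metric trained to track a text-based score may be preferable.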