Streaming Parrotron for on-device speech-to-speech conversion

10/25/2022
by   Oleg Rybakov, et al.
0

We present a fully on-device and streaming Speech-To-Speech (STS) conversion model that normalizes a given input speech directly to synthesized output speech (a.k.a. Parrotron). Deploying such an end-to-end model locally on mobile devices pose significant challenges in terms of memory footprint and computation requirements. In this paper, we present a streaming-based approach to produce an acceptable delay, with minimal loss in speech conversion quality, when compared to a non-streaming server-based approach. Our approach consists of first streaming the encoder in real time while the speaker is speaking. Then, as soon as the speaker stops speaking, we run the spectrogram decoder in streaming mode along the side of a streaming vocoder to generate output speech in real time. To achieve an acceptable delay quality trade-off, we study a novel hybrid approach for look-ahead in the encoder which combines a look-ahead feature stacker with a look-ahead self-attention. We also compare the model with int4 quantization aware training and int8 post training quantization and show that our streaming approach is 2x faster than real time on the Pixel4 CPU.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
06/15/2022

Streaming non-autoregressive model for any-to-many voice conversion

Voice conversion models have developed for decades, and current mainstre...
research
03/01/2022

Real time spectrogram inversion on mobile phone

With the growth of computing power on mobile phones and privacy concerns...
research
11/02/2022

Conversation-oriented ASR with multi-look-ahead CBS architecture

During conversations, humans are capable of inferring the intention of t...
research
10/25/2022

Highly Efficient Real-Time Streaming and Fully On-Device Speaker Diarization with Multi-Stage Clustering

While recent research advances in speaker diarization mostly focus on im...
research
11/15/2018

Streaming End-to-end Speech Recognition For Mobile Devices

End-to-end (E2E) models, which directly predict output character sequenc...
research
05/14/2020

Streaming keyword spotting on mobile devices

In this work we explore the latency and accuracy of keyword spotting (KW...
research
05/21/2023

DualVC: Dual-mode Voice Conversion using Intra-model Knowledge Distillation and Hybrid Predictive Coding

Voice conversion is an increasingly popular technology, and the growing ...

Please sign up or login with your details

Forgot password? Click here to reset