Vector Quantized Diffusion Model with CodeUnet for Text-to-Sign Pose Sequences Generation

08/19/2022
by Pan Xie, et al.

Sign Language Production (SLP) aims to translate spoken languages into sign sequences automatically. The core process of SLP is to transform sign gloss sequences into their corresponding sign pose sequences (G2P). Most existing G2P models perform this conditional long-range generation autoregressively, which inevitably leads to an accumulation of errors. To address this issue, we propose a vector quantized diffusion method for conditional pose sequence generation, called PoseVQ-Diffusion, which is an iterative non-autoregressive method. Specifically, we first introduce a vector quantized variational autoencoder (Pose-VQVAE) model to represent a pose sequence as a sequence of latent codes. We then model the discrete latent space with an extension of the recently developed diffusion architecture. To better leverage spatial-temporal information, we introduce a novel architecture, CodeUnet, to generate higher-quality pose sequences in the discrete space. Moreover, taking advantage of the learned codes, we develop a novel sequential k-nearest-neighbours method to predict the variable lengths of pose sequences for corresponding gloss sequences. Consequently, compared with autoregressive G2P models, our model samples faster and produces significantly better results. Compared with previous non-autoregressive G2P methods, PoseVQ-Diffusion improves its predictions through iterative refinement, thus achieving state-of-the-art results on the SLP evaluation benchmark.
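
The idea of representing a pose sequence as a sequence of latent codes can be illustrated with a minimal sketch of the vector quantization lookup step. This is not the authors' Pose-VQVAE: the toy codebook, the 2-D latent vectors, and the function names `quantize` and `dequantize` are all assumptions made for illustration. A real model would learn the codebook jointly with an encoder and a decoder.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


def dequantize(codes: List[int], codebook: List[Vector]) -> List[Vector]:
    """Replace each discrete code with its codebook vector."""
    return [codebook[k] for k in codes]


# Hypothetical toy codebook with three 2-D entries (assumed, not from the paper).
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 0.5)]

# Hypothetical per-frame latent vectors, standing in for encoder outputs.
latents = [(0.1, -0.1), (0.9, 1.2), (-0.8, 0.4), (0.6, 0.7)]

codes = quantize(latents, codebook)          # discrete code sequence
reconstructed = dequantize(codes, codebook)  # quantized latents for a decoder
print(codes)
```

Once a sequence is expressed as integer codes like this, a generative model can operate on the discrete code indices rather than on continuous pose coordinates.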

research
11/29/2021

Vector Quantized Diffusion Model for Text-to-Image Synthesis

We present the vector quantized diffusion (VQ-Diffusion) model for text-...
research
03/30/2021

Symbolic Music Generation with Diffusion Models

Score-based generative models and diffusion probabilistic models have be...
research
08/12/2022

Non-Autoregressive Sign Language Production via Knowledge Distillation

Sign Language Production (SLP) aims to translate expressions in spoken l...
research
09/21/2023

Autoregressive Sign Language Production: A Gloss-Free Approach with Discrete Representations

Gloss-free Sign Language Production (SLP) offers a direct translation of...
research
11/24/2022

Ham2Pose: Animating Sign Language Notation into Pose Sequences

Translating spoken languages into Sign languages is necessary for open c...
research
07/20/2022

Diffsound: Discrete Diffusion Model for Text-to-sound Generation

Generating sound effects that humans want is an important topic. However...
research
04/07/2023

ChiroDiff: Modelling chirographic data with Diffusion Models

Generative modelling over continuous-time geometric constructs, a.k.a su...