Can Language Models Learn to Listen?

08/21/2023
by Evonne Ng, et al.

We present a framework for generating appropriate facial responses from a listener in dyadic social interactions, driven by the speaker's words. Given an input transcription of the speaker's words with their timestamps, our approach autoregressively predicts the listener's response: a sequence of listener facial gestures, quantized using a VQ-VAE. Since gesture is a component of language, we propose treating the quantized atomic motion elements as additional language tokens fed to a transformer-based large language model. Initializing our transformer with the weights of a language model pre-trained only on text yields significantly higher-quality listener responses than training a transformer from scratch. We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study. In our evaluation, we analyze the model's ability to exploit the temporal and semantic aspects of spoken text. Project page: https://people.eecs.berkeley.edu/~evonne_ng/projects/text2listen/
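To make the core idea concrete, here is a minimal sketch (not the authors' released code) of how quantized motion codes can be treated as extra vocabulary entries of a text-only pre-trained language model. The codebook size, the GPT-2 backbone, the interleaving of text and motion tokens, and the `motion_token` helper are illustrative assumptions, not the paper's exact interface.

```python
# Sketch: extend a text-only pre-trained LM with VQ-VAE motion codes so that
# listener gestures can be predicted autoregressively as "language" tokens.
# Assumptions (not from the paper): GPT-2 backbone, 256-entry codebook,
# motion codes appended after the transcribed speaker text.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

N_MOTION_CODES = 256  # assumed VQ-VAE codebook size

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")  # weights pre-trained on text only

# Append one embedding row per motion code; the text rows keep their
# pre-trained weights, which is what allows transfer from language to gesture.
text_vocab = len(tokenizer)
model.resize_token_embeddings(text_vocab + N_MOTION_CODES)

def motion_token(code: int) -> int:
    """Map a VQ-VAE codebook index to its id in the extended vocabulary."""
    return text_vocab + code

# Toy input: the speaker's words followed by previously generated listener codes.
text_ids = tokenizer.encode("that is such great news!")
motion_ids = [motion_token(c) for c in (17, 203, 5)]  # made-up codebook indices
input_ids = torch.tensor([text_ids + motion_ids])

with torch.no_grad():
    logits = model(input_ids).logits[0, -1]       # next-token distribution
next_code = logits[text_vocab:].argmax().item()   # restrict sampling to motion codes
print("predicted next motion code:", next_code)
```

At generation time one would decode the predicted code sequence back to facial motion with the VQ-VAE decoder; the sketch above only illustrates the vocabulary-extension step.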

