Zero-Shot End-to-End Spoken Language Understanding via Cross-Modal Selective Self-Training

05/22/2023
by   Jianfeng He, et al.

End-to-end (E2E) spoken language understanding (SLU) is constrained by the cost of collecting speech-semantics pairs, especially when label domains change. We therefore explore zero-shot E2E SLU, which learns an E2E SLU model without speech-semantics pairs, using only speech-text and text-semantics pairs. Prior work achieved this by pseudo-labeling every transcript in a speech-text corpus with a natural language understanding (NLU) model trained on text-semantics corpora. This approach, however, assumes the speech-text and text-semantics domains match, whereas in practice they often differ because the corpora are collected separately. Moreover, using an entire speech-text corpus from arbitrary domains introduces sample imbalance and label noise. To address these issues, we propose cross-modal selective self-training (CMSST). CMSST tackles imbalance by clustering in a joint space of the three modalities (speech, text, and semantics) and handles label noise with a selection network. We also introduce two benchmarks for zero-shot E2E SLU, covering matched and found-speech (mismatched) settings. Experiments show that CMSST improves performance in both settings while significantly reducing sample size and training time.
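To make the cluster-based rebalancing idea concrete, the sketch below shows one plausible reading of that step: embed each sample in a joint space, cluster the embeddings, and cap the number of samples kept per cluster so that overrepresented regions do not dominate training. This is an illustrative toy, not the paper's actual pipeline; the k-means routine, the `per_cluster` cap, and the random joint embeddings are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means: returns a cluster assignment for each row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        assign = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(axis=0)
    return assign

def balanced_select(embeddings, k=4, per_cluster=3, seed=0):
    """Cluster joint embeddings and keep at most `per_cluster` samples
    per cluster, mitigating imbalance in the pseudo-labeled corpus."""
    assign = kmeans(embeddings, k, seed=seed)
    rng = np.random.default_rng(seed)
    keep = []
    for j in range(k):
        idx = np.where(assign == j)[0]
        rng.shuffle(idx)
        keep.extend(idx[:per_cluster].tolist())
    return sorted(keep)

# toy joint embeddings (stand-in for concatenated speech/text/semantics features)
emb = np.random.default_rng(1).normal(size=(40, 8))
selected = balanced_select(emb, k=4, per_cluster=3)
print(len(selected))  # at most k * per_cluster = 12
```

In the paper's setting, the selected subset would then be filtered further by the selection network before training the E2E SLU model on the surviving pseudo-labeled pairs.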

