Tie Your Embeddings Down: Cross-Modal Latent Spaces for End-to-end Spoken Language Understanding

by Bhuvan Agrawal, et al.

End-to-end (E2E) spoken language understanding (SLU) systems can infer the semantics of a spoken utterance directly from an audio signal. However, training an E2E system remains a challenge, largely due to the scarcity of paired audio-semantics data. In this paper, we treat an E2E system as a multi-modal model, with audio and text functioning as its two modalities, and use a cross-modal latent space (CMLS) architecture, in which a shared latent space is learned between the 'acoustic' and 'text' embeddings. We propose using different multi-modal losses to explicitly guide the acoustic embeddings closer to the text embeddings, which are obtained from a semantically powerful pre-trained BERT model. We train the CMLS model on two publicly available E2E datasets with different cross-modal losses, and show that our proposed triplet loss function achieves the best performance. It achieves a relative improvement of 1.4 [...] space and a relative improvement of 0.7 [...] CMLS model using L_2 loss. The gains are higher for the smaller, more complicated E2E dataset, demonstrating the value of an efficient cross-modal loss function, especially when limited E2E training data is available.
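
To make the contrast between the two cross-modal losses concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the acoustic embedding acts as the anchor, the BERT text embedding of the same utterance as the positive, and the text embedding of a different utterance as the negative. The function names, the Euclidean distance, and the margin value are illustrative assumptions that the abstract does not specify.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def l2_loss(acoustic, text):
    """Squared L_2 loss: pulls the acoustic embedding onto its paired text embedding."""
    return l2_distance(acoustic, text) ** 2


def triplet_loss(acoustic, text_pos, text_neg, margin=1.0):
    """Hinge-style triplet loss (assumed form, margin is a hypothetical value).

    The loss is zero once the paired text embedding is closer to the
    acoustic anchor than the unpaired one by at least `margin`.
    """
    d_pos = l2_distance(acoustic, text_pos)
    d_neg = l2_distance(acoustic, text_neg)
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-D embeddings (made-up numbers, for illustration only).
acoustic = [1.0, 0.0]       # acoustic embedding of an utterance
text_match = [1.0, 1.0]     # text embedding of the same utterance
text_far = [4.0, 4.0]       # text embedding of an unrelated utterance
text_near = [1.0, 0.5]      # unrelated utterance that lies too close

loss_l2 = l2_loss(acoustic, text_match)
loss_easy = triplet_loss(acoustic, text_match, text_far)
loss_hard = triplet_loss(acoustic, text_match, text_near)
```

In this sketch the L_2 loss only pulls paired embeddings together, whereas the triplet loss additionally penalizes an unpaired text embedding that sits closer to the acoustic anchor than the margin allows, which is one plausible reading of why a triplet objective could shape the shared space more effectively.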


ST-BERT: Cross-modal Language Model Pre-training For End-to-end Spoken Language Understanding

Language model pre-training has shown promising results in various downs...

Learning Audio-Text Agreement for Open-vocabulary Keyword Spotting

In this paper, we propose a novel end-to-end user-defined keyword spotti...

Speak or Chat with Me: End-to-End Spoken Language Understanding System with Flexible Inputs

A major focus of recent research in spoken language understanding (SLU) ...

Cross-modal Contrastive Learning for Speech Translation

How can we learn unified representations for spoken utterances and their...

Speech to Text Adaptation: Towards an Efficient Cross-Modal Distillation

Speech is one of the most effective means of communication and is full o...

Embed Everything: A Method for Efficiently Co-Embedding Multi-Modal Spaces

Any general artificial intelligence system must be able to interpret, op...

Integration of Pre-trained Networks with Continuous Token Interface for End-to-End Spoken Language Understanding

Most End-to-End (E2E) SLU networks leverage the pre-trained ASR networks...