Audio-to-Intent Using Acoustic-Textual Subword Representations from End-to-End ASR

10/21/2022
by Pranay Dighe, et al.

Accurate prediction of the user's intent to interact with a voice assistant (VA) on a device (e.g., on a phone) is critical for achieving naturalistic, engaging, and privacy-centric interactions with the VA. To this end, we present a novel approach that predicts the user's intent (whether the user is speaking to the device or not) directly from acoustic and textual information encoded at the level of subword tokens, which are obtained via an end-to-end ASR model. Modeling subword tokens directly, rather than phonemes and/or full words, has at least two advantages: (i) it provides a unique vocabulary representation in which each token carries semantic meaning, in contrast to phoneme-level representations, and (ii) each subword token has a reusable "sub"-word acoustic pattern (which can be used to construct multiple full words), resulting in a greatly reduced vocabulary space compared to full words. To learn the subword representations for audio-to-intent classification, we extract: (i) acoustic information from an E2E-ASR model, which provides frame-level CTC posterior probabilities for the subword tokens, and (ii) textual information from a pre-trained continuous bag-of-words model capturing the semantic meaning of the subword tokens. The key to our approach is the way it combines acoustic subword-level posteriors with textual information using the notion of positional encoding in order to account for multiple ASR hypotheses simultaneously. We show that our approach provides more robust and richer representations for audio-to-intent classification, and is highly accurate, correctly mitigating 93.3% of unintended user audio from invoking the assistant at a 99% true positive rate on intended audio.
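To make the fusion of acoustic and textual subword information more concrete, below is a minimal sketch (not the authors' code) of one way to combine frame-level CTC subword posteriors from an E2E ASR model with pre-trained subword embeddings for intent classification. All module names, shapes, the GRU encoder, and the classifier head are illustrative assumptions; in particular, the sketch simplifies the paper's positional-encoding-based combination of multiple ASR hypotheses to a posterior-weighted average of subword embeddings at each frame.

```python
import torch
import torch.nn as nn


class AudioToIntentClassifier(nn.Module):
    """Sketch of an audio-to-intent model over acoustic-textual subword features."""

    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int = 128):
        super().__init__()
        # A pre-trained continuous bag-of-words subword embedding table would be
        # loaded here; randomly initialized weights stand in for it in this sketch.
        self.subword_embeddings = nn.Embedding(vocab_size, embed_dim)
        # Lightweight sequence encoder over the fused per-frame features.
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # Binary head: device-directed (intended) vs. unintended audio.
        self.head = nn.Linear(hidden_dim, 2)

    def forward(self, ctc_posteriors: torch.Tensor) -> torch.Tensor:
        # ctc_posteriors: (batch, frames, vocab_size) frame-level subword
        # posterior probabilities emitted by the E2E ASR model's CTC branch.
        # Fuse acoustics and text via a posterior-weighted average of subword
        # embeddings per frame, which keeps competing subword hypotheses in
        # play instead of committing to a single 1-best transcript.
        fused = ctc_posteriors @ self.subword_embeddings.weight  # (B, T, E)
        _, last_hidden = self.encoder(fused)                     # (layers, B, H)
        return self.head(last_hidden[-1])                        # (B, 2) logits


if __name__ == "__main__":
    vocab_size, embed_dim = 5000, 64
    model = AudioToIntentClassifier(vocab_size, embed_dim)
    # Fake CTC posteriors for a batch of 2 utterances with 100 frames each.
    posteriors = torch.softmax(torch.randn(2, 100, vocab_size), dim=-1)
    print(model(posteriors).shape)  # torch.Size([2, 2])
```

The weighted-embedding fusion is the design point worth noting: because the CTC posteriors spread probability mass over several plausible subword tokens per frame, the resulting features reflect acoustic uncertainty and textual semantics jointly, rather than the semantics of a single decoded hypothesis.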


