Lattice-based Improvements for Voice Triggering Using Graph Neural Networks

01/25/2020 · by Pranay Dighe, et al.

Voice-triggered smart assistants often rely on detection of a trigger phrase before they start listening for the user request. Mitigation of false triggers is an important aspect of building a privacy-centric, non-intrusive smart assistant. In this paper, we address the task of false trigger mitigation (FTM) using a novel approach based on analyzing automatic speech recognition (ASR) lattices with graph neural networks (GNNs). The proposed approach exploits the fact that the decoding lattice of falsely triggered audio exhibits uncertainty, in the form of many alternative paths and unexpected words on the lattice arcs, compared to the lattice of correctly triggered audio. A pure trigger-phrase detector does not fully utilize the intent of the user speech, whereas by using the complete decoding lattice of the user audio we can effectively mitigate speech not intended for the smart assistant. We deploy two variants of GNNs in this paper, based on 1) graph convolution layers and 2) the self-attention mechanism. Our experiments demonstrate that GNNs are highly accurate on the FTM task, mitigating about 87% of false triggers at a 99% true positive rate. Furthermore, the proposed models are fast to train and efficient in their parameter requirements.


1 Introduction

Smart assistants are ubiquitous in various devices like smart speakers, mobile phones, smart watches, etc. These devices are controlled by voice commands gated by a trigger phrase or a wake-up word. While the trigger-phrase detection algorithms [10] are precise and reliable, the operating point may sometimes allow some non-trigger speech or background noise to wake up the device. The goal of this paper is to improve the quality of the smart assistant by minimizing the number of false alarms while rejecting a minimal number (ideally zero) of intended triggers.

Figure 1: Example of true and false trigger lattices.

The proposed approach utilizes ASR decoding lattices to determine whether a user request is a false trigger or not. Lattices are obtained as Weighted Finite State Transducer (WFST) graphs [8] during the beam-search decoding step in ASR, and they concisely represent the top few competing word-sequences hypothesized for the processed utterance. Our FTM approach is based on the hypothesis that a true (intended) utterance spoken by a user is less noisy and its best word-sequence hypothesis has zero (or few) competing hypotheses in the ASR lattice. This is illustrated by the skinny lattice shown in Figure 1(a). On the other hand, false triggers often originate either from background noise or from speech which sounds similar to the trigger phrase. Multiple ASR hypotheses may compete during decoding in this case, and they may be present as alternative paths in the lattices of false trigger utterances. The uncertainty of the ASR decoder can also be exhibited by a different distribution of the acoustic model (AM) and language model (LM) scores on the lattice arcs. An example of a lattice from a false trigger utterance is shown in Figure 1(b).

Figure 2: Example of lattice represented as adjacency matrix (row-normalized) and feature matrix.

Note that we do not rely on the 1-best ASR hypothesis for FTM here, because the acoustic model and language model can sometimes “hallucinate” the trigger phrase. Instead, our approach leverages the whole ASR lattice. Along with the trigger-phrase audio, we expect to exploit the uncertainty in the post-trigger-phrase audio as well. True triggers typically contain device-directed speech (e.g. “hi Dan, what time is it?”) with limited vocabulary and query-like grammar, whereas false triggers may contain random noise or background speech (e.g. “hey Don, let’s grab lunch”). These differences are explicitly exhibited by the decoding lattices, and we model them using graph neural networks in this paper. Specifically, we use graph convolution networks (GCN) and self-attention based graph neural networks (SAGNN). The FTM task undertaken in this paper differs from the voice trigger (VT) detection task because we analyze the whole utterance, as opposed to a pure trigger detector which focuses only on the hypothesized trigger-phrase audio.

Prior work on FTM mainly comprises research on keyword spotting and wake-up word detection. VT detection approaches typically rely on multi-stage neural network based processing of acoustic features to determine the presence of the wake-word [10, 5, 13, 1, 9]. These approaches often use ASR as an auxiliary task to aid the VT detection task. Lattice-based FTM has been successfully explored in [3, 2], which showed that confusion in ASR lattices acts as strong evidence of a false alarm. Prior work on GNNs is comprehensively summarized in [14, 4, 6, 15], and [11] demonstrates the use of self-attention based GNNs on lattice inputs for a machine translation task.

The rest of the paper is organized as follows. Section 2 discusses how to convert lattices into appropriate features for GNN-based processing. Section 3 explains our GCN and SAGNN based FTM approaches. Section 4 provides details of the FTM experiments, their results, and analysis. Finally, Section 5 concludes the paper.

2 Lattices as Input Features for FTM

In our experimental system, when a trigger detection mechanism detects a trigger, the system starts processing the user audio using a full-blown ASR system. A dedicated algorithm determines the end-of-speech event, at which point we obtain the ASR output and the decoding lattice. We use word-aligned lattices such that each arc corresponds to a hypothesized word, and we derive feature vectors for lattice arcs in a manner similar to [3].

Figure 3: Graph convolution operation on a lattice arc aggregates information from all neighbors at a distance of one hop.

Lattices can be visualized as directed acyclic graphs defined by a collection of nodes and edges. Denoting lattice arcs as nodes of the graph, there is a directed edge from one node to another if the underlying arcs are connected in the lattice (i.e. the end-state of the first arc is the start-state of the second arc). Each node (or arc) has a feature vector associated with it. The FTM task is to take a lattice as a graph input and perform binary classification between the true trigger and false trigger classes. Formally, if $\{a_1, \ldots, a_N\}$ is the set of arcs in the lattice, where $N$ is the total number of arcs and each arc $a_i$ has a feature vector $x_i \in \mathbb{R}^d$, we can express the lattice in terms of the following two matrices:

  • Adjacency matrix $A \in \mathbb{R}^{N \times N}$: where $A_{ij} = 1$ if arc $a_i$ is connected to arc $a_j$ and $A_{ij} = 0$ otherwise, and

  • Feature matrix $X \in \mathbb{R}^{N \times d}$: which contains the arc features $x_i$ row-wise, in the same order as the arcs appear in $A$.

We design $A$ to be symmetric so that it captures both forward and backward connections in the lattice, and each arc is also considered adjacent to itself ($A_{ii} = 1$). An example of a lattice represented as $A$ and $X$ is shown in Figure 2.
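To make this construction concrete, the following is a minimal sketch (not the authors' code) of building the row-normalized adjacency matrix and the feature matrix from a word-aligned lattice; the representation of arcs as (start-state, end-state) pairs and the function name are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): building A and X from a word-aligned
# lattice. Arcs are assumed to be given as (start_state, end_state) pairs, with
# one precomputed d-dim feature vector per arc.
import numpy as np

def lattice_to_matrices(arcs, feats):
    """arcs:  list of (start_state, end_state) tuples, one per lattice arc.
    feats: list of d-dim feature vectors, in the same order as `arcs`.
    Returns the row-normalized adjacency matrix A_hat and feature matrix X."""
    n = len(arcs)
    A = np.eye(n, dtype=np.float32)              # each arc is adjacent to itself
    for i, (_, end_i) in enumerate(arcs):
        for j, (start_j, _) in enumerate(arcs):
            if i != j and end_i == start_j:      # arc i feeds into arc j
                A[i, j] = 1.0
                A[j, i] = 1.0                    # symmetric: forward + backward
    A_hat = A / A.sum(axis=1, keepdims=True)     # row-normalize
    X = np.stack(feats).astype(np.float32)       # arc features row-wise, (n, d)
    return A_hat, X
```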

3 Graph Neural Networks for FTM

In this section, we explain the graph convolution and self-attention based GNN architectures for the FTM task.

3.1 Graph Convolution Network

A graph convolution network (GCN) [4, 14] differs from a vanilla CNN in that it generalizes the 2-D convolution operation to a generic graph structure. For each node in a given graph, information is accumulated from all of its neighbor nodes. The graph does not need to have a 2-D grid structure (e.g. pixels in an image), and each node in the graph can have a different number of neighbors.

Figure 3 depicts a graph convolution operation using a lattice as an example, where an arc aggregates information from its one-hop neighbors. If $h_i^{(l)}$ denotes the hidden feature vector associated with arc $a_i$ in the $l$-th hidden layer of a GCN, the graph convolution operation accumulates information from the neighbors $\mathcal{N}(i)$ of arc $a_i$ to determine its hidden feature vector in the next layer as follows:

$$h_i^{(l+1)} = \sigma\Big( \sum_{j \in \mathcal{N}(i) \cup \{i\}} \hat{A}_{ij} \, W^{(l)} h_j^{(l)} \Big) \qquad (1)$$

where $\sigma$ is a non-linear function, $W^{(l)}$ denotes the weight parameters of the $l$-th layer, and $\hat{A}$ is the row-normalized adjacency matrix (Figure 2). Such a graph convolution operation is computed on each arc of the graph simultaneously, so that the hidden feature vectors of all the arcs are updated.

Figure 4: (a) Self-attention GNN layer, (b) Masked Self-attention GNN layer

Given a lattice in terms of the adjacency matrix $A$, the arc features $X$, and an $L$-layer deep GCN with weight matrices $W^{(l)} \in \mathbb{R}^{d_l \times d_{l+1}}$ as learnable parameters for each hidden layer, the layerwise transformation is as follows:

$$H^{(l+1)} = \sigma\big(\hat{A} \, H^{(l)} W^{(l)}\big) \qquad (2)$$

where $H^{(0)} = X$, the non-linear function $\sigma$ is the ReLU activation, and $\hat{A}$ is the row-normalized adjacency matrix. The product $H^{(l)} W^{(l)}$ performs a linear transformation of the arc features, and the product with the adjacency matrix imposes the lattice structure via the graph convolution operation. Therefore, the $i$-th row of $H^{(l+1)}$, which corresponds to arc $a_i$’s hidden feature vector, accumulates information from all the neighboring arcs’ hidden feature vectors from the previous GC layer. At the output of the last GC layer, we have a hidden feature vector for each arc in $H^{(L)}$. We combine the hidden features in $H^{(L)}$ using average pooling to get an overall lattice embedding. This embedding is then processed using a fully connected hidden layer followed by a sigmoid non-linearity to predict the probability of the input lattice being a false trigger. The adjacency matrix $\hat{A}$ is fixed for a given lattice and does not change across the hidden layers of the GCN. Since $\hat{A}$ is row-normalized, the aggregation of information in the GC operation preserves the numerical scale of the hidden feature vectors.
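As an illustration of Eq. (2) together with the average pooling and sigmoid classification head, the following PyTorch sketch may help; the class name, layer count, and hidden size are illustrative assumptions rather than the authors' exact implementation.

```python
# Illustrative sketch of Eq. (2): stacked graph convolutions, average pooling
# over arcs, and a fully connected layer with sigmoid output.
import torch
import torch.nn as nn

class LatticeGCN(nn.Module):
    def __init__(self, in_dim=20, hid_dim=64, n_layers=6):
        super().__init__()
        dims = [in_dim] + [hid_dim] * n_layers
        self.gc_weights = nn.ModuleList(
            nn.Linear(dims[l], dims[l + 1], bias=False) for l in range(n_layers))
        self.classifier = nn.Sequential(nn.Linear(hid_dim, hid_dim), nn.ReLU(),
                                        nn.Linear(hid_dim, 1), nn.Sigmoid())

    def forward(self, A_hat, X):                 # A_hat: (N, N), X: (N, in_dim)
        H = X                                    # H^(0) = X
        for W in self.gc_weights:
            H = torch.relu(A_hat @ W(H))         # H^(l+1) = ReLU(A_hat H^(l) W^(l))
        lattice_emb = H.mean(dim=0)              # average pooling over all arcs
        return self.classifier(lattice_emb)      # probability of false trigger
```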

3.2 Deep-residual Graph Convolution Network

In each hidden layer of the GCN above, the operation $\hat{A} H^{(l)} W^{(l)}$ aggregates information only from the neighbors at a distance of one hop. With more hidden layers, each arc successively receives information from farther away neighbors in the lattice. Therefore, the depth of the GCN controls how the lattice-level information is accumulated in the lattice embedding computed at the final GCN layer. In order to train deep GCNs, we use residual connections and batch-normalization layers in our model architecture. Figure 5 shows a residual block containing two GC operations. We stack such residual blocks to train deeper GCNs in this paper.

Figure 5: Residual block for GCN
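A hedged sketch of such a residual block is given below; the exact placement of batch normalization and ReLU relative to the skip connection is an assumption, since the figure and text do not spell it out.

```python
# Sketch of a residual block with two GC operations, batch normalization, and
# a skip connection (ordering of BN/ReLU around the skip is assumed).
import torch
import torch.nn as nn

class ResidualGCBlock(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.W1 = nn.Linear(dim, dim, bias=False)
        self.W2 = nn.Linear(dim, dim, bias=False)
        self.bn1 = nn.BatchNorm1d(dim)
        self.bn2 = nn.BatchNorm1d(dim)

    def forward(self, A_hat, H):                        # A_hat: (N, N), H: (N, dim)
        out = torch.relu(self.bn1(A_hat @ self.W1(H)))  # first GC operation
        out = self.bn2(A_hat @ self.W2(out))            # second GC operation
        return torch.relu(out + H)                      # skip connection
```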

3.3 Self-Attention based Graph Neural Network

Self-attention graph neural networks (SAGNN) [12, 11, 6, 15] model the relationship between the nodes of a graph using the self-attention mechanism instead of the predefined edges in the graph. Instead of using a fixed adjacency matrix as in GCNs, this approach uses the inner product of the feature vectors of the lattice arcs to compute their relevance to each other. As the arc features change in successive hidden layers of the network, their mutual relevance changes as well. Therefore, SAGNN can be seen as a generalization of GCNs where the adjacency matrix is no longer static but is determined dynamically.

We use the multi-headed self-attention mechanism [12] as shown in Figure 4(a). Each head acts as a unique filter to determine queries $Q$, keys $K$, and values $V$ from the arc features. The self-attention matrix, computed as $S = \mathrm{softmax}(QK^{\top}/\sqrt{d_k})$, is used to linearly combine the values together. Outputs of all the heads are then concatenated to generate the overall output of one SAGNN hidden layer. We use multiple such layers stacked together, followed by a final hidden layer for false trigger classification.

The self-attention mechanism ignores the lattice structure completely, as $S$ may have non-zero attention values for arc pairs which are not connected to each other or do not appear in the same path in the lattice. To impose the lattice structural constraints, we propose to mask the self-attention matrix using the lattice's adjacency matrix $A$, so that the masked self-attention (shown in Figure 4(b)) is defined as $S_{\mathrm{masked}} = S \odot A$, where $\odot$ denotes the Hadamard product. Masking ensures that we aggregate information only from arcs which are connected to each other, while the aggregation is still weighted by the self-attention mechanism. Therefore, the masked SAGNN architecture can be seen as a hybrid of the GCN and SAGNN approaches.
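The following sketch shows one masked multi-head self-attention layer over lattice arcs; the head count, dimensions, and the single shared QKV projection are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of one masked self-attention GNN layer: per-head attention
# scores are masked (Hadamard product) by the binary lattice adjacency matrix
# before being used to combine the value vectors.
import math
import torch
import torch.nn as nn

class MaskedSelfAttentionLayer(nn.Module):
    def __init__(self, dim=64, n_heads=4):
        super().__init__()
        self.n_heads, self.d_k = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)   # shared Q, K, V projection
        self.out = nn.Linear(dim, dim)

    def forward(self, A, H):                        # A: (N, N) binary adjacency, H: (N, dim)
        N = H.size(0)
        q, k, v = self.qkv(H).chunk(3, dim=-1)      # each (N, dim)
        q = q.view(N, self.n_heads, self.d_k).transpose(0, 1)  # (heads, N, d_k)
        k = k.view(N, self.n_heads, self.d_k).transpose(0, 1)
        v = v.view(N, self.n_heads, self.d_k).transpose(0, 1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)  # (heads, N, N)
        S = torch.softmax(scores, dim=-1) * A       # S_masked = S ⊙ A
        out = (S @ v).transpose(0, 1).reshape(N, -1)  # concatenate heads
        return self.out(out)
```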

4 Experimental Analysis

4.1 Dataset, Features, and Evaluation Metrics

Table 1 summarizes the FTM dataset used in our experiments. For training the models, we augment the train and cv sets to three times their original size by adding gain, noise, and speed perturbations. The input lattice arc feature is a 20-dim vector similar to [3], which comprises a 14-dim bag-of-phones embedding from a bottleneck autoencoder, the AM score, the LM score, the arc log-posterior probability, the number of frames in the arc, and two binary dimensions which are set to “1” if the arc corresponds to the words of the trigger phrase. All models have 64-dimensional hidden state vectors and a binary output trained with the binary cross-entropy loss. The models are evaluated on the eval set using ROC curves comparing the true positive rate (TPR) and false alarm rate (FAR) metrics. We expect our devices to have minimal false alarms and maximum true positives for a good user experience. Therefore, we focus on the high-TPR regime of our ROC curves and prefer models which have the largest area under the curve (AUC).

Class train cv eval
True Triggers 42,675 4,742 11,646
False Triggers 18,669 2,074 11,316
Table 1: Dataset for false trigger mitigation task.
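As an illustration of the 20-dim arc feature layout described above, a minimal sketch follows; the argument names are hypothetical, and the 14-dim bag-of-phones embedding is assumed to be produced by a separately trained bottleneck autoencoder as in [3].

```python
# Sketch of assembling one 20-dim arc feature vector; field names are
# illustrative, not the authors' exact feature pipeline.
import numpy as np

def arc_feature(bag_of_phones_emb,   # 14-dim bottleneck embedding for the arc's word
                am_score, lm_score, log_posterior, num_frames,
                is_trigger_word_1, is_trigger_word_2):  # one flag per trigger-phrase word
    assert len(bag_of_phones_emb) == 14
    return np.concatenate([
        bag_of_phones_emb,                                    # 14 dims
        [am_score, lm_score, log_posterior, num_frames],      # 4 scalar dims
        [float(is_trigger_word_1), float(is_trigger_word_2)]  # 2 binary trigger flags
    ]).astype(np.float32)                                     # total: 20 dims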

4.2 Results and Analysis

System | Config | # of params | AUC | FAR
ASR-Output | – | – | – | 0.868
BiLRNN | – | 15,041 | 0.9906 | 0.134
GCN | 6 layers | 26,369 | 0.9888 | 0.185
ResGCN | 8 layers | 70,209 | 0.9905 | 0.155
SAGNN | 2 layers / 4 heads | 39,105 | 0.9906 | 0.150
MaskedSAGNN | 2 layers / 4 heads | 39,105 | 0.9914 | 0.134
Table 2: FTM results using various systems. FAR is shown at TPR=0.99 for all systems (except ASR-Output, where TPR=0.9994).

We compare the proposed GNN based approaches to two baseline systems: 1) FTM using ASR outputs and 2) the lattice RNN approach from [3]. The former is a weak baseline which detects a false alarm if the top ASR hypothesis does not contain the predefined trigger phrase. The latter is a competitive system, as it processes the complete lattice using an RNN to detect false alarms.

Figure 6 compares various GCN and deep residual GCN models using the AUC metric. We observe that the FTM performance of GCN models improves with increasing depth up to 6 hidden layers, after which it starts degrading. This observation is in line with [7, 14], which explore the difficulty of training deep GCNs.

Figure 6: AUC for GCN and deep-residual GCN models as a function of number of hidden layers and residual blocks respectively.
Figure 7: ROC curve for GCNs, self-attention GNNs and bi-directional lattice RNN FTM systems.

We alleviate this issue partially by using residual connections in GCNs (ResGCN), and these models are found to be consistently better than vanilla GCN models. Note that each residual block has 2 GC operations. Therefore, the best ResGCN model with 8 residual blocks has 16 GC operations, as compared to 6 GC operations in the best GCN model. ROC curves of the best GCN and ResGCN models are shown in Figure 7, along with curves for the self-attention GNN models and the baseline bi-directional lattice RNN model. SAGNN and MaskedSAGNN models comfortably outperform the GCN and ResGCN models with only 2 hidden layers and 4 attention heads. These models benefit from the multi-head attention mechanism, which can be viewed as multiple GC operations acting in parallel. Furthermore, the self-attention values are learned dynamically, as opposed to the static adjacency matrix used in GCNs, thereby focusing attention on the parts of the lattice relevant to the FTM task. MaskedSAGNN improves over SAGNN, which demonstrates the importance of using lattice structural information in detecting a false trigger. The masking operation may reveal the uncertainty of ASR by exposing the presence of alternative paths in the lattices.

Table 2 further compares our models to the baseline systems. The ASR-based baseline is not tunable and has a false alarm rate of 0.868. In comparison, the lattice RNN baseline and the masked SAGNN approach perform similarly, mitigating 86.6% of false alarms (FAR=0.134) at 1% false rejects (TPR=0.99). Lattice RNNs rely on recurrent operations which result in efficient parameter sharing across the time dimension. While the lattice RNNs perform similarly to the GNN-based FTM, we found that they are often inconvenient due to increased training time compared to the GNNs (approximately 8 min/epoch for RNN training versus 1.5 min/epoch for GNN training in our experiments). As observed in [11], the unique connectivity of each lattice inhibits efficient batching of training examples during RNN training. In GNNs, on the other hand, recurrent computations are replaced by graph convolution operations, and multiple lattices of different sizes and structures can be efficiently batched together using zero padding, resulting in a substantial training speed-up.
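To illustrate this batching, the following sketch (not from the paper) zero-pads lattices of different sizes into dense batch tensors so that the graph convolutions can run as batched matrix multiplications; it assumes each adjacency matrix has already been row-normalized per lattice.

```python
# Illustrative sketch: zero-padding variable-size lattices into one batch.
# Padded rows/columns of A stay zero, so padded arcs exchange no information.
import torch

def batch_lattices(A_list, X_list):
    """A_list: list of (n_i, n_i) row-normalized adjacency tensors.
    X_list: list of (n_i, d) arc feature tensors.
    Returns padded tensors A (B, N_max, N_max) and X (B, N_max, d)."""
    n_max = max(a.size(0) for a in A_list)
    d = X_list[0].size(1)
    A = torch.zeros(len(A_list), n_max, n_max)
    X = torch.zeros(len(X_list), n_max, d)
    for b, (a, x) in enumerate(zip(A_list, X_list)):
        n = a.size(0)
        A[b, :n, :n] = a
        X[b, :n, :] = x
    return A, X
```

A batched GC layer can then be computed with a single batched matrix multiplication, e.g. torch.bmm(A, X @ W), since the all-zero padded entries contribute nothing to the aggregation.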

5 Conclusions

In this paper, we focused on the false trigger mitigation task, which improves a voice-triggered smart assistant by making it less intrusive and more privacy-preserving. We proposed a novel solution that performs lattice-based FTM using graph neural networks. Lattices are highly informative for distinguishing true triggers from false triggers. We demonstrated that by processing lattices with GNNs, a majority of false alarms can be mitigated at the expense of rejecting only a few true triggers. We explored two variants of GNNs, namely GCN and self-attention GNN models. We showed that GCN-based FTM can be improved by increasing network depth using residual connections. Alternatively, self-attention based GNN models can be improved using adjacency matrix-based masking. We also found that GNN models are fast to train due to efficient processing of lattices using graph convolution operations. In the future, we plan to extend GNN-based lattice processing to other tasks such as user-intent classification.

References

  • [1] J. Guo, K. Kumatani, M. Sun, M. Wu, A. Raju, N. Ström, and A. Mandal (2018) Time-delayed bottleneck highway networks using a DFT feature for keyword spotting. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5489–5493.
  • [2] C. Huang, R. Maas, S. H. Mallidi, and B. Hoffmeister (2019) A study for improving device-directed speech detection toward frictionless human-machine interaction. In Proc. Interspeech 2019.
  • [3] W. Jeon, L. Liu, and H. Mason (2019) Voice trigger detection from LVCSR hypothesis lattices using bidirectional lattice recurrent neural networks. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6356–6360.
  • [4] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
  • [5] K. Kumatani, S. Panchapagesan, M. Wu, M. Kim, N. Strom, G. Tiwari, and A. Mandal (2017) Direct modeling of raw audio with DNNs for wake word detection. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 252–257.
  • [6] J. Lee, I. Lee, and J. Kang (2019) Self-attention graph pooling. In International Conference on Machine Learning, pp. 3734–3743.
  • [7] Q. Li, Z. Han, and X. Wu (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • [8] M. Mohri, F. Pereira, and M. Riley (2002) Weighted finite-state transducers in speech recognition. Computer Speech & Language 16(1), pp. 69–88.
  • [9] A. Norouzian, B. Mazoure, D. Connolly, and D. Willett (2019) Exploring attention mechanism for acoustic-based classification of speech utterances into system-directed and non-system-directed. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7310–7314.
  • [10] S. Sigtia, R. Haynes, H. Richards, E. Marchi, and J. Bridle (2018) Efficient voice trigger detection for low resource hardware. In Proc. Interspeech 2018, pp. 2092–2096.
  • [11] M. Sperber, G. Neubig, N. Pham, and A. Waibel (2019) Self-attentional models for lattice inputs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1185–1197.
  • [12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
  • [13] M. Wu, S. Panchapagesan, M. Sun, J. Gu, R. Thomas, S. N. Prasad Vitaladevuni, B. Hoffmeister, and A. Mandal (2018) Monophone-based background modeling for two-stage on-device wake word detection. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5494–5498.
  • [14] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2019) A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596.
  • [15] J. Zhang, X. Shi, J. Xie, H. Ma, I. King, and D. Yeung (2018) GaAN: gated attention networks for learning on large and spatiotemporal graphs. In UAI 2018.