Streaming ResLSTM with Causal Mean Aggregation for Device-Directed Utterance Detection

07/17/2020 ∙ by Xiaosu Tong, et al. ∙ Amazon

In this paper, we propose a streaming model to distinguish voice queries intended for a smart-home device from background speech. The proposed model consists of multiple CNN layers with residual connections, followed by a stacked LSTM architecture. The streaming capability is achieved by using unidirectional LSTM layers and a causal mean aggregation layer to form the final utterance-level prediction up to the current frame. In order to avoid redundant computation during online streaming inference, we use a caching mechanism for every convolution operation. Experimental results on a device-directed vs. non device-directed task show that the proposed model yields an equal error rate reduction of 41% on this task. Furthermore, we show that the proposed model is able to predict accurately earlier in time than attention-based models.




1 Introduction

Smart-home devices such as Amazon Echo and Google Home are often used in challenging acoustic conditions, such as a living room with multiple talkers and background media speech. In these situations, it is crucial for the device to respond only to intended speech (referred to as device-directed (DD)) and to ignore unintended speech (referred to as non device-directed (ND)). We refer to “device-directed speech detection” as the binary utterance-level classification task, which can be tackled by a binary classifier trained with different types of features. Historically, two main types of features have been used in studies of device-directed speech detection: acoustic features and features from Automatic Speech Recognition (ASR) decoding [reich2011real, shriberg2012learning, yamagata2009system, lee2013using, wang2013understanding]. First, acoustic features such as energy, pitch, speaking rate, duration and the corresponding statistical summaries are considered in [shriberg2012learning]. Other acoustic features such as multi-scale Gabor wavelets are studied in [yamagata2009system]. Second, features from the ASR decoder, such as ASR confidence scores and N-grams, have also proved valuable for the detection task in [shriberg2012learning, yamagata2009system]. Compared to the acoustic features, however, the ASR decoder features are computationally more expensive, and some of them may not be available until the end of the utterance.

Our previous work [mallidi2018device, haung2019study] investigated the device-directed speech detection task and proposed a classifier that integrates multiple feature sources, including the acoustic embedding from a pretrained LSTM, the speech decoding hypothesis, and decoder features from an ASR model, into one single device-directed model. In this paper, we focus in particular on the task of learning utterance-level acoustic embeddings to improve the device-directed speech detection accuracy. We consider two aspects: a) the model topology and b) the aggregation method used to convert frame-wise embeddings into an utterance-level embedding.

As for aggregation methods, Norouzian et al. [norouzian2019exploring] showed that an attention mechanism applied to the frame-wise output of the network can improve the equal error rate (EER) of the classifier. They used acoustic embedding features only and proposed a model topology consisting of a CNN and a bidirectional LSTM for the device-directed speech detection task. Kao et al. [kao2020comparison] compared different aggregation methods on top of LSTM models for rare acoustic event classification, which is also an utterance classification task. The aggregation methods are applied to either the last hidden unit output or the soft label prediction.

Besides the aggregation mechanism, different model topologies for audio classification tasks are studied in [ford2019deep, bae2016acoustic, lim2017rare, cakir2017convolutional, guo2017attention, hershey2017cnn]. Cakır et al. [cakir2017convolutional] proposed a CRNN model structure for the sound event detection task, which is similar to the CLDNN model topology proposed in [sainath2015convolutional]. Since the results were evaluated at frame-level, no aggregation method was considered after the LSTM component. In [guo2017attention], the authors explored a CLDNN model with bidirectional LSTM combined with the attention aggregation for an acoustic scene classification task. Ford et al. [ford2019deep] experimented with different ResNet [he2016deep] structures, and concluded that a 50-layer ResNet shows the best performance on an audio event classification task. In [bae2016acoustic], instead of stacking the LSTM on top of a CNN component, the authors proposed a parallel structure of LSTM and CNN components. Then, the outputs of LSTM and CNN are concatenated and fed into the fully connected layers.

In this paper, we evaluate the performance of different model topologies on the device-directedness task using acoustic features only and find the ResLSTM to outperform the ResNet, LSTM, or CLDNN model structures. Secondly, we propose a new mechanism to incorporate historical information within an utterance using frame-level causal mean aggregation. Compared to the attention method used in [haung2019study, norouzian2019exploring, kao2020comparison], the causal mean aggregation

  • is able to generate a prediction at any frame and can easily be applied to online streaming with much less computation.

  • has the same performance as attention aggregation when evaluated at the end of an utterance.

  • outperforms attention aggregation when evaluated at an early time point of an utterance.

The rest of the paper is organized as follows: Section 2 discusses the network architectures and the different aggregation methods in detail. Sections 3 and 4 present the experimental setup and the corresponding results. We conclude in Section 5.

2 Model architecture

In this section, we will discuss our network architectures and the aggregation methods.

2.1 Model Topologies

Our ResLSTM model consists of one convolutional layer and one batch norm layer followed by six residual blocks and one average pooling layer, as shown in Figure 1. Each residual block has two convolution layers, two batch norm layers, and a residual connection. The second ReLU activation in the residual block is applied after the summation. There are 13 convolutional layers in total (one initial layer plus two in each of the six residual blocks). The LSTM component has three unidirectional LSTM layers with 64 units. After the LSTM, there are two fully connected layers with hidden size 64.

Figure 1: The structure of the ResLSTM model

2.2 Aggregation

After the frame-level embeddings are generated by the network, an aggregation mechanism can be applied to them. We categorize the different aggregation methods into the following groups.

2.2.1 Simple aggregation

We consider two types of simple aggregation in this paper. The first uses no aggregation at all. During training, we do frame-wise backpropagation on every frame with frame-level labels, which are obtained by repeating the utterance label. During inference, we use the embedding of the last frame as the embedding of the entire utterance. The second is global mean aggregation, which calculates the mean of the embeddings of all frames as the utterance embedding. In this case, we backpropagate once per utterance with the utterance-level label.
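The two simple aggregation variants can be illustrated with a minimal sketch. The function names and the toy two-dimensional embeddings below are illustrative assumptions, not part of the original model code.

```python
from typing import List

Vector = List[float]


def last_frame_embedding(frames: List[Vector]) -> Vector:
    """No aggregation: the last frame's embedding stands in for the utterance."""
    return list(frames[-1])


def global_mean_embedding(frames: List[Vector]) -> Vector:
    """Global mean aggregation: average each dimension over all frames."""
    num_frames = len(frames)
    return [sum(column) / num_frames for column in zip(*frames)]


def frame_level_labels(utterance_label: int, num_frames: int) -> List[int]:
    """Frame-wise training targets obtained by repeating the utterance label."""
    return [utterance_label] * num_frames


# Toy utterance with three frames of 2-dimensional embeddings (hypothetical values).
frames = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
print(last_frame_embedding(frames))   # [5.0, 4.0]
print(global_mean_embedding(frames))  # [3.0, 2.0]
print(frame_level_labels(1, 3))       # [1, 1, 1]
```

Note that the global mean needs every frame of the utterance before it can be computed, whereas the last-frame embedding is available as soon as the final frame arrives.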

2.2.2 Attention aggregation

The attention aggregation calculates the utterance-level representation as a weighted average of all frame-level embeddings. Similar to the global mean aggregation, it uses the utterance-level label during training. Our previous work [haung2019study] showed that the attention method performs better than both the utterance-level embedding with global mean aggregation and the frame-wise embeddings without any aggregation.
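As a rough illustration of a weighted average with normalized weights, the sketch below scores each frame with a dot product against a vector `score_weights` and normalizes the scores with a softmax. The scoring function and all names are assumptions made for this example; the source does not specify the form of the attention used.

```python
import math
from typing import List, Tuple

Vector = List[float]


def attention_aggregate(frames: List[Vector],
                        score_weights: Vector) -> Tuple[Vector, List[float]]:
    """Return the attention-weighted average of the frames and the weights used."""
    # One scalar score per frame (hypothetical linear scoring function).
    scores = [sum(w * x for w, x in zip(score_weights, frame)) for frame in frames]
    # Softmax over ALL frames; subtracting the max keeps exp() numerically stable.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    normalizer = sum(exps)
    alphas = [e / normalizer for e in exps]
    dim = len(frames[0])
    utterance = [sum(a * frame[d] for a, frame in zip(alphas, frames))
                 for d in range(dim)]
    return utterance, alphas


frames = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
# With all-zero scoring weights every frame gets the same weight,
# so the result reduces to the global mean.
uniform_embedding, uniform_alphas = attention_aggregate(frames, [0.0, 0.0])
# A non-zero scoring vector shifts weight toward higher-scoring frames.
skewed_embedding, skewed_alphas = attention_aggregate(frames, [1.0, 0.0])
print(uniform_alphas)
print(skewed_alphas)
```

In this sketch the softmax normalizer sums over every frame of the utterance, so the weights are only final once the last frame has been seen.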

2.2.3 Causal mean aggregation

The drawback of the attention methods used in previous work [haung2019study, norouzian2019exploring, kao2020comparison, ford2019deep] is that the attention weights for every frame can only be calculated once all frames are available, which is not feasible for online streaming tasks. Instead of generating the utterance-level representation at the end of the utterance, we generate a frame-level representation by aggregating the past frames. Specifically, we average all frame-level embeddings $h_1, \dots, h_t$ up to the current time point $t$ and use the result $\bar{h}_t$ as the representation of the $t$-th frame. We call this the causal mean aggregation at frame level:

$$\bar{h}_t = \frac{1}{t} \sum_{i=1}^{t} h_i \qquad (1)$$
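A minimal sketch of the causal mean, assuming plain Python lists as embeddings: the offline function averages each prefix of the utterance directly, while the streaming class keeps only a frame counter and a running sum. The class and function names are illustrative.

```python
from typing import List

Vector = List[float]


def causal_mean_offline(frames: List[Vector]) -> List[Vector]:
    """For every frame t, average all frames from the first one up to t."""
    outputs = []
    for t in range(1, len(frames) + 1):
        prefix = frames[:t]
        outputs.append([sum(column) / t for column in zip(*prefix)])
    return outputs


class StreamingCausalMean:
    """Same quantity computed one frame at a time with O(dim) state."""

    def __init__(self, dim: int) -> None:
        self.count = 0
        self.total = [0.0] * dim

    def update(self, frame: Vector) -> Vector:
        self.count += 1
        self.total = [s + x for s, x in zip(self.total, frame)]
        return [s / self.count for s in self.total]


frames = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
aggregator = StreamingCausalMean(dim=2)
streamed = [aggregator.update(frame) for frame in frames]
print(streamed)  # [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]]
```

At the last frame the causal mean coincides with the global mean of the utterance, and at every earlier frame a representation is already available without waiting for future input.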


During online inference, we implement the causal mean aggregation with a counter and a mean operation, expressed as part of the neural network model definition in order to hide this logic from the inference engine. We find it convenient to use two additional LSTM components for this: a frame-counting LSTM and an averaging LSTM. The LSTM structure, with its state values, naturally allows “side-loading” the frame count into both components. The frame-counting LSTM counts the frames, and the averaging LSTM performs the summation and division, as shown in Figure 2. Each of the two components has only one layer with fixed weights, given in Equation 2 and Equation 3, respectively. All activation functions in both components are LeakyReLU. The LSTM gates and weights are set as follows:


where the weights and biases are those of the forget gate, input gate, output gate, and cell input of the frame-counting LSTM, respectively. The input of this component is the output of the original LSTM component, and its output is the frame count. We then concatenate the reciprocal of the frame count with the original LSTM output and feed the result into the averaging LSTM, which has one LSTM layer.


where the weights and biases are those of the forget gate, input gate, output gate, and cell input of the averaging LSTM, respectively. The gate definitions use an element-wise product together with a constant matrix whose elements are all equal. Similar to the idea shown in [kao2020comparison], we also move the aggregation component after the DNN and apply it to the prediction output of the DNN instead of the LSTM output.
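To make the counter-plus-mean idea concrete, the sketch below uses the generic LSTM cell update (new cell = forget gate × previous cell + input gate × candidate) with hand-set gate values. The specific gate values chosen here (forget and input gates of 1 for the counter, and gates of 1 − 1/t and 1/t for the mean) are one assumption that produces a frame count and a running mean; they are not taken from Equations 2 and 3, and the function names are illustrative.

```python
from typing import List

Vector = List[float]


def gated_cell_update(prev_cell: float, forget: float,
                      inp: float, candidate: float) -> float:
    """Generic LSTM-style cell update: c_t = f * c_{t-1} + i * g."""
    return forget * prev_cell + inp * candidate


def causal_mean_with_gated_cells(frames: List[Vector]) -> List[Vector]:
    """Running mean built from two gated recurrences (assumed gate values)."""
    count = 0.0
    mean = [0.0] * len(frames[0])
    outputs = []
    for frame in frames:
        # Counter cell: f = 1, i = 1, g = 1, so the state grows by one per frame.
        count = gated_cell_update(count, 1.0, 1.0, 1.0)
        reciprocal = 1.0 / count  # side-loaded into the averaging cell
        # Averaging cell: f = 1 - 1/t keeps the old mean, i = 1/t mixes in the new frame.
        mean = [gated_cell_update(m, 1.0 - reciprocal, reciprocal, x)
                for m, x in zip(mean, frame)]
        outputs.append(list(mean))
    return outputs


frames = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
running = causal_mean_with_gated_cells(frames)
print(running[-1])  # approximately [3.0, 2.0], the mean of all three frames
```

The point of the sketch is that both quantities fit the recurrent cell-state pattern, which is consistent with the idea of expressing them inside the model graph rather than in separate inference code.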

The LSTM is not the only choice for the frame counter in our implementation. A one-layer RNN with LeakyReLU activation can replace the frame-counting LSTM:


where the input is the output of the original LSTM component and the recurrent state is initialized with 0. Like the frame-counting LSTM, the RNN outputs the frame index. However, an RNN cannot exactly mimic the averaging LSTM, because the weight values that this would require are time-dependent, which an RNN with fixed weights cannot provide.

2.2.4 RNN aggregation

In the previous section, we showed how to calculate the embedding at each frame by causal mean aggregation, potentially using an LSTM or RNN as the frame counter in our implementation. Alternatively, one can use a trainable RNN layer as a different aggregation method, besides the causal mean, to get the aggregated embedding:


where the two weights apply to the representation of the current frame and to the historical accumulation, respectively. The bias term is set to 0. Instead of fixing the weights or predefining them as hyperparameters that depend only on the frame index, we use a one-layer RNN to learn them. However, the RNN layer potentially suffers from the vanishing gradient problem over time, as we will see in the results section.
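One way to see the difference between fixed recurrent weights and the causal mean is to compare the effective weight each past frame receives. The sketch below assumes a scalar, linear recurrence (the nonlinearity is deliberately ignored), and the weight names `w_current` and `w_history` are hypothetical.

```python
from typing import List


def causal_mean_weights(t: int) -> List[float]:
    """Causal mean: every one of the first t frames gets weight 1/t."""
    return [1.0 / t] * t


def linear_rnn_weights(t: int, w_current: float, w_history: float) -> List[float]:
    """Effective weight of frame i (1-indexed) at time t for the recurrence
    a_t = w_current * h_t + w_history * a_{t-1}, with a_0 = 0."""
    return [w_current * w_history ** (t - i) for i in range(1, t + 1)]


def linear_rnn_aggregate(inputs: List[float],
                         w_current: float, w_history: float) -> float:
    """Run the same recurrence step by step on scalar inputs."""
    state = 0.0
    for h in inputs:
        state = w_current * h + w_history * state
    return state


inputs = [1.0, 3.0, 5.0, 7.0]
weights = linear_rnn_weights(4, 0.5, 0.5)
print(weights)                 # [0.0625, 0.125, 0.25, 0.5]
print(causal_mean_weights(4))  # [0.25, 0.25, 0.25, 0.25]
print(linear_rnn_aggregate(inputs, 0.5, 0.5))  # 5.1875
```

With a history weight below one, the contribution of early frames shrinks geometrically, whereas the causal mean weights all frames equally and changes its weights with the frame index.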

2.3 Streaming CNN layer

In order to enable the model with convolutional operations for online streaming, we use a sliding window over the input of each convolutional layer, shifting in the time dimension during inference [streamcnn]. As the window shifts to the right one frame at a time, we drop the oldest computation output from the previous window, then cache the rest of the output and feed it into the next window of the same convolutional layer, which avoids redundant computation. We initialize the “previous” output at the first frame to zeros. Figure 3 illustrates online inference for one convolutional layer and one residual block on the first three input frames. For simplicity, we draw the frequency dimension with size one.

Figure 2: The implementation of causal mean aggregation for online streaming

Figure 3: Streaming convolutional operation. The shaded squares represent the initialized “previous” zero outputs at the first frame. The squares with a dashed frame represent the frames of padded zeros, and the squares with a solid frame represent the frames of real input.
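The caching idea can be sketched for a simplified case: a one-dimensional convolution over time with stride 1, where each layer keeps the most recent inputs it still needs (that is, the cached outputs of the layer below). The class name, the kernels, and the reduction to one dimension are assumptions for illustration; the actual model uses two-dimensional convolutions and residual blocks.

```python
from typing import List, Sequence


class StreamingConv1d:
    """Causal 1-D convolution that consumes one frame per call."""

    def __init__(self, kernel: Sequence[float]) -> None:
        self.kernel = list(kernel)
        # Cache of the previous (kernel_size - 1) inputs, initialized to zeros.
        self.cache = [0.0] * (len(self.kernel) - 1)

    def step(self, x: float) -> float:
        window = self.cache + [x]
        y = sum(w * v for w, v in zip(self.kernel, window))
        # Drop the oldest cached value and keep the rest for the next frame.
        self.cache = window[1:]
        return y


def causal_conv1d_full(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Reference: process the whole signal at once with left zero padding."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(signal)
    return [sum(w * padded[t + j] for j, w in enumerate(kernel))
            for t in range(len(signal))]


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel_a = [1.0, 0.0, -1.0]
kernel_b = [0.5, 0.5]

# Offline: two stacked layers applied to the full utterance.
offline = causal_conv1d_full(causal_conv1d_full(signal, kernel_a), kernel_b)

# Streaming: the same two layers fed one frame at a time.
layer_a, layer_b = StreamingConv1d(kernel_a), StreamingConv1d(kernel_b)
streamed = [layer_b.step(layer_a.step(x)) for x in signal]
print(streamed == offline)  # True
```

Each streaming step touches only one window per layer, so the work per new frame does not grow with the number of frames already processed.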

3 Experiments

We use real recordings of natural human interactions with voice-controlled far-field devices for training and testing the models. The training data comprises 6M utterances, of which 4M are device-directed examples and the remaining 2M are non device-directed examples. The testing data consists of 35,000 utterances.

The ResLSTM model uses one kernel size for its convolutional layers, and its stride in the time dimension is always 1. The output channel size of the first convolution layer is 8, and each of the following 6 residual blocks has its own output channel size. After the last average pooling layer, we flatten the frequency dimension together with the channel dimension and feed the outputs to the LSTM. We compared our ResLSTM model with LSTM, ResNet, and CLDNN models individually, fixing the aggregation component to be attention. Two LSTM models are considered here. The LSTM-S has 3 LSTM layers of 64 units and is the model used in [haung2019study]. The LSTM-L has 5 layers of 128 units, which is comparable to the ResLSTM model in terms of the number of parameters. The ResNet-only model is similar to the one in [hershey2017cnn]. We keep most of the ResNet50 [hershey2017cnn] setup the same except for the following changes, which are based on our preliminary experiments. First, we set all kernel sizes to one value and all strides to one value, and we remove the max pooling layer after the first convolutional layer. Second, we reduce the channel sizes in each residual block. The CLDNN model we used is similar to the one used in [guo2017attention]. It has 2 convolutional layers followed by one max pooling layer, and 5 LSTM layers with 128 units. All the convolutional layers are followed by a batch norm layer.

All our models are trained on 256-dimensional log energy of short-time Fourier transform (log-STFT256) features. For global aggregation methods, such as attention and mean aggregation, the utterance label is used for loss calculation. We use the Adam optimizer with the default settings [torchoptimizer] to minimize the cross-entropy loss. We use low-frame-rate input in which each frame covers 30 ms, and we truncate the input audio at a length of 300 frames (9 seconds).

During training, we feed the entire utterance to the network. In order to match training with online streaming inference, we pad zeros on the left side of the time dimension of the input of every convolutional layer during training. Therefore, each convolutional layer sees only past frames of its input and no future frames. We set the stride in the time dimension of all convolutional layers to 1.
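The effect of left-only padding can be checked on a toy example. The sketch assumes a one-dimensional convolution with stride 1 and a padding amount of kernel size minus one, which is the amount needed to keep the output length equal to the input length; the helper name and the values are illustrative.

```python
from typing import List, Sequence


def conv1d_padded(signal: Sequence[float], kernel: Sequence[float],
                  pad_left: int, pad_right: int) -> List[float]:
    """Stride-1 convolution over a zero-padded signal."""
    padded = [0.0] * pad_left + list(signal) + [0.0] * pad_right
    k = len(kernel)
    return [sum(w * padded[t + j] for j, w in enumerate(kernel))
            for t in range(len(padded) - k + 1)]


kernel = [1.0, 1.0, 1.0]
original = [1.0, 2.0, 3.0, 4.0]
changed_future = [1.0, 2.0, 3.0, 100.0]  # only the last frame differs

# Left-only (causal) padding: earlier outputs ignore the changed last frame.
causal_a = conv1d_padded(original, kernel, pad_left=2, pad_right=0)
causal_b = conv1d_padded(changed_future, kernel, pad_left=2, pad_right=0)
print(causal_a)  # [1.0, 3.0, 6.0, 9.0]
print(causal_b)  # [1.0, 3.0, 6.0, 105.0]

# Symmetric padding: the output at index 2 already depends on the last frame.
same_a = conv1d_padded(original, kernel, pad_left=1, pad_right=1)
same_b = conv1d_padded(changed_future, kernel, pad_left=1, pad_right=1)
print(same_a)  # [3.0, 6.0, 9.0, 7.0]
print(same_b)  # [3.0, 6.0, 105.0, 103.0]
```

With left-only padding, the value computed for a frame during full-utterance training is the same value a streaming model would compute when that frame arrives.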

4 Results

We first compare the performance across different model topologies. In Table 1, we report AUC (area under the curve), EER (equal error rate), and ACC (accuracy) as our performance metrics, together with the number of parameters of each model. We use the results of the LSTM-S model as the baseline. Increasing the width and depth of the LSTM-S to the LSTM-L reduces the EER by 22.6%, while the number of parameters increases from 0.3M to 1M. Adding the CNN component on top of the LSTM-L, as in the CLDNN, improves the EER further, to a 30.0% reduction relative to the baseline. We also find that simply adding more CNN layers to the CLDNN structure does not improve the performance on the test dataset. The ResNet-only model has 50 convolutional layers and improves the EER by 38.2% relative to the baseline, but it has about 1.5M parameters, more than the other model topologies. Finally, the ResLSTM model, which has 0.9M parameters, improves the EER the most, by 41.1%.

Topology   AUC      EER      ACC     Parameters
LSTM-L     +7.6%    -22.6%   +5.8%   1.0M
CLDNN      +9.7%    -30.0%   +6.0%   1.1M
ResNet     +11.9%   -38.2%   +8.8%   1.5M
ResLSTM    +12.2%   -41.1%   +8.7%   0.9M
Table 1: Performance of different model topologies with attention aggregation, relative to the LSTM-S baseline
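For readers less familiar with the metric, the sketch below shows one common way to estimate the EER from scores by sweeping a decision threshold, and how a relative change against a baseline is formed. The threshold convention, the function names, and all numeric values are assumptions for illustration; they are not the evaluation code or the absolute error rates behind the tables.

```python
from typing import List, Sequence


def equal_error_rate(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Estimate the EER; label 1 = device-directed, label 0 = non device-directed.

    An utterance is accepted when its score is >= the threshold. The EER is
    taken at the threshold where false accept and false reject rates are closest.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    best_gap, best_eer = None, None
    for threshold in sorted(set(scores)) + [float("inf")]:
        false_accept = sum(s >= threshold for s in negatives) / len(negatives)
        false_reject = sum(s < threshold for s in positives) / len(positives)
        gap = abs(false_accept - false_reject)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (false_accept + false_reject) / 2.0
    return best_eer


def relative_change(new_value: float, baseline_value: float) -> float:
    """Signed relative change, e.g. -0.4 means a 40% reduction."""
    return (new_value - baseline_value) / baseline_value


# Hypothetical scores and labels for eight utterances.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
eer = equal_error_rate(scores, labels)
print(eer)  # 0.25

# Hypothetical absolute EERs, used only to show the arithmetic.
change = relative_change(0.06, 0.10)
print(f"{change:+.1%}")  # -40.0%
```

A negative relative change in EER therefore indicates an improvement over the baseline, while a negative change in AUC or ACC would indicate a degradation.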

Next, we fix the model topology to be the ResLSTM and compare different aggregation methods: frame-level training without any aggregation (Section 2.2.1), global mean (Section 2.2.1), utterance-level attention (Section 2.2.2), causal mean (Section 2.2.3), and a one-layer RNN (Section 2.2.4). We use the results of the ResLSTM without aggregation as the baseline. We applied the causal mean aggregation to either the LSTM output or the prediction output from the DNN. Results are shown in Table 2. As expected, global mean aggregation and the attention method improve the EER by 7.8% and 9.0%, respectively, which matches the finding in our previous work [haung2019study]. The two models with frame-level causal mean aggregation show similar EER performance, improving the EER by 8.5% and 7.8%. We conclude that the model with causal mean aggregation can achieve performance similar to the model with attention when evaluated at the end of utterances. Since there is no significant performance difference between the two causal mean methods, we use only one of them for the rest of the paper. Using a one-layer RNN with the tanh activation function improves the EER only slightly, by 0.6% compared to the baseline. The RNN with the ReLU activation function performs even worse, increasing the EER by 2.4% compared to the baseline. We believe this is due to the vanishing gradient issue of the RNN layer over time.

aggregation method AUC EER ACC
global mean +1.3% -7.8% +1.0%
attention +1.3% -9.0% +0.6%
causal mean on +1.3% -8.5% +1.1%
causal mean on +1.5% -7.8% +1.7%
RNN-ReLU 0% +2.4% -1.0%
RNN-tanh +0.2% -0.6% 0%
Table 2: ResLSTM model with different aggregation methods

Instead of evaluating the prediction results only at the end of the utterance, we also compare the causal mean aggregation to attention by evaluating the predictions at early frames in an utterance. In Table 4, we evaluate the model performance in the first several seconds of each utterance. The causal mean always performs better than the attention method in terms of EER, especially when evaluated at the first two seconds. Moreover, in Table 3, we compare the two aggregation methods by evaluating the prediction results at different relative positions within each utterance; for example, half of the full length corresponds to the middle of an utterance. The causal mean aggregation method still consistently outperforms the attention method, and the reduction in EER compared to the attention method is especially clear at the middle of the utterance.

aggregation method
causal mean on -16.0% -13.3% -7.6% -3.6% +0.6%
Table 3: EER at different relative time points of the utterance, expressed as fractions of the full utterance length

This robustness property of the causal mean aggregation method is critical for streaming ASR applications for two reasons. First, in practice, the end of the utterance is determined by a separate end-of-utterance detector (also known as an end-pointer) for the purpose of ASR and can therefore vary significantly from utterance to utterance. Second, depending on the application, an early DD/ND decision can be desirable in order to take action prior to reaching the end of the utterance.

aggregation method
causal mean on -4.6% -24.7% -4.1% -3.3% -2.8%
Table 4: EER at different times, in seconds, since the beginning of the utterance
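Evaluating at an early time point amounts to reading the streaming prediction at a particular frame. The sketch below shows one possible mapping from a relative position or an elapsed time to a frame index, using the 30 ms frame length stated in the experimental setup. The rounding conventions and function names are assumptions, since the exact rule used for the tables is not stated.

```python
import math

FRAME_SECONDS = 0.03  # each input frame covers 30 ms


def frame_index_at_fraction(num_frames: int, fraction: float) -> int:
    """0-based index of the last frame inside the first `fraction` of the utterance."""
    frames_seen = max(1, math.ceil(fraction * num_frames))
    return min(num_frames, frames_seen) - 1


def frame_index_at_seconds(num_frames: int, seconds: float) -> int:
    """0-based index of the frame reached after `seconds`, clipped to the utterance."""
    frames_seen = int(round(seconds / FRAME_SECONDS))
    return max(1, min(num_frames, frames_seen)) - 1


# A 300-frame (9 second) utterance evaluated at its midpoint and after 2 seconds.
print(frame_index_at_fraction(300, 0.5))  # 149
print(frame_index_at_seconds(300, 2.0))   # 66
# A short 100-frame utterance is clipped to its last frame.
print(frame_index_at_seconds(100, 9.0))   # 99
```

Given a list of per-frame scores from a streaming model, the score at the returned index is the prediction that would be available at that moment.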

We also tried a causal attention aggregation method, which masks out the future frames in the attention calculation at each frame. However, the computational cost is prohibitive: the attention calculation has to be repeated at every frame, which is much more expensive than causal mean aggregation. We leave reducing its training time to future work.

5 Conclusions

In this paper, we proposed a ResLSTM model with causal mean aggregation for online streaming device-directed speech detection. Experimental results showed that the ResLSTM model topology outperforms other topologies such as LSTM, ResNet, and CLDNN. We showed how to cache convolutional operations for online streaming inference with CNNs. We also proposed a causal mean aggregation method to obtain a more robust frame-level representation, and showed that the causal mean aggregation method can achieve the same performance as the attention aggregation method on full utterances and significantly outperforms attention when used for early decision making, prior to reaching the end of the utterance.