
I see what you hear: a vision-inspired method to localize words

by   Mohammad Samragh, et al.

This paper explores the possibility of using visual object detection techniques for word localization in speech data. Object detection has been thoroughly studied in the contemporary literature for visual data. Since an audio signal can be interpreted as a 1-dimensional image, object localization techniques can be fundamentally useful for word localization. Building upon this idea, we propose a lightweight solution for word detection and localization. We use bounding box regression for word localization, which enables our model to detect the occurrence, offset, and duration of keywords in a given audio stream. We experiment with LibriSpeech and train a model to localize 1000 words. Compared to existing work, our method reduces model size by 94%.



1 Introduction

Recent advancements in automatic speech recognition (ASR) technologies have made human-machine interaction seamless and natural [20, 2, 19]. ASR systems are designed to accurately comprehend any speech data; hence, they are often computationally expensive, power hungry, and memory intensive. Limited-vocabulary acoustic models can only recognize certain words, but they offer computationally efficient solutions [12, 1, 16]. In this paper we focus on the latter.

Limited-vocabulary models are important for several reasons. First, users often interact with their devices using simple commands [15], and recognizing these commands may not necessarily require an ASR model. Second, limited-vocabulary models are needed to detect trigger phrases, e.g., “Alexa, OK Google, hey Siri”, which indicate that a user wants to interact with the ASR model. A limited vocabulary model should (a) accurately recognize if certain words are spoken, and (b) precisely locate the occurrence time of the words. The latter is rather important due to privacy and efficiency reasons, as an ASR model should only be queried/executed when users intend to interact with it. Additionally, a limited-vocabulary model with localization capabilities can improve the accuracy of an ASR model by providing noise-free payloads to it.

A plethora of existing work focuses on keyword detection [12, 1, 16], yet efficient and accurate word localization needs more investigation. In computer vision, object localization has been solved using bounding box detection [11, 7, 21]. Since an audio signal can be interpreted as a 1-D image, similar techniques can in principle be used to localize words in audio. The first effort in this track is SpeechYolo [13], which shows the great potential of using visual object detection techniques for word localization.

This paper presents an alternative vision-based word localizer. In our design, we pay attention to several important properties: (a) a small memory footprint, (b) the ability to process streaming audio, (c) accurate word detection, and (d) accurate word localization. To achieve these goals, we propose customized metrics that indicate the presence or absence of words in the input audio. We then devise appropriate loss functions and train our model to perform three tasks simultaneously on streaming audio: detection, classification, and localization. We experiment with LibriSpeech and train a model to localize 1000 words. Compared to SpeechYolo, our model is 94% smaller, is capable of processing streaming audio, and achieves a better F1 score.

2 Related work

The task of detecting limited vocabulary has been studied in the keyword spotting literature [12, 1, 16], where the goal is to detect if a keyword is uttered in a segmented audio clip. Our problem definition is more challenging as we expect our model to precisely localize keywords in streaming audio. In addition, the majority of existing literature targets a small vocabulary, e.g., 12-36 words in Google Speech Commands [18], whereas we show scalability to 1000 words in our experiments.

Post-processing approaches can be utilized for word localization. In DNN-HMM models [14], for instance, the sequence of predicted phonemes can be traced to find word boundaries. Another example is [10], where word scores are generated by a CNN model for streaming audio segments, and the location of the words can be estimated accordingly. Since the above models are not directly trained for localization, their localization performance is likely not optimal. A more precise localization can be obtained by forced alignment after transcribing the audio using an ASR model [8], or by coupling an ASR model with a CTC decoder [4]. However, these solutions are computationally expensive.

Recently, principles from Yolo object detection [11] have been used for speech processing, which incorporate both detection and accurate localization into the model training phase. YOHO [17] can localize audio categories, e.g., it can distinguish music from speech. A more related problem definition to ours is studied in SpeechYolo [13], which shows great potential of vision-based localization techniques by localizing words in segments of 1-second audio. In this paper, we present a more carefully designed word localizer that is capable of processing streaming audio. Additionally, we show that our design yields better detection and localization accuracy with smaller memory footprint.

3 Problem formulation

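We aim to detect words in a lexicon of $C$ words, indexed by $c \in \{1, \dots, C\}$. Let $\{(c_k, b_k, e_k)\}_{k=1}^{K}$ be the ground-truth events in an utterance $X$. Each event contains the word label $c_k$, the event beginning time $b_k$, and the event ending time $e_k$. Our goal is to train a model $f_\theta$, parameterized by $\theta$, that predicts proposals:

$$\{(\hat{c}_k, \hat{b}_k, \hat{e}_k)\}_{k=1}^{\hat{K}} = f_\theta(X).$$

The recognizer model should be trained such that the predicted events match the ground-truth events.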

4 Methodology

The overall flow of our word localization is shown in Figure 1. The CNN backbone converts the input audio into a feature matrix $z$. The rest of the modules use this feature matrix to detect events (encoded by $\hat{y}$), classify event types (encoded by $\hat{s}$), and predict the event offset $\hat{o}$ and length $\hat{l}$. Finally, these predictions are processed by the utterance processing module, which proposes the output events.

Figure 1: High-level overview of our word localization model.

Backbone. The CNN model receives an utterance $X$ and converts it into a feature matrix $z$. The rows of $z$ correspond to utterance segments $X[tS : tS + R]$, which we denote by $x_t$ in the remainder of the paper for notational simplicity. Here, $R$ is the length of each segment (a.k.a. the network's receptive field) and $S$ is the shift between two consecutive segments (a.k.a. the network's stride). In an utterance of length $T$, there are $N = \lfloor (T - R)/S \rfloor + 1$ total segments.
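The segmentation described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes no padding, so only full-length segments are counted, and it takes the receptive field (13200 samples) and the FBank hop (160 samples) from Table 1 as the segment length and stride. The function names and the 1-second utterance length are hypothetical.

```python
def num_segments(total_len: int, receptive_field: int, stride: int) -> int:
    """Count the full-length segments that fit in an utterance (no padding assumed)."""
    if total_len < receptive_field:
        return 0
    return (total_len - receptive_field) // stride + 1


def segment_bounds(total_len: int, receptive_field: int, stride: int):
    """Return (start, end) sample indices of every segment; end is exclusive."""
    count = num_segments(total_len, receptive_field, stride)
    return [(t * stride, t * stride + receptive_field) for t in range(count)]


# Receptive field and FBank hop as listed in Table 1; the utterance length is made up.
R, S = 13200, 160
T = 16000
bounds = segment_bounds(T, R, S)
print(len(bounds), bounds[0], bounds[-1])
```

Each row of the feature matrix then corresponds to one entry of `bounds`, which is what lets the model run on audio of arbitrary length.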

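Event detection. The ground-truth event detection label is a binary matrix $y$ with one row per segment and one column per lexicon word, where $y_{t,c}$ specifies whether segment $x_t$ contains the $c$-th word in the lexicon. To assign these hard labels, we compute the intersection over ground-truth (iog) metric. Let $(c, b, e)$ be a ground-truth event. We compute:

$$\mathrm{iog}_{t,c} = \frac{|x_t \cap [b, e]|}{e - b}.$$

We threshold the metric with an upper and a lower threshold and assign labels accordingly:

  • if the iog is above the upper threshold, word $c$ is almost perfectly contained in $x_t$; in this case we have $y_{t,c} = 1$.

  • if the iog is below the lower threshold, word $c$ is not contained in $x_t$; in this case we have $y_{t,c} = 0$.

  • otherwise, word $c$ is partially contained in $x_t$ and partially outside $x_t$. The classification label is “don’t care” in this case.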

Figure 2 illustrates several examples for the above cases.

Figure 2: Examples of positive (green highlight), negative (red highlight), and “don’t care” samples (grey highlight).
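The labeling rule can be illustrated with a short sketch. The two threshold values below (0.95 and 0.5) are placeholders chosen for illustration only, since the values used in the paper are not shown in this text; the function names are likewise hypothetical. Times are in seconds.

```python
def iog(seg_start, seg_end, ev_start, ev_end):
    """Intersection of a segment with an event, divided by the event duration."""
    inter = max(0.0, min(seg_end, ev_end) - max(seg_start, ev_start))
    return inter / (ev_end - ev_start)


def detection_label(segment, event, upper=0.95, lower=0.5):
    """Return 1 (positive), 0 (negative), or None ("don't care").

    `upper` and `lower` are assumed thresholds, not values from the paper.
    """
    score = iog(segment[0], segment[1], event[0], event[1])
    if score > upper:
        return 1
    if score < lower:
        return 0
    return None


segment = (0.0, 0.825)                          # one receptive field, in seconds
print(detection_label(segment, (0.1, 0.5)))     # event fully inside the segment
print(detection_label(segment, (0.6, 0.9)))     # event cut by the segment boundary
print(detection_label(segment, (1.0, 1.2)))     # event outside the segment
```

Normalizing by the event duration rather than by the union (as IOU would) means a short word fully inside a long segment still scores 1.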

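Event detection loss. We compute event probabilities $\hat{y}_{t,c} \in [0, 1]$ from the feature matrix. Our goal here is to encode the presence of word $c$ in segment $x_t$ as $\hat{y}_{t,c}$. If $x_t$ contains word $c$, $\hat{y}_{t,c}$ should be close to 1. To enforce this behaviour, we define the positive loss:

$$\mathcal{L}_{pos} = \frac{\sum_{t,c} \mathbb{1}[y_{t,c} = 1]\, \mathrm{BCE}(\hat{y}_{t,c}, 1)}{\sum_{t,c} \mathbb{1}[y_{t,c} = 1]}$$

where the numerator is the sum of binary cross entropy (BCE) loss over all elements with a ground-truth label of 1, and the denominator is a normalizer that counts the number of positive labels. When word $c$ is not in $x_t$, $\hat{y}_{t,c}$ should be close to 0. To enforce this behaviour, we define the negative loss:

$$\mathcal{L}_{neg} = \frac{\sum_{t,c} \mathbb{1}[y_{t,c} = 0]\, \mathrm{BCE}(\hat{y}_{t,c}, 0)}{\sum_{t,c} \mathbb{1}[y_{t,c} = 0]}$$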
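A minimal sketch of the two detection losses is given below. It assumes that "don't care" entries are simply skipped and that the negative loss is normalized by the number of negative labels, mirroring the positive loss; the second point is an assumption, as is every name in the code.

```python
import math


def bce(prob, target, eps=1e-7):
    """Binary cross entropy for a single probability and a 0/1 target."""
    prob = min(max(prob, eps), 1.0 - eps)   # clamp to avoid log(0)
    return -(target * math.log(prob) + (1 - target) * math.log(1.0 - prob))


def detection_losses(probs, labels):
    """Return (positive_loss, negative_loss).

    probs and labels are nested lists indexed [segment][word];
    labels hold 1, 0, or None for "don't care".
    """
    pos_sum = neg_sum = 0.0
    n_pos = n_neg = 0
    for prob_row, label_row in zip(probs, labels):
        for prob, label in zip(prob_row, label_row):
            if label == 1:
                pos_sum += bce(prob, 1)
                n_pos += 1
            elif label == 0:
                neg_sum += bce(prob, 0)
                n_neg += 1
            # label is None: "don't care", contributes to neither loss
    pos_loss = pos_sum / n_pos if n_pos else 0.0
    neg_loss = neg_sum / n_neg if n_neg else 0.0
    return pos_loss, neg_loss


probs = [[0.9, 0.2], [0.5, 0.1]]
labels = [[1, 0], [None, 0]]
print(detection_losses(probs, labels))
```

Normalizing the two sums separately keeps the few positive entries from being swamped by the much larger number of negative entries in a 1000-word lexicon.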


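Localization loss. To predict event begin and end times, we adopt the CenterNet approach from the visual object detection literature [21]. Let $\mu_t$ be the center of segment $x_t$, and $b$, $e$ be the beginning and end of an event. We define the offset of the event as $o = \frac{b + e}{2} - \mu_t$, and the length of the event as $l = e - b$. If our model can predict $o$ and $l$ accurately, we can calculate $b = \mu_t + o - \frac{l}{2}$, $e = \mu_t + o + \frac{l}{2}$. We generate the offset prediction $\hat{o}$ and the length prediction $\hat{l}$. During training we minimize the normalized L1 distance between the ground-truth and predicted values. Equation 5 shows the offset loss function $\mathcal{L}_o$:

$$\mathcal{L}_{o} = \frac{\sum_{t,c} \mathbb{1}[y_{t,c} = 1]\, |o_{t,c} - \hat{o}_{t,c}|}{\sum_{t,c} \mathbb{1}[y_{t,c} = 1]} \tag{5}$$

The length loss function $\mathcal{L}_l$ is defined similarly.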
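The offset and length encoding can be sketched as below. The sketch assumes the offset is measured from the segment center to the event center (the usual CenterNet-style choice) and that the L1 distance is averaged over the positive events; both are assumptions, and the function names are hypothetical.

```python
def encode(seg_start, seg_end, ev_start, ev_end):
    """Turn event boundaries into (offset, length) relative to the segment center."""
    seg_center = 0.5 * (seg_start + seg_end)
    offset = 0.5 * (ev_start + ev_end) - seg_center
    length = ev_end - ev_start
    return offset, length


def decode(seg_start, seg_end, offset, length):
    """Invert `encode`: recover (begin, end) from offset and length."""
    ev_center = 0.5 * (seg_start + seg_end) + offset
    return ev_center - 0.5 * length, ev_center + 0.5 * length


def normalized_l1(targets, predictions):
    """Mean absolute error over the positive events."""
    if not targets:
        return 0.0
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)


offset, length = encode(0.0, 0.8, 0.2, 0.5)
print(offset, length)
print(decode(0.0, 0.8, offset, length))
```

Because `decode` is the exact inverse of `encode`, accurate offset and length predictions translate directly into accurate begin and end times.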


The predictions defined up to here can be used to generate region proposals for possible events. We initially applied non-maximum suppression (NMS), a technique widely used in image object localization [9], to select the best non-overlapping proposals, but many proposals were wrong. Our investigations revealed two reasons for this:

  • collision: a segment $x_t$ may contain multiple words; thus, $\hat{y}_{t,c}$ might be non-zero for multiple $c$. This makes it unlikely for NMS to make a correct proposal.

  • confusion: even if there is only a single event within $x_t$, it might get confused, e.g., our model might confuse a ground-truth word “two” with “too” or “to”.

Classifier. To address the collision and confusion issues stated above, we propose to train a Softmax classifier whose class scores are masked, through element-wise multiplication, by a binary tensor $m$. The classifier output $\hat{s}_{t,c}$ predicts the probability that segment $x_t$: (a) contains the $c$-th word in the lexicon, for $c \le C$, or (b) does not contain any of the words in the lexicon, for $c = C + 1$. The mask entry $m_{t,c}$ is 1 for the negative class and for every word class proposed by the detection output $\hat{y}_{t,c}$, and 0 otherwise.

With the above masking formulation, our Softmax layer does not need to solve a $(C+1)$-class classification problem. Instead, it solves the problem only for the positive classes proposed by $\hat{y}$ and the negative class. We train the Softmax classifier using the cross-entropy loss:

$$\mathcal{L}_{s} = \mathrm{CE}(\hat{s}, s)$$

where $s$ is the ground-truth label and $\mathrm{CE}$ is the cross-entropy loss. The ground truth $s_t$ is a one-hot vector that indexes either one of the $C$ word classes or the negative class. In cases where $x_t$ contains more than one ground-truth event, $s_t$ indexes the event with the smallest offset $o$. In essence, the Softmax classifier either rejects the events suggested by $\hat{y}$ or selects one event (which has maximum probability) from them.
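One way to realize the masking idea is sketched below. It is an interpretation, not the paper's exact formulation: the mask is applied to the exponentiated scores so that masked classes receive exactly zero probability, and a word class counts as "proposed" when its detection probability exceeds a hypothetical threshold of 0.5.

```python
import math


def masked_softmax(scores, mask):
    """Softmax restricted to entries whose mask is 1; other entries get probability 0."""
    active = [s for s, m in zip(scores, mask) if m]
    peak = max(active)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores, detection_probs, det_threshold=0.5):
    """scores: one score per word class plus a final negative-class score.
    detection_probs: one detection probability per word class.
    det_threshold is an assumed value, not taken from the paper.
    """
    mask = [1 if p > det_threshold else 0 for p in detection_probs]
    mask.append(1)                           # the negative class is always allowed
    return masked_softmax(scores, mask)


# Three word classes plus the negative class; word 1 is not proposed by detection.
print(classify([2.0, 1.0, 0.5, 0.0], [0.9, 0.1, 0.8]))
```

When the detector proposes no word at all, only the negative class survives the mask and the segment is rejected outright.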

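Total loss. The training loss combines the positive and negative detection losses, the offset and length losses, and the classification loss.

Figure 3: Example input and corresponding outputs: (a) ground-truth events, (b) detection signal, (c) classification result, (d) NMS proposed events.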

Utterance processing (inference only). Once the model is trained, we can query it with an audio signal and receive proposed events. Figure 3-(a) shows the ground-truth events for an example audio, and the non-zero columns of the predicted $\hat{y}$ are plotted in Figure 3-(b). The detection $\hat{y}$ is then used in the classifier (Eq. 6) to compute $\hat{s}$, illustrated in Figure 3-(c). At frame $t$, if the maximum prediction over the word classes, $\max_{c} \hat{s}_{t,c}$, is larger than some threshold $\tau$, an event is proposed as:

$$(\hat{c}, \hat{b}, \hat{e}), \qquad \hat{c} = \arg\max_{c} \hat{s}_{t,c},$$

where $\hat{b}$ and $\hat{e}$ are the estimated event begin and event end times. The extracted events are a set of overlapping windows in the time domain, each of which has a score. To suppress repetitive proposals, we use NMS. The NMS output is illustrated in Figure 3-(d); as seen, the proposed events closely match the ground-truth events.
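The suppression step can be illustrated with a greedy 1-D NMS over time intervals. This is a generic sketch of the technique, not the authors' implementation: it is class-agnostic, the IOU threshold of 0.5 is an assumed value, and the proposal tuples are made up.

```python
def iou_1d(a, b):
    """Intersection over union of two (begin, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def nms_1d(proposals, iou_threshold=0.5):
    """Greedy NMS. Each proposal is (score, label, begin, end).

    Proposals are visited from highest to lowest score; one is kept only if it
    does not overlap an already kept proposal by more than iou_threshold.
    """
    kept = []
    for cand in sorted(proposals, key=lambda p: p[0], reverse=True):
        if all(iou_1d(cand[2:], k[2:]) <= iou_threshold for k in kept):
            kept.append(cand)
    return kept


proposals = [
    (0.9, "two", 1.00, 1.30),
    (0.8, "two", 1.02, 1.31),    # near-duplicate from a neighbouring frame
    (0.6, "too", 1.01, 1.29),    # confusable word on the same time span
    (0.7, "hello", 2.00, 2.40),
]
print(nms_1d(proposals))
```

Since consecutive frames are only one stride apart, the same word is typically proposed many times, and NMS collapses these into a single event.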

5 Evaluations

| Module | Layer | dim | k | s | d |
|---|---|---|---|---|---|
| Input | - | [13200] | - | - | - |
| Feature Extractor | FBank | [1, 40, 81] | 400 | 160 | 1 |
| BC-ResNet | Conv2d | [256, 20, 77] | (5, 5) | (2, 1) | (1, 1) |
| | Transition | [128, 20, 75] | 3 | (1, 1) | (1, 1) |
| | Normal | [128, 20, 73] | 3 | (1, 1) | (1, 1) |
| | Transition | [192, 10, 69] | 3 | (2, 1) | (1, 2) |
| | Normal | [192, 10, 65] | 3 | (1, 1) | (1, 2) |
| | Transition | [256, 5, 57] | 3 | (2, 1) | (1, 4) |
| | Normal | [256, 5, 49] | 3 | (1, 1) | (1, 4) |
| | Normal | [256, 5, 41] | 3 | (1, 1) | (1, 4) |
| | Normal | [256, 5, 33] | 3 | (1, 1) | (1, 4) |
| | Transition | [320, 5, 17] | 3 | (1, 1) | (1, 8) |
| | Normal | [320, 5, 1] | 3 | (1, 1) | (1, 8) |
| | Conv2d | [128, 1, 1] | (5, 1) | (1, 1) | (1, 1) |
| Feature (z) | - | [1, 128] | - | - | - |

Table 1: Model architecture used in our experiments. Here, “dim” is the layer output dimension (ignoring the batch dimension, assuming the input audio length is equal to the receptive field), “k” is kernel size, “s” is stride, and “d” is dilation.

Architecture. We use the BCResNet architecture [5] to implement the backbone in Figure 1. The input to the model is raw audio sampled at 16 kHz. The receptive field is $R = 13200$ samples (825 ms) and the stride is $S = 160$ samples (10 ms). The layers and output dimensions per layer are shown in Table 1 when the raw audio length is equal to $R$. The entire CNN backbone encodes each segment as a 128-dimensional vector.
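The time-axis numbers in Table 1 can be checked with a few lines of arithmetic. The sketch below assumes unpadded convolutions along time and reads the second entry of each (frequency, time) pair in Table 1 as the time-axis stride and dilation; under those assumptions it reproduces the last entry of every "dim" row and the overall receptive field and stride.

```python
def out_len(n, kernel, stride=1, dilation=1):
    """Output length of an unpadded 1-D convolution."""
    return (n - dilation * (kernel - 1) - 1) // stride + 1


def receptive_field(layers):
    """Return (receptive field, total stride) of stacked layers, in input samples."""
    field, jump = 1, 1
    for _, kernel, stride, dilation in layers:
        field += (kernel - 1) * dilation * jump
        jump *= stride
    return field, jump


# (name, kernel, stride, dilation) along the time axis, read from Table 1.
TIME_LAYERS = [
    ("FBank", 400, 160, 1),
    ("Conv2d", 5, 1, 1),
    ("Transition", 3, 1, 1),
    ("Normal", 3, 1, 1),
    ("Transition", 3, 1, 2),
    ("Normal", 3, 1, 2),
    ("Transition", 3, 1, 4),
    ("Normal", 3, 1, 4),
    ("Normal", 3, 1, 4),
    ("Normal", 3, 1, 4),
    ("Transition", 3, 1, 8),
    ("Normal", 3, 1, 8),
    ("Conv2d", 1, 1, 1),     # final (5, 1) kernel spans frequency only
]

n = 13200
time_dims = []
for _, kernel, stride, dilation in TIME_LAYERS:
    n = out_len(n, kernel, stride, dilation)
    time_dims.append(n)

print(time_dims)
print(receptive_field(TIME_LAYERS))
```

At a 16 kHz sampling rate, 13200 samples correspond to 825 ms and 160 samples to 10 ms.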

Dataset. Similar to prior work [13], we use the Montreal Forced Aligner to extract the start and end times of words in the LibriSpeech dataset. The lexicon is similarly chosen as the 1000 words that appear most frequently in the training set.

Training. We train the network for 100 epochs with the Adam optimizer [6] and a cosine annealing scheduler that gradually reduces the learning rate from its initial value to its final value by the end of training. During training, we randomly cut a small portion from the beginning of each utterance to make the model agnostic to shifts. No other data augmentation is applied during training. The batch size in our experiments is 32 per GPU, and we use PyTorch distributed data parallelism to train our model on 8 NVIDIA V100 GPUs.
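For reference, the standard cosine annealing rule can be written in a few lines. The initial and final learning rates below are placeholders, since the values used in the experiments are not shown in this text, and the function name is hypothetical.

```python
import math


def cosine_lr(step, total_steps, lr_max, lr_min):
    """Cosine annealing: lr_max at step 0, decaying smoothly to lr_min at total_steps."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Placeholder values for illustration only.
LR_MAX, LR_MIN, EPOCHS = 1e-3, 1e-5, 100
for epoch in (0, 25, 50, 75, 100):
    print(epoch, cosine_lr(epoch, EPOCHS, LR_MAX, LR_MIN))
```

The schedule changes slowly at the start and at the end of training and fastest in the middle.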

5.1 Word detection and localization

For evaluation, we run NMS on events whose score exceeds the decision threshold to obtain the proposed events. We then compute true positives (TP) as the number of proposed events that overlap with a ground-truth event of the same class. If a ground-truth event is not predicted, we count it as a false negative (FN). If a proposed event is not in the ground-truth events or the predicted class is wrong, we count the event as a false positive (FP). We then compute precision, recall, F1-score, actual accuracy, and average IOU the same way SpeechYolo does [13]. Table 2 summarizes the performance comparison between our method and SpeechYolo. Here, “Ours-L” represents the model in Table 1 and “Ours-S” is the same network with half the number of features per layer. We report the SpeechYolo performance numbers from the original paper (we were not able to reproduce their results as the authors do not provide pre-processed inputs in their GitHub repository). Our large model is 94% smaller than SpeechYolo. It can process arbitrarily long audio, compared to 1-second segments in SpeechYolo. Its F1-score, IOU, and actual accuracy are also consistently higher.
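The counting rules above can be sketched as follows. The sketch assumes each ground-truth event can be matched by at most one proposal, which the text does not state explicitly, and it covers only precision, recall, and F1. Event tuples and names are illustrative.

```python
def overlaps(a, b):
    """True if two (label, begin, end) events share any time span."""
    return min(a[2], b[2]) > max(a[1], b[1])


def evaluate(proposed, truth):
    """Return (precision, recall, f1) for lists of (label, begin, end) events."""
    matched = set()
    tp = fp = 0
    for prop in proposed:
        hit = next(
            (i for i, gt in enumerate(truth)
             if i not in matched and gt[0] == prop[0] and overlaps(prop, gt)),
            None,
        )
        if hit is None:
            fp += 1              # no ground-truth event of this class overlaps
        else:
            matched.add(hit)
            tp += 1
    fn = len(truth) - len(matched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


truth = [("the", 0.0, 0.2), ("cat", 0.3, 0.6), ("sat", 0.7, 1.0)]
proposed = [("the", 0.01, 0.19), ("cat", 0.35, 0.65), ("mat", 0.7, 1.0)]
print(evaluate(proposed, truth))
```

In this toy example the wrong-class proposal "mat" counts as a false positive and the unmatched "sat" as a false negative.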

| Data | Method | Size | Prec. | Recall | F1 | Actual acc. | IOU |
|---|---|---|---|---|---|---|---|
| test_clean | SY [13] | 108 MB | 0.836 | 0.779 | 0.807 | 0.774 | 0.843 |
| | Ours-L | 6.2 MB | **0.863** | **0.880** | **0.872** | **0.873** | **0.857** |
| | Ours-S | 2.1 MB | **0.852** | 0.770 | **0.809** | 0.759 | **0.855** |
| test_other | SY [13] | 108 MB | 0.697 | 0.553 | 0.617 | - | - |
| | Ours-L | 6.2 MB | **0.764** | **0.713** | **0.738** | 0.704 | 0.850 |
| | Ours-S | 2.1 MB | **0.777** | 0.549 | **0.643** | 0.531 | 0.849 |

Table 2: Comparison of our trained models with SpeechYolo. In each column, metrics that outperform SpeechYolo are bold. The decision threshold is separately tuned for SpeechYolo and our work to maximize the F1-scores.

The better performance of our work is due to the fact that we customize the underlying detector design (Figure 1) specifically to overcome challenges in acoustic modeling. To illustrate the effect of different aspects of our design, we perform an ablation study with the large model in Table 3. In summary, the classifier improves precision, the event detection signal improves recall, and the offset and length predictions improve the localization capability (IOU).

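| Detection | Classifier | Offset and length | Precision | Recall | IOU |
|---|---|---|---|---|---|
| Yes | No | Yes | 0.338 | 0.871 | 0.807 |
| No | Yes | Yes | 0.835 | 0.652 | 0.848 |
| Yes | Yes | No | 0.892 | 0.522 | 0.396 |
| Yes | Yes | Yes | 0.863 | 0.880 | 0.857 |

Table 3: Effect of the predicted signals on performance.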

5.2 Keyword spotting

In the next analysis, we evaluate our method on keyword spotting, where the goal is to accurately identify a limited set of words, i.e., the 20 words defined in Table 2 of [10] and used by SpeechYolo. The evaluation metric here is Term Weighted Value (TWV) [3], defined as follows:

$$\mathrm{TWV}(\theta) = 1 - \frac{1}{K} \sum_{k=1}^{K} \left[ P_{miss}(k, \theta) + \beta\, P_{FA}(k, \theta) \right]$$

where $k \in \{1, \dots, K\}$ refers to the keywords, $P_{miss}(k, \theta)$ is the probability of missing the $k$-th keyword, and $P_{FA}(k, \theta)$ is the probability of falsely detecting the $k$-th keyword in 1 second of audio data. Here, $\beta$ is a constant that severely penalizes the TWV score for high false alarm rates. Table 4 compares our keyword spotting performance with SpeechYolo, where the threshold $\theta$ is tuned for each keyword such that the maximum TWV value (MTWV) is achieved. As seen, our method matches or outperforms SpeechYolo on both partitions.
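As a small worked example, the TWV score can be computed as below. The default weight of 999.9 is the value commonly used in the NIST 2006 spoken term detection evaluation and is an assumption here, since the constant used in the paper is not shown in this text. The miss and false-alarm probabilities are made up.

```python
def twv(p_miss, p_fa, beta=999.9):
    """Term Weighted Value for per-keyword miss and false-alarm probabilities.

    beta defaults to the value commonly used in NIST STD 2006 (an assumption here).
    """
    if not p_miss or len(p_miss) != len(p_fa):
        raise ValueError("need one miss and one false-alarm probability per keyword")
    penalty = sum(miss + beta * fa for miss, fa in zip(p_miss, p_fa))
    return 1.0 - penalty / len(p_miss)


# Two hypothetical keywords: no false alarms, 20% and 10% misses.
print(twv([0.2, 0.1], [0.0, 0.0]))
# One hypothetical keyword: no misses, but a false-alarm probability of 0.001.
print(twv([0.0], [0.001]))
```

With a weight of this size, a false-alarm probability of only 0.001 wipes out almost the entire score, which is why the metric strongly favours conservative detectors.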

| Partition | SY [13] | Ours-L | Ours-S |
|---|---|---|---|
| test_clean | 0.74 | **0.80** | 0.74 |
| test_other | 0.38 | **0.64** | **0.53** |

Table 4: MTWV scores for 20-keyword spotting application. Bold numbers are the ones that outperform SpeechYolo.

6 Conclusion

This paper proposes a solution to limited-vocabulary word detection and localization in speech data. We devise model components that ensure a high precision, recall, and localization score. We then define the loss functions required to train the model components. We showed in our experiments that, compared to existing work, our model is more accurate, smaller in size, and capable of processing arbitrary length audio.


References

  • [1] R. Alvarez and H. Park (2019) End-to-end streaming keyword spotting. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6336–6340.
  • [2] Y. Chung, Y. Zhang, W. Han, C. Chiu, J. Qin, R. Pang, and Y. Wu (2021) W2v-BERT: combining contrastive learning and masked language modeling for self-supervised speech pre-training. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 244–250.
  • [3] J. G. Fiscus, J. Ajot, J. S. Garofolo, and G. Doddington (2007) Results of the 2006 spoken term detection evaluation. In Proc. SIGIR, Vol. 7, pp. 51–57.
  • [4] A. Graves (2012) Connectionist temporal classification. In Supervised sequence labelling with recurrent neural networks, pp. 61–93.
  • [5] B. Kim, S. Chang, J. Lee, and D. Sung (2021) Broadcasted residual learning for efficient keyword spotting. arXiv preprint arXiv:2106.04140.
  • [6] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [7] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In European conference on computer vision, pp. 21–37.
  • [8] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger (2017) Montreal forced aligner: trainable text-speech alignment using Kaldi. pp. 498–502.
  • [9] A. Neubeck and L. Van Gool (2006) Efficient non-maximum suppression. In 18th International Conference on Pattern Recognition (ICPR’06), Vol. 3, pp. 850–855.
  • [10] D. Palaz, G. Synnaeve, and R. Collobert (2016) Jointly learning to locate and classify words using convolutional networks. In Interspeech, pp. 2741–2745.
  • [11] J. Redmon and A. Farhadi (2018) YOLOv3: an incremental improvement. arXiv.
  • [12] O. Rybakov, N. Kononenko, N. Subrahmanya, M. Visontai, and S. Laurenzo (2020) Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720.
  • [13] Y. Segal, T. S. Fuchs, and J. Keshet (2019) SpeechYOLO: detection and localization of speech objects. Proc. Interspeech 2019, pp. 4210–4214.
  • [14] A. Shrivastava, A. Kundu, C. Dhir, D. Naik, and O. Tuzel (2021) Optimize what matters: training DNN-HMM keyword spotting model using end metric. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4000–4004.
  • [15] R. Tang, K. Kumar, J. Xin, P. Vyas, W. Li, G. Yang, Y. Mao, C. Murray, and J. Lin (2022) Temporal early exiting for streaming speech commands recognition. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7567–7571.
  • [16] R. Tang and J. Lin (2018) Deep residual learning for small-footprint keyword spotting. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5484–5488.
  • [17] S. Venkatesh, D. Moffat, and E. R. Miranda (2022) You only hear once: a YOLO-like algorithm for audio segmentation and sound event detection. Applied Sciences 12 (7), pp. 3293.
  • [18] P. Warden (2018) Speech commands: a dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209.
  • [19] Q. Xu, A. Baevski, T. Likhomanenko, P. Tomasello, A. Conneau, R. Collobert, G. Synnaeve, and M. Auli (2021) Self-training and pre-training are complementary for speech recognition. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3030–3034.
  • [20] Y. Zhang, J. Qin, D. S. Park, W. Han, C. Chiu, R. Pang, Q. V. Le, and Y. Wu (2020) Pushing the limits of semi-supervised learning for automatic speech recognition. arXiv preprint arXiv:2010.10504.
  • [21] X. Zhou, D. Wang, and P. Krähenbühl (2019) Objects as points. arXiv preprint arXiv:1904.07850.