Query-by-example spoken term detection (QbE-STD) is defined as the task of detecting all files in an audio archive that contain a spoken query provided by a user (see Figure 1). It enables users to search through multilingual audio archives using their own speech. The primary difference from keyword spotting is that QbE-STD relies on spoken queries instead of textual queries, making it a language independent task. In general, the queries and test utterances are produced by different speakers in different languages, under varying acoustic conditions, and without constraints on vocabulary, pronunciation lexicon, accents etc. Thus, the search relies only on the acoustic data of the query and test utterances, with no language specific resources, making it a zero-resource task. It is essentially a pattern matching problem in the context of speech data, where the targeted pattern is the information represented by the speech signal and given to the system as a spoken query.
A QbE-STD system has great application in searching through multimedia content produced by news agencies, radio broadcast channels, the internet, social media etc. This content is massive and is generally produced by a large, diverse group of people in many different languages. Search through this data still relies on its textual description, which may not always be available or may be insufficient to represent the complete content. Therefore, text based retrieval algorithms give very limited search results. Moreover, it is desirable to search through such content using speech as a natural and generic medium of communication.
State-of-the-art QbE-STD systems primarily rely on DTW based template matching techniques to find the spoken queries in test utterances. This approach involves the following two steps: (i) extraction of suitable feature vectors from the queries and test utterances, and (ii) using those features to estimate the likelihood of the query occurring somewhere in the test utterance as a sub-sequence. Spectral features [16, 6], posterior features (posterior probability vectors for phone or phone-like units) [36, 27] as well as bottleneck features (representations obtained from the bottleneck layer of a neural network) [33, 9] have been used for this task. The matching likelihood is generally obtained using a dynamic time warping (DTW) algorithm on the frame-level similarity matrix computed from the feature vectors of the query and each audio document. Several variants of DTW have been proposed to deal with the sub-sequence detection problem: segmental DTW [36, 16], slope-constrained DTW, sub-sequence DTW, subspace-regularized DTW [21, 23] etc.
In previous work, we proposed to cast the template matching problem as binary classification of images. Feature vectors from the spoken query and test utterance are used to compute frame-level similarities in a matrix form. This matrix contains a quasi-diagonal pattern if the query occurs in the test utterance. A convolutional neural network (CNN) based classifier is trained to identify this pattern and make a decision about the occurrence of the query. This approach was shown to perform significantly better than the best DTW based system using a concatenation of multiple monolingual phone posteriors.
In this work, we use bottleneck feature representations instead of posterior features, as they have been shown to perform better with DTW based matching. The monolingual features used in those cases suffer from the language mismatch problem during DNN based feature extraction. To deal with this problem, we train multilingual networks aimed at obtaining language independent representations. These multilingual bottleneck features are used for both DTW and CNN based matching. Finally, we integrate the representation learning and CNN based matching to jointly train and further improve QbE-STD performance. The different components of this system are implemented separately to analyze their performance before building the end-to-end system. The contributions of this paper are summarized in the following:
Representation Learning (Section III): In contrast to using several language dependent bottleneck features for QbE-STD, we propose to train multilingual bottleneck networks to estimate language independent representations of the query and test utterances. This is achieved by using the multitask learning principle to jointly classify phones from multiple languages, so that the shared network learns a language independent representation. These representations are employed to estimate the query detection likelihood using both DTW (Section IV) and CNN based matching.
CNN based Matching (Section V): The DTW based template matching is applied on a frame-level similarity matrix computed from the feature vectors of the query and the test utterance to estimate the likelihood score of occurrence. Unlike DTW, we view the similarity matrix as an image and propose to approach the QbE-STD problem as an image classification task. We observe that the similarity matrix contains a quasi-diagonal pattern if the query occurs in the test utterance. Otherwise, no such pattern is observed. Thus for each spoken query, a test utterance can be categorized as an example of positive or negative class depending on whether the query occurs in it or not.
End to End QbE-STD System (Section VI): The proposed neural network based end-to-end system takes spectral features (MFCC) corresponding to a query and a test utterance as input, and the output indicates whether the query occurs in the test utterance. It has three components: (i) Feature extraction, (ii) Similarity matrix computation and (iii) CNN based matching, combined into one architecture for end-to-end training. The feature extractor aims at obtaining language independent representation to produce better score for similarity matrix which in turn improves the CNN based matching.
The proposed end-to-end QbE-STD system has the following advantages over the baseline DTW based approach: (i) the CNN based matching provides a learning framework for the problem, (ii) the CNN considers the whole similarity matrix at once to find a pattern, whereas the DTW algorithm takes localized decisions on the similarity matrix to find a warping path, (iii) the CNN based matching introduces a discrimination capability into the system, and (iv) the end-to-end training enables joint optimization of the representation learning and the matching network.
The proposed methods are evaluated on the SWS 2013 database, and their generalization ability is analyzed on the QUESST 2014 database, as described in Section VIII. The significant improvements obtained using these approaches show the importance of a learning framework for QbE-STD. Finally, we present the conclusions in Section IX.
II Prior Work
We summarize various approaches for spoken query detection in this section. Most of the successful approaches can be grouped into a category called template matching, consisting of two primary steps: (i) feature extraction and (ii) matching likelihood computation. Generally, suitable feature vectors are estimated from both the spoken queries and test utterances before computing the matching likelihood between them using some variation of a dynamic programming algorithm. Spectral features like Mel frequency cepstral coefficients (MFCC) or perceptual linear prediction (PLP) have been used with limited success. Posterior features estimated from Gaussian mixture models (GMM) as well as deep neural networks (DNN) [27, 11] yield better performance. The GMMs are generally trained in an unsupervised way, where the output indicates posterior probabilities of the different Gaussian components in the model [16, 36]. On the other hand, the DNNs are trained in a supervised manner using labeled data from several well resourced languages, and the outputs can be posteriors of monophones, context dependent phones or senones [10, 27]. The output of the DNN is considered an instantaneous characterization of the speech signal irrespective of the input language. Enhanced phone posteriors and phonological posteriors have also been used as speech representations [24, 3]. Both supervised and unsupervised bottleneck features from DNNs have been used for query detection [33, 9, 8].
The features extracted from the spoken query and test utterance are used to compute a frame-level distance matrix (using a suitable distance metric, e.g. Euclidean, cosine etc). A dynamic time warping (DTW) algorithm can be used to find the least cost path through this distance matrix to determine a frame-level mapping between the query and test utterance, and the accumulated cost indicates the degree of match. However, this standard DTW performs matching between two complete temporal sequences, making it unsuitable for the sub-sequence matching required in our case. Segmental DTW [16, 36] deals with this problem by constraining the warping path to a predefined window. But it cannot handle utterances with large speaking rate variation, which is addressed by slope-constrained DTW. It penalizes the slope of the warping path by limiting the number of frame mappings between the query and the test utterance. In sub-sequence DTW, the algorithm forces the cost of insertion at the beginning and end of the query to be 0, thus allowing the warping path to start and end at any frame of the test utterance, which gives a sub-sequence matching the spoken query.
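The sub-sequence idea above can be sketched in a few lines of dynamic programming: a minimal illustration (not the exact algorithm of any cited system) in which the warping path is free to start and end at any utterance frame, while every query frame must be matched.

```python
# Sub-sequence DTW sketch: the warping path may start and end at any
# column (utterance frame), so only the query frames must be fully
# consumed. `dist` is an m x n frame-level distance matrix
# (query frames x utterance frames).

def subsequence_dtw(dist):
    m, n = len(dist), len(dist[0])
    INF = float("inf")
    # Free start: the first query frame may align with any utterance frame.
    prev = list(dist[0])
    for i in range(1, m):
        cur = [INF] * n
        for j in range(n):
            best = prev[j]                       # vertical step
            if j > 0:
                best = min(best, prev[j - 1],    # diagonal step
                           cur[j - 1])           # horizontal step
            cur[j] = dist[i][j] + best
        prev = cur
    # Free end: pick the cheapest ending column for the last query frame.
    end = min(range(n), key=lambda j: prev[j])
    return prev[end], end
```

For a 3-frame query whose frames match columns 2–4 of a 5-frame utterance exactly, the accumulated cost is 0 and the path ends at column 4.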
As an alternative to DTW, subspace modeling of queries has been used to compute frame-level scores for faster detection [22, 21]. These subspace scores have also been used to regularize the distance matrix for DTW to boost performance [23, 21]. The template matching can also be performed with a convolutional neural network (CNN) that takes the distance matrix as an input image to find the warping path, achieving higher accuracy. Additionally, the problem of acoustic and speaker mismatch has been mitigated using model based approaches. These methods use hidden Markov models (HMM) to model acoustic units which are derived in an unsupervised manner. The queries and test utterances are represented using those HMMs, and symbolic search techniques are used to retrieve test utterances containing the query [6, 14].
III Representation Learning
In this section, we discuss the different monolingual and multilingual bottleneck features used for spoken query detection. Bottleneck features are low-dimensional representations of data, generally obtained from a hidden bottleneck layer of a DNN [35, 34, 33]. This layer has fewer hidden units than the other layers, which constrains the information flow through the network during training. This forces the network to focus on the information from the data that is essential for minimizing the final loss function. In the following, we present the DNN architectures used to obtain different types of bottleneck features.
III-A Monolingual Neural Network
We train DNNs for phone classification in five languages to estimate five distinct monolingual bottleneck features. The DNN architecture consists of 3 fully connected layers of 1024 neurons each, followed by a linear bottleneck layer of 32 neurons, and a fully connected layer of 1024 neurons. The final layer feeds into an output layer whose size corresponds to the number of classes (e.g. phones) of the respective language. The architecture is presented in Figure 2.
The monolingual bottleneck features have previously been shown to provide good performance for this task. Here, we analyze their performance and further train multilingual networks to estimate better features for QbE-STD.
III-B Multilingual Neural Network
Multilingual neural networks have been studied in the context of ASR in order to obtain language independent representations of the speech signal. These networks are trained using the multitask learning principle, which aims at exploiting similarities across tasks, resulting in improved learning efficiency compared to training each task separately. Generally, the network architecture consists of a shared part and several task-dependent parts. To obtain multilingual bottleneck features, we model phone classification for each language as a different task; thus we have a language independent part and a language dependent part. The language independent part is composed of the first layers of the network, which are shared by all languages, forcing the network to learn common characteristics. The language dependent part consists of the output layers (marked in orange in Figure 2), and enables the network to learn the particular characteristics of each language.
In this work, we train two different multilingual networks, using 3 and 5 languages respectively, in order to analyze the effect of training with additional languages. The architectures of these networks are presented in Figure 2 and described in the following.
Multilingual (3 languages): this architecture consists of 4 fully connected layers having 1024 neurons each, followed by a linear bottleneck layer of 32 neurons. Then, a fully connected layer of 1024 neurons feeds to 3 output layers corresponding to the different training languages. The 3 output layers are language dependent while the rest of the layers are shared among the languages.
Multilingual (5 languages): this architecture is similar to the previous one, except that it uses an additional fully connected layer of 1024 neurons and two extra output layers corresponding to the 2 new languages. The increased number of layers is intended to model the extra training data gained by adding languages.
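The parameter sharing in these two networks can be made concrete with a rough count of fully connected weights. The layer sizes below follow the text (1024-unit hidden layers, 32-unit linear bottleneck, 507-dimensional MFCC-with-context input and the per-language output sizes, both stated in Section VII-B); this is an illustrative sketch, not the training code.

```python
# Rough parameter-count sketch for the shared-trunk multilingual
# networks: only the weights and biases of fully connected layers are
# counted. Layer sizes are taken from the paper's description.

def fc_params(sizes):
    """Parameters of a chain of fully connected layers (weights + biases)."""
    return sum(i * o + o for i, o in zip(sizes, sizes[1:]))

# Output-layer sizes per language (monophone states, Section VII-B).
HEADS = {"FR": 124, "GE": 133, "PT": 145, "ES": 130, "RU": 151}

def multilingual_params(langs, shared_hidden):
    # Shared trunk: hidden layers -> 32-unit bottleneck -> one more
    # 1024-unit layer feeding the language dependent output layers.
    trunk = [507] + [1024] * shared_hidden + [32, 1024]
    shared = fc_params(trunk)
    # One softmax output layer per training language.
    heads = sum(fc_params([1024, HEADS[l]]) for l in langs)
    return shared + heads

three = multilingual_params(["PT", "ES", "RU"], shared_hidden=4)
five = multilingual_params(["PT", "ES", "RU", "FR", "GE"], shared_hidden=5)
```

Because the trunk is shared, adding a language costs only one extra 1024-to-N output layer, which is why the 5-language network grows modestly despite the extra hidden layer.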
IV DTW based Template Matching
The trained neural networks discussed in the previous section are used to extract different types of bottleneck features for DTW based template matching. The features from the query examples are used to construct reference templates to be matched with the test utterances, as discussed below.
IV-A Query Template Construction
We construct a query template in two ways, depending on the number of examples available: (i) one, or (ii) more than one. In the first case, the feature vectors themselves constitute the reference template. In the other case, we construct an average template using the different examples of the same query. For this purpose, we select the example with the highest number of frames as the reference template and use DTW to obtain a frame-level mapping between the reference and the rest of the examples. The frames mapped together are averaged to compute the final template for matching [27, 7].
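A minimal sketch of this average-template construction, assuming plain DTW with symmetric steps and squared Euclidean frame distances (the exact step pattern and distance are implementation choices not fixed by the text):

```python
# Average-template sketch: align each extra example to the longest
# example with DTW, then average the frames that map together.
# Features are lists of per-frame vectors (lists of floats).

def frame_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dtw_path(ref, ex):
    m, n = len(ref), len(ex)
    INF = float("inf")
    cost = [[INF] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost[i][j] = frame_dist(ref[i - 1], ex[j - 1]) + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack the optimal path; ties prefer the diagonal step.
    path, i, j = [], m, n
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = {(i - 1, j - 1): cost[i - 1][j - 1],
                 (i - 1, j): cost[i - 1][j],
                 (i, j - 1): cost[i][j - 1]}
        i, j = min(steps, key=steps.get)
    return path[::-1]

def average_template(examples):
    ref = max(examples, key=len)          # longest example is the reference
    sums = [list(f) for f in ref]
    counts = [1] * len(ref)
    for ex in examples:
        if ex is ref:
            continue
        for i, j in dtw_path(ref, ex):    # frames mapped together
            for d in range(len(sums[i])):
                sums[i][d] += ex[j][d]
            counts[i] += 1
    return [[v / c for v in frame] for frame, c in zip(sums, counts)]
```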
IV-B Template Matching
The DTW algorithm proposed for query detection in prior work is used as our baseline system. It was the best performing system in the Spoken Web Search (SWS) task of the MediaEval 2013 challenge. The basic framework of the system is presented in Figure 3 and is briefly discussed below.
The features of the queries and test utterances are used to compute a frame-level distance matrix using cosine distance. A DTW algorithm (similar to slope-constrained DTW) is performed on this distance matrix to find the optimal cost path. The cost is normalized at each step by the partial path length, and constraints are imposed to let the warping path begin and end at any point in the test utterance. This gives a sub-sequence of the test utterance that optimally matches the query, along with the corresponding likelihood score. The resulting sub-sequences are filtered by length to reduce false alarms. The likelihood scores are compared with a predefined threshold to make the final decision.
V CNN based Matching
The DTW based template matching for query detection can be replaced with a CNN by casting the problem as a binary classification of images, where the images are similarity matrices between the queries and test utterances. In the following, we describe this method, including the process of image construction and our CNN architecture.
V-A Image Construction
The input to the CNN is composed of similarity matrices calculated between the queries and test utterances. These matrices exhibit a quasi-diagonal pattern in the regions where a query and test utterance match, caused by the high similarity values in such regions (see the yellow pattern in Fig 4). To calculate the similarity matrices, we first extract bottleneck features (Section III) from both spoken queries and test utterances using MFCC features as input. Let $Q = (q_1, \dots, q_m)$ and $T = (t_1, \dots, t_n)$ represent the features of a spoken query and a test utterance respectively, where $m$ and $n$ are the number of frames in each case. We compute the cosine similarity between two feature vectors $q_i$ and $t_j$ as follows:

$$s_{ij} = \frac{q_i^\top t_j}{\lVert q_i \rVert \, \lVert t_j \rVert}$$

Then, we apply a range normalization to constrain the values to the range $[0, 1]$.
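The similarity computation can be sketched directly; the min-max normalization to [0, 1] below is our reading of the range normalization step, not necessarily the paper's exact variant.

```python
# Similarity-matrix sketch: cosine similarity between every query frame
# and every utterance frame, then min-max normalization of the whole
# matrix (assumed target range: [0, 1]).
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def similarity_matrix(query, utterance):
    s = [[cosine(q, t) for t in utterance] for q in query]
    lo = min(min(row) for row in s)
    hi = max(max(row) for row in s)
    if hi == lo:                          # degenerate constant matrix
        return [[0.0 for _ in row] for row in s]
    return [[(v - lo) / (hi - lo) for v in row] for row in s]
```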
We define two categories of images: (i) positive class, when the query occurs in the utterance, and (ii) negative class otherwise. Figures 4 and 5 show examples of these classes. The vertical and horizontal axes represent the frames of the query and test utterance respectively. The strength of the values is shown with colors: yellow for high values and blue for low ones.
V-B CNN Architecture

Here we present the CNN architecture used to classify the similarity matrices defined in the previous section. The architecture is similar to a VGG network, which performs well in image classification tasks. It consists of a series of convolution and max-pooling layers with fixed filter sizes and a fixed number of feature maps for all layers, simplifying the hyperparameter selection process. We have a one-channel similarity matrix as input, instead of the three-channel RGB color images generally used in standard image classification tasks. The detailed architecture is described in Table I, where the convolution layers use ReLU as the activation function. The number of channels and the dropout were optimized to 30 and 0.1 respectively on a development set. The training label for the network indicates whether the query occurs in the test utterance corresponding to an input similarity matrix. The training data can be constructed from any pair of spoken queries and test utterances from any language with minimal supervision, as we only need to know whether a query occurs in the test utterance, without requiring a full transcription. Note that we also experimented with simpler architectures, expecting good performance due to the simplicity of the task; however, those networks with fewer layers failed to outperform the baseline system.
|Maxpool||Channel: in=1, out=1, Filter: 2x2, Stride: 2|
|Conv||Channel: in=1, out=30, Filter: 3x3, Stride: 1|
|Conv||Channel: in=30, out=30, Filter: 3x3, Stride: 1|
|Maxpool||Channel: in=30, out=30, Filter: 2x2, Stride: 2|
|Conv||Channel: in=30, out=30, Filter: 3x3, Stride: 1|
|Conv||Channel: in=30, out=30, Filter: 3x3, Stride: 1|
|Maxpool||Channel: in=30, out=30, Filter: 2x2, Stride: 2|
|Conv||Channel: in=30, out=30, Filter: 3x3, Stride: 1|
|Conv||Channel: in=30, out=30, Filter: 3x3, Stride: 1|
|Maxpool||Channel: in=30, out=30, Filter: 2x2, Stride: 2|
|Conv||Channel: in=30, out=30, Filter: 3x3, Stride: 1|
|Conv||Channel: in=30, out=15, Filter: 3x3, Stride: 1|
|Maxpool||Channel: in=15, out=15, Filter: 2x2, Stride: 2|
Conv: Convolution; FC: Fully connected; SM: Softmax
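The spatial dimensions implied by this stack can be checked with simple arithmetic, assuming the 3x3/stride-1 convolutions preserve spatial size ('same' padding, an assumption the table does not state), so that only the five 2x2/stride-2 maxpools shrink the input:

```python
# Spatial-size sketch for the Table I stack: the initial 2x2/stride-2
# maxpool plus four conv-conv-maxpool groups give five halvings (floor
# division). The 3x3/stride-1 convolutions are assumed size-preserving.

def pooled(h, w):
    return h // 2, w // 2

def table1_spatial_size(h, w):
    for _ in range(5):   # five 2x2/stride-2 maxpools in total
        h, w = pooled(h, w)
    return h, w
```

For a 100 x 800 input similarity matrix, this leaves a 3 x 25 feature map (with 15 channels after the last convolution) before the fully connected layers.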
The training of CNN for query detection poses the following two main challenges:
Variable size input: The CNN architecture discussed earlier requires fixed size input, but our similarity matrices have variable lengths and widths due to the varying durations of the corresponding spoken queries and test utterances. We solve this problem by fixing the size of all input matrices to a predetermined length and width (in our case 100×800). Bigger matrices are down-sampled by deleting rows and/or columns at regular intervals. Smaller matrices are enlarged by filling the gap with the lowest value of the corresponding similarity matrix. The down-sampling step does not severely affect the desired quasi-diagonal pattern, as the deleted rows and columns are spread throughout the similarity matrix. Also, we did not segment the test utterances into fixed size intervals to perform detection on each segment separately, as this requires the region of occurrence of a query in a test utterance as a ground-truth label, which is not available for QbE-STD.
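The resizing step above can be sketched as follows; the interval-based row/column deletion and minimum-value padding follow the description, while the exact sampling scheme is an illustrative choice:

```python
# Fixed-size input sketch: down-sample an oversized similarity matrix by
# keeping rows/columns at regular intervals, and pad an undersized one
# with the matrix's own minimum value (target size fixed, e.g. 100x800).

def fix_size(mat, target_h, target_w):
    pad = min(min(row) for row in mat)
    def fix(rows, t, make_pad):
        n = len(rows)
        if n >= t:
            # Keep t entries spread evenly; deletions stay distributed,
            # so the quasi-diagonal pattern is largely preserved.
            return [rows[i * n // t] for i in range(t)]
        return rows + [make_pad() for _ in range(t - n)]
    width = len(mat[0])
    mat = fix(mat, target_h, lambda: [pad] * width)
    cols = [list(c) for c in zip(*mat)]          # transpose to fix columns
    cols = fix(cols, target_w, lambda: [pad] * target_h)
    return [list(r) for r in zip(*cols)]         # transpose back
```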
Unbalanced data: The number of positive and negative examples is highly unbalanced for the query detection task (0.1% versus 99.9% in our training data), due to the very low frequency of occurrence of a given query in the test utterances. We address this problem by creating a balanced training set for each training epoch: we take all positive examples and randomly sample an equal number of negative examples from the corresponding set. We also considered using a weighted loss function for training; however, experiments showed that our strategy yields better performance.
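A per-epoch balancing routine in this spirit is only a few lines (a sketch; the real training code operates on query/utterance pair indices):

```python
# Balanced-epoch sketch: keep all positive pairs and draw an equal
# number of negatives at random, re-sampled independently every epoch.
import random

def balanced_epoch(positives, negatives, rng=random):
    epoch = list(positives) + rng.sample(list(negatives), len(positives))
    rng.shuffle(epoch)
    return epoch
```

Re-sampling the negatives each epoch means the network eventually sees many different negative pairs even though each epoch is balanced.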
VI End-to-End QbE-STD System
In this section, we propose a novel neural network based end-to-end architecture to perform QbE-STD. We combine the representation learning network with the CNN based matching network in one architecture, such that the input to the network is the MFCC features corresponding to a query and a test utterance, and the output indicates whether the query occurs in the test utterance. We discuss this architecture and the training procedure in the following sections.
VI-A Architecture

The end-to-end architecture has 3 components, as shown in Figure 6: (i) feature extraction, (ii) similarity matrix computation and (iii) CNN based matching. The feature extraction block is used to obtain a frame-level representation from MFCC features for both the query and the test utterance. The goal of this block is to obtain a language independent representation which produces better frame-level similarity scores for constructing the similarity matrix. This block can be implemented using a DNN, CNN or long short-term memory (LSTM) network; we use a DNN for this purpose.
We can use any of the 3 architectures presented in Section III as our feature extraction block. However, we observed that the multilingual network trained using 5 languages generates the best bottleneck features for query detection (see Section VIII). Thus we use this architecture as the feature extraction block for our end-to-end system. We use the language independent part of the network (the first 5 layers, up to the bottleneck layer) to extract features from both the query and the test utterance, which feed into the second block of our architecture.
The second block of our architecture computes a frame-level similarity matrix between the query and the test utterance using cosine similarity as described in Section V-A. This similarity matrix is input to the CNN to produce a matching score as discussed in Section V-B. This whole network is jointly optimized by training it in an end-to-end manner as discussed in the following section.
VI-B Training Challenges
The end-to-end network faces the same challenges as the CNN based matching network, due to the nature of the problem, as discussed in Section V-B. In addition, we do not have sufficient data to train this network from scratch. Thus, we use the principle of transfer learning to initialize the different blocks of this network with previously trained networks instead of random initialization. The CNN based matching block is initialized with the trained network from Section V, and the feature extraction block is initialized with the first 5 layers of the 5-language network presented in Section III-B. The weight matrices of the CNN based matching block can be frozen during training, so that the system only trains the feature extraction block. In this setting, the CNN based matching block can be viewed as a loss function for extracting better features. These feature vectors should produce more discriminative quasi-diagonal patterns (as discussed in Section V-A), as required to classify the positive examples from the negative ones.
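The effect of freezing the matching block can be illustrated with a toy parameter update, where frozen groups are simply excluded from the gradient step (a conceptual sketch, not the actual optimizer used in the experiments):

```python
# Freezing sketch: parameter groups named in `frozen` keep their
# pre-trained values; only the remaining groups (here, the feature
# extractor) move against their gradients.

def sgd_step(params, grads, lr, frozen=()):
    return {name: (w if name in frozen else w - lr * grads[name])
            for name, w in params.items()}
```

With `frozen=("matcher",)`, the matching block stays fixed and acts as a learned loss surface that shapes the feature extractor.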
VII Experimental Set-up
In this section, we describe the databases used to train and evaluate the different systems. Then, we discuss the training procedures for representation learning, CNN based matching and the end-to-end system. We also present the preprocessing steps used in the experiments and the evaluation metrics used to test and compare our systems.
VII-A Databases

We use the GlobalPhone database to train the monolingual as well as multilingual models presented in Section III. The QbE-STD experiments are performed on the Spoken Web Search (SWS) 2013 and Query by Example Search on Speech Task (QUESST) 2014 databases using DTW based template matching. Then, we use the SWS 2013 dataset to train the CNN based matching network as well as the end-to-end network and evaluate the corresponding models. We use the QUESST 2014 dataset to show the generalization ability of those models.
GlobalPhone Corpus: GlobalPhone is a multilingual speech database consisting of high quality recordings of read speech with corresponding transcriptions and pronunciation dictionaries in 20 different languages. It was designed to be uniform across languages in terms of audio quality (type of microphone, noise conditions, channel), collection scenario (task, setup, speaking style), phone set conventions (IPA-based naming of phones) etc. In this work, we use French (FR), German (GE), Portuguese (PT), Spanish (ES) and Russian (RU) to train monolingual as well as multilingual networks and estimate the corresponding bottleneck features for the QbE-STD experiments. We have on average 20 hours of training and 2 hours of development data per language.
Spoken Web Search (SWS) 2013: The SWS 2013 database is part of the MediaEval 2013 challenge for evaluating QbE-STD systems. It consists of speech data from 9 different low-resourced languages: Albanian, Basque, Czech, non-native English, Isixhosa, Isizulu, Romanian, Sepedi and Setswana. It was collected from different sources, with varying acoustic conditions and in different amounts from each language. This variety of data reduces the possibility of over-fitting. There are 505 queries in the development set and 503 queries in the evaluation set. The queries are categorized into 3 types depending on the number of examples available per query, as shown in Table II. The search space consists of 20 hours of audio comprising 10762 utterances.
|Query set||1 example||3 examples||10 examples|
|Development||311||100||94|
|Evaluation||310||100||93|
TABLE II: Number of different types of queries available in SWS 2013, partitioned according to the number of examples per query.
Query by Example Search on Speech Task (QUESST) 2014: The QUESST 2014 database is part of the MediaEval 2014 challenge, and we use it to evaluate the generalizability of the different approaches. It consists of 23 hours of speech data (12492 files) in 6 languages as the search corpus: Albanian, Basque, Czech, non-native English, Romanian and Slovak. The development and evaluation sets contain 560 and 555 queries respectively, which were recorded separately from the search corpus. We did not use this dataset for training or tuning our models. Unlike the SWS 2013 dataset, all queries have only one example available. Three types of query occurrences are defined as a match in this dataset. Type 1: exact match of the lexical representation of a query (same as in SWS 2013); Type 2: slight lexical variations at the start or end of a query; and Type 3: multi-word query occurrence with the words in a different order or with filler content between them.
VII-B Bottleneck Feature Extraction
We use the Kaldi toolkit to extract MFCC features with corresponding 'delta' and 'delta-delta' coefficients, and to generate the target labels for training the different neural networks presented in Section III. MFCC features with a context of 6 frames (both left and right) constitute the input vector of size 507. The context size was optimized using the development queries of SWS 2013. The outputs are monophone states (also known as pdfs in Kaldi) corresponding to each language in the GlobalPhone corpus. These training labels are generated using a GMM-HMM based speech recognizer. The numbers of classes corresponding to French, German, Portuguese, Spanish and Russian are 124, 133, 145, 130 and 151 respectively. Note that we also trained these networks using senone classes; however, they perform worse than monophone based training.
We apply layer normalization before the linear transforms and use the rectified linear unit (ReLU) as non-linearity after each linear transform, except in the bottleneck layer, for both monolingual and multilingual networks. We train those networks with a batch size of 255 samples and a dropout of 0.1. In the case of multilingual training, we use an equal number of samples from each language under consideration. The Adam optimization algorithm is used to train all networks by optimizing the cross entropy loss. The learning rate is halved every time the development loss increases compared to the previous epoch, until a minimum value is reached. All networks were trained for 50 epochs.
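The learning-rate schedule can be sketched as a one-step rule; the floor value below is illustrative, since the paper's minimum rate (and initial rate) did not survive extraction:

```python
# Learning-rate schedule sketch: halve the rate whenever the development
# loss increases relative to the previous epoch, down to a floor.
# The floor of 1e-6 is an assumed placeholder, not the paper's value.

def next_lr(lr, dev_loss, prev_dev_loss, floor=1e-6):
    if prev_dev_loss is not None and dev_loss > prev_dev_loss:
        lr = max(lr / 2.0, floor)
    return lr
```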
We extract bottleneck features from these trained networks and apply speech activity detection (SAD) before using them for DTW as well as CNN based matching. The SAD relies on the silence and noise class posterior probabilities obtained from three different phone classifiers (Czech, Hungarian and Russian) trained on the SpeechDAT(E) database. These probabilities are averaged and compared with the remaining phone class probabilities to identify and remove the noisy frames. Audio files with fewer than 10 frames after SAD are not used in the detection experiments; however, they are still considered during evaluation [27, 21, 25].
VII-C CNN Training
The search space for QbE-STD in the SWS 2013 database is shared between the development and evaluation queries. The labels for these queries indicate whether a query occurs in a test utterance or not. There is no separate training set available, so we only have these queries to train our CNN. We split the 505 development queries into two sets of 495 and 10 queries for training and tuning the model respectively. Due to the multiple examples available for a subset of queries, we effectively have 1551 query examples. Our experiments are designed in this manner to follow the setup of the SWS 2013 task and enable a fair comparison.
We filter the queries and test utterances using the SAD discussed in the previous section to obtain 1488 × 10750 training example pairs. These include 24118 positive examples; the rest are negative. We balance the data for each training epoch following the strategy presented in Section V-B. We shuffle the training example pairs and use a batch size of 20 samples. We use the Adam optimization algorithm to optimize the cross entropy loss.
VII-D End-to-End Training
The training and development sets for the network presented in Section VI-A consist of the same pairs of queries and test utterances used to train the CNN in the previous section. The difference is that the CNN uses bottleneck features, whereas the end-to-end network uses the corresponding MFCC features. We first attempted to train the network by randomly initializing the weight matrices of the whole network; however, those models yielded very poor detection performance. This can be attributed to the limited training data as well as the complexity of the problem. Thus, we begin training by initializing the different blocks of the model with the corresponding pre-trained networks, as discussed in Section VI-B. To limit the number of trainable parameters, we progressively freeze the first few layers of the feature extraction block and train separate networks. In this end-to-end training, the frame-level speech activity detection (SAD) (as discussed in Section VII-B) is performed on the output of the feature extraction network before computing the similarity matrix. It is not applied to the MFCC features, in order to avoid discontinuities in the contextual input vectors.
VII-E Evaluation Metric
We use minimum normalized cross entropy (min Cnxe) as the primary metric and maximum term weighted value (MTWV) as the secondary metric to evaluate the performance of different systems. Cnxe quantifies the information that is not provided by the scores of a given system: a value of 0 indicates a perfect system, while a value of 1 shows a non-informative system. MTWV is computed by taking into account the miss and false alarm rates as well as the corresponding costs. We consider the cost of a false alarm (Cfa) to be 1 and the cost of a missed detection (Cmiss) to be 100. We also perform a one-tailed paired-samples t-test to assess the significance of the performance improvement in any comparison.
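For reference, a minimal sketch of the (uncalibrated) normalized cross entropy underlying this metric; the official scoring tools additionally search for the score calibration that minimizes it, which is omitted here:

```python
import math

def normalized_cross_entropy(scores, labels, p_tar=None):
    """Sketch of Cnxe: cross entropy of system posteriors w.r.t. the
    ground truth, normalized by the entropy of a prior-only system.
    scores are posteriors in (0, 1); labels are 1 (target) / 0 (non-target).
    The reported metric is min Cnxe, i.e. this value after an optimal
    affine calibration of the log-odds, which this sketch omits."""
    n_tar = sum(labels)
    n_non = len(labels) - n_tar
    if p_tar is None:
        p_tar = n_tar / len(labels)  # empirical target prior
    xe_tar = sum(-math.log2(s) for s, t in zip(scores, labels) if t) / n_tar
    xe_non = sum(-math.log2(1 - s) for s, t in zip(scores, labels) if not t) / n_non
    xe = p_tar * xe_tar + (1 - p_tar) * xe_non
    prior_entropy = -p_tar * math.log2(p_tar) - (1 - p_tar) * math.log2(1 - p_tar)
    return xe / prior_entropy
```

A system that always outputs the target prior scores exactly 1 (non-informative), while values approaching 0 indicate highly informative scores.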
VIII Experimental Analysis
We conducted extensive experiments to evaluate and compare the query detection performance of different systems presented in this paper: (i) DTW based Matching, (ii) CNN based Matching and (iii) End-to-End neural network model.
VIII-A DTW based Template Matching
We perform DTW based template matching using bottleneck features extracted from the monolingual and multilingual networks discussed in Section III and present their detection performance on SWS 2013 and QUESST 2014 databases.
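The template matching step can be sketched as a subsequence DTW, in which the query may start and end anywhere inside the utterance. This minimal version assumes a cosine frame distance; the exact local constraints and normalization may differ from our implementation:

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def subsequence_dtw(query, utterance):
    """Best length-normalized cost of aligning the WHOLE query to ANY
    sub-sequence of the utterance: the first row of the accumulated
    cost matrix is zero so the match may start at any frame, and the
    minimum over the last row allows it to end anywhere."""
    Q, U = len(query), len(utterance)
    INF = float("inf")
    D = [[0.0] * (U + 1)] + [[INF] * (U + 1) for _ in range(Q)]
    for i in range(1, Q + 1):
        for j in range(1, U + 1):
            d = cosine_distance(query[i - 1], utterance[j - 1])
            D[i][j] = d + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return min(D[Q][1:]) / Q
```

A low cost indicates that some stretch of the utterance closely matches the query; the negated cost can serve as a detection score.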
VIII-A1 Performance on SWS 2013
[Table III: Training | Single Example | Multiple Examples]
We consider two cases, depending on the number of examples per query, to evaluate the different bottleneck features for QbE-STD. In the case of a single example per query, the corresponding features constitute the template. With multiple examples per query, we compute an average template before performing the detection experiment, as discussed in Section IV-A. The min Cnxe and MTWV scores for query detection using both monolingual and multilingual bottleneck features are shown in Table III. We can see that the Portuguese (PT) features perform the best among the monolingual features, with the Spanish (ES) features a very close second.
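A minimal sketch of template averaging for the multiple-examples case: each additional example is aligned to the first one by a full DTW (a Euclidean frame distance is assumed here), and all frames aligned to a given reference frame are averaged. This is an illustration only, not the exact procedure of Section IV-A:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_path(ref, other):
    """Full DTW alignment between two feature sequences, returned as a
    list of (ref_index, other_index) pairs along the optimal path."""
    R, O = len(ref), len(other)
    INF = float("inf")
    D = [[INF] * (O + 1) for _ in range(R + 1)]
    D[0][0] = 0.0
    for i in range(1, R + 1):
        for j in range(1, O + 1):
            d = euclidean(ref[i - 1], other[j - 1])
            D[i][j] = d + min(D[i-1][j-1], D[i-1][j], D[i][j-1])
    path, i, j = [], R, O  # backtrack from the end of both sequences
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, (i, j) = min((D[i-1][j-1], (i-1, j-1)),
                        (D[i-1][j], (i-1, j)),
                        (D[i][j-1], (i, j-1)))
    path.reverse()
    return path

def average_template(examples):
    """Hypothetical sketch: align every example to the first one and
    average all frames aligned to each frame of the reference."""
    ref = examples[0]
    dim = len(ref[0])
    sums = [list(f) for f in ref]
    counts = [1] * len(ref)
    for other in examples[1:]:
        for i, j in dtw_path(ref, other):
            for k in range(dim):
                sums[i][k] += other[j][k]
            counts[i] += 1
    return [[s / c for s in frame] for frame, c in zip(sums, counts)]
```

The averaged template has the length of the first example; a more careful implementation might pick the example closest to all others as the reference.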
The 3-language and 5-language networks, as discussed in Section III-B, are trained using (PT, ES, RU) and (PT, ES, RU, FR, GE) respectively. The 3-language network uses the three best-performing monolingual training languages. The results in Table III show that both multilingual features perform significantly better than the best monolingual features. We also observe that the PT-ES-RU-FR-GE features significantly outperform the PT-ES-RU features, indicating that additional training languages provide better language-independent features.
VIII-A2 Performance on QUESST 2014
[Table IV: Training | T1 Queries | T2 Queries | T3 Queries]
We have only one example per query in the QUESST 2014 dataset, so the corresponding bottleneck features constitute the template. The dataset has three different types of queries, as discussed in Section VII-A. Similar to prior work on this dataset, we did not employ any specific strategies to deal with these different query types. The min Cnxe and MTWV scores corresponding to the different types of queries, using both monolingual and multilingual features, are shown in Table IV. We can see that the bottleneck features from Portuguese (PT) perform the best among monolingual features for all three types of queries. As with SWS 2013, the PT-ES-RU-FR-GE network performs better than the PT-ES-RU network, indicating that more training languages help in obtaining better features for DTW.
VIII-B CNN based Matching
We use the best-performing features (PT-ES-RU-FR-GE) from the previous set of experiments to train a CNN for matching queries and test utterances, and compare the performance of the two approaches.
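The input to the matching CNN is a frame-by-frame similarity image between the query and the test utterance, which the network classifies for the presence of a quasi-diagonal matching pattern. A minimal sketch, assuming cosine similarity between feature vectors (the exact similarity function may differ):

```python
import math

def similarity_matrix(query, utterance):
    """Similarity image with one row per query frame and one column per
    utterance frame; a true occurrence of the query shows up as a
    roughly diagonal band of high values."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    return [[cos(q, u) for u in utterance] for q in query]
```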
VIII-B1 Performance on SWS 2013
[Table V: System | Single Example | Multiple Examples]
We present the performance of CNN based matching and compare it with the corresponding DTW based matching in Table V. As with the DTW based system, we use template averaging to obtain the template for queries with multiple examples. This averaging is applied only at test time; the training samples are formed using a single example per query. We observe from Table V that CNN based matching performs significantly better in terms of min Cnxe for both the single and multiple examples per query cases, showing that the CNN produces scores that are more informative about the ground truth than the DTW.
VIII-B2 Performance on QUESST 2014
[Table VI: System | T1 Queries | T2 Queries | T3 Queries]
We use the model trained on SWS 2013 for testing on the QUESST 2014 evaluation set to analyze the generalizability of the CNN based matching system. We compare the performance of DTW and CNN based matching in Table VI. As discussed earlier, the dataset has three types of queries, and we do not apply any specific strategies to deal with them. We can clearly see that the CNN performs significantly better than DTW for all three types of queries. The performance gets progressively worse from Type 1 to Type 2 and from Type 2 to Type 3. This can be attributed to our system being trained only on queries from SWS 2013, which are similar to the Type 1 queries of QUESST 2014. However, the consistent performance improvement across all query types shows that the CNN based matching system generalizes to new datasets.
VIII-C End-to-End QbE-STD System
We utilize the bottleneck feature extractor and the CNN based matching network to construct the end-to-end QbE-STD system, as discussed in Section VI, and analyze its performance on both the SWS 2013 and QUESST 2014 databases. We also show that the CNN based matching network can be used as a loss function to obtain better features for DTW based template matching.
VIII-C1 Performance on SWS 2013
[Table VII: # of layers | Single Example | Multiple Examples]
We follow the procedure described in Section VII-D to train the end-to-end network using the SWS 2013 database. We freeze the first few layers of the feature extractor while keeping the rest of the network trainable, and show the corresponding results in Table VII. As in the previously presented systems, we use template averaging to obtain the template for queries with multiple examples. However, the template averaging is performed after the query examples are forward passed through the feature extractor. We can see from Table VII that the best performance is obtained by training all layers of the feature extractor. This shows that the problem of limited training data can be alleviated by pre-training different parts of the network before end-to-end training.
VIII-C2 Performance on QUESST 2014
[Table VIII: # of layers | T1 Queries | T2 Queries | T3 Queries]
The generalization ability of the models trained on SWS 2013 is evaluated using the QUESST 2014 database, and the results are presented in Table VIII. We observe that T1 queries perform best with the model trained with 2 frozen layers, whereas T2 and T3 queries perform best with the model trained with 1 frozen layer. This can be attributed to the models being trained on SWS 2013, which lets the network over-optimize for that database when fine-tuning all layers of the feature extractor.
VIII-C3 CNN based Matching as a Loss Function
In the end-to-end model, we can freeze the parameters of the CNN based matching network and treat it as a loss function for fine-tuning the feature extraction network. This loss function encourages the feature extractor to generate features which produce more discriminative similarity matrices for the CNN to classify. We use the features obtained after fine-tuning the network to perform DTW based matching and compare it with the best performance obtained using bottleneck features, as shown in Section VIII-A. Similar to the previous experiment, we progressively freeze different numbers of layers of the feature extractor; the results are presented in Table IX. We observe that the feature extractor retrained with 1 frozen layer gives the best results, which are significantly better than those of the bottleneck features, indicating the importance of the CNN based loss function.
[Table IX: # of layers | Single Example | Multiple Examples]
VIII-D System Comparisons
Here, we present a final comparison of different systems discussed in this work.
VIII-D1 min Cnxe and MTWV Scores
The comparisons corresponding to the SWS 2013 and QUESST 2014 databases are presented in Tables V and VI respectively. We observe that CNN based matching performs significantly better than DTW based matching in both metrics for QUESST 2014, whereas for SWS 2013 the improvement is observed only in terms of min Cnxe. The end-to-end system performs significantly better than the other systems on both databases and in both metrics.
VIII-D2 DET Curves
We present the same system comparison using DET curves in Figures 7 and 8 respectively. For the SWS 2013 database, we compare the performance using a single example per query, and for the QUESST 2014 database, we compare T1 query performance. On both databases, CNN based matching and the end-to-end system perform better than DTW based matching, except at very low false alarm rates.
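A DET curve plots the miss rate against the false alarm rate as the detection threshold sweeps over the scores; the operating points can be computed with a short sketch like the following (`det_points` is a hypothetical helper, not the official scoring tool):

```python
def det_points(scores, labels):
    """Sweep the decision threshold from high to low and record the
    (miss rate, false alarm rate) operating point after each score.
    labels are 1 (target trial) / 0 (non-target trial)."""
    pairs = sorted(zip(scores, labels), reverse=True)
    n_tar = sum(labels)
    n_non = len(labels) - n_tar
    points, hits, fas = [], 0, 0
    for score, is_target in pairs:
        if is_target:
            hits += 1
        else:
            fas += 1
        points.append(((n_tar - hits) / n_tar, fas / n_non))
    return points
```

In practice both axes are drawn on a normal-deviate scale, which is what makes well-calibrated systems appear as near-straight lines.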
VIII-D3 Language Specific Performance
We compare the language-specific query performance in Figures 9 and 10 respectively. For the SWS 2013 database, the experiments are performed using a single example per query. This comparison shows that the performance of CNN based matching and the end-to-end system is worse than that of DTW based matching for ‘Isixhosa’, ‘Isizulu’, ‘Sepedi’ and ‘Setswana’, indicating that the performance gains are not uniform across languages. This is due to the considerably smaller amount of training data for those languages.
For the QUESST 2014 database, we compare T1 query performance. As with SWS 2013, non-uniform performance improvement is observed for queries of different languages. The performance is marginally worse only for ‘non-native English’ queries with the end-to-end system.
IX Conclusions
In this paper, we implemented several monolingual as well as multilingual neural networks to extract bottleneck features for QbE-STD and showed that more training languages give better performance. We implemented a CNN based matching approach for QbE-STD using those bottleneck features. It enables discriminative learning between the positive and negative classes, which DTW based matching systems lack, and gives significant improvement over the best DTW system with bottleneck features. We then proposed to integrate the bottleneck feature extractor with the CNN based matching network to provide an end-to-end learning framework for QbE-STD, which yields further improvement over the CNN based matching approach. Both CNN based matching and the end-to-end system generalize to other databases, giving significant improvements over DTW based matching. We also showed that the CNN matching block in the end-to-end system can be used as a loss function to obtain better language-independent features, which can be useful for other tasks, e.g. unsupervised unit discovery.
The research leading to these results has received funding from the Swiss NSF project on “Parsimonious Hierarchical Automatic Speech Recognition and Query Detection (PHASER-QUAD)”, grant agreement number 200020-169398.
References
- (2013) The Spoken Web Search task. In the MediaEval 2013 Workshop.
- (2014) Query-by-example spoken term detection evaluation on low-resource languages. In The 4th International Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU'14).
- (2018) Phonological posterior hashing for query by example spoken term detection. Proc. Interspeech 2018, pp. 2067–2071.
- (2016) Layer normalization. arXiv preprint arXiv:1607.06450.
- (1997) Multitask learning. Machine Learning 28 (1), pp. 41–75.
- (2013) Model-based unsupervised spoken term detection with spoken queries. IEEE Transactions on Audio, Speech, and Language Processing 21 (7), pp. 1330–1342.
- (2015) Query-by-example keyword spotting using long short-term memory networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- (2017) Multitask feature learning for low-resource query-by-example spoken term detection. IEEE Journal of Selected Topics in Signal Processing 11 (8), pp. 1329–1339.
- (2016) Unsupervised bottleneck features for low-resource query-by-example spoken term detection. In INTERSPEECH, pp. 923–927.
- (2009) Query-by-example spoken term detection using phonetic posteriorgram templates. In IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU 2009), pp. 421–426.
- (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Processing Magazine 29 (6), pp. 82–97.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
- (2012) A nonparametric Bayesian approach to acoustic model discovery. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, Volume 1, pp. 40–49.
- (2007) Information Retrieval for Music and Motion. Vol. 2, Springer.
- (2008) Unsupervised pattern discovery in speech. IEEE Transactions on Audio, Speech, and Language Processing 16 (1), pp. 186–197.
- (2017) PyTorch. [online] http://pytorch.org/
- (2000) SpeechDat(E): Eastern European telephone speech databases. In Proc. of XLDB 2000, Workshop on Very Large Telephone Speech Databases.
- (2011) The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.
- (1991) Direct transfer of learned information among neural networks. In AAAI, Vol. 91, pp. 584–589.
- (2018) Sparse subspace modeling for query by example spoken term detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (6), pp. 1130–1143.
- (2016) Subspace detection of DNN posterior probabilities via sparse representation for query by example spoken term detection. In Seventeenth Annual Conference of the International Speech Communication Association (INTERSPEECH).
- (2017) Subspace regularized dynamic time warping for spoken query detection. In Workshop on Signal Processing with Adaptive Sparse Structured Representations (SPARS).
- (2018) Phonetic subspace features for improved query by example spoken term detection. Speech Communication 103, pp. 27–36.
- (2018) CNN based query by example spoken term detection. In Nineteenth Annual Conference of the International Speech Communication Association (INTERSPEECH).
- (2013) MediaEval 2013 Spoken Web Search task: system performance measures. TR-2013-1, Department of Electricity and Electronics, University of the Basque Country.
- (2014) High-performance query-by-example spoken term detection on the SWS 2013 evaluation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7819–7823.
- (2014) GTTS-EHU systems for QUESST at MediaEval 2014. In MediaEval.
- (1978) Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing 26 (1), pp. 43–49.
- (2013) GlobalPhone: a multilingual text & speech database in 20 languages. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8126–8130.
- (2008) Phoneme recognition based on long temporal context. Ph.D. thesis, Faculty of Information Technology, BUT.
- (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- (2015) Coping with channel mismatch in query-by-example: BUT QUESST 2014. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5838–5842.
- (2012) The language-independent bottleneck features. In 2012 IEEE Spoken Language Technology Workshop (SLT), pp. 336–341.
- (2011) Improved bottleneck features using pretrained deep neural networks. In Twelfth Annual Conference of the International Speech Communication Association (INTERSPEECH).
- (2009) Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams. In IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU 2009), pp. 398–403.