Spoken term detection (STD) is a task designed for efficient keyword search (given a text query) in speech databases, and plays a central role in information management and speech retrieval [12, 4, 6, 13]. State-of-the-art STD approaches comprise two subsystems. The first is an automatic speech recognizer (ASR), which transcribes the spoken utterances into text. The transcriptions contain all the possibly recognized words with their corresponding posterior probabilities [5, 12, 13]. The posterior probability, a typical confidence measure, plays a central role in keyword search. The second subsystem is a keyword searcher, which returns the term detection results for each query term according to the decoded transcriptions. Formally, in STD applications, a confidence measure (CM) is defined to represent the reliability of each detected term occurrence, and is usually estimated by the recognizer [5, 12]. The final term detection results are then obtained by thresholding the confidence measure. However, when only limited training resources are available for building the ASR system, the accuracy of the recognizer and the reliability of the confidence measure are relatively low, which makes it difficult to find correct query results in the speech database.
This paper focuses on the calculation of confidence measures for STD once the speech recognizer has been built. In this situation, a one-pass retrieval candidate set can be obtained for each query. Each candidate contains the location information of the term occurrence and the corresponding confidence measure. The baseline system of this paper can then be evaluated on this set directly by applying standard score normalization and making the final decision [11, 13]. Several recent efforts have attempted to improve the reliability of term occurrences and have achieved some improvements on STD tasks. In [10, 9], the confidence measure of a query occurrence is re-estimated based on context consistency information. A two-stage cascaded machine learning approach has been proposed for rescoring keyword search outputs for low-resource languages. A modified logistic regression strategy has been proposed for term detection optimization. A discriminative score normalization method was introduced to normalize confidence measures through discriminative modeling. Moreover, another method employs extra acoustic features to obtain a better confidence measure.
Clustering and latent topic models have also gained improvements over traditional vector space models for information retrieval (IR) [21, 2]. Besides, the well-known PageRank algorithm considers the hyperlinks between every two pages and computes a converged importance score for each page. Inspired by these works, this paper proposes to integrate document ranking information into the calculation of confidence measures of term occurrences for spoken term detection. The document ranking information is defined as the topic relevance between a document and the query term. For each query term, some documents tend to be more related to it because they share a similar topic. When the accuracy of STD results is examined, those topic-related documents tend to contain more correct hits. In this paper, this information is quantized as a ranking weight for each document. Based on the one-pass retrieval candidates for a specific query term, we first sum up the confidence measures of all term occurrences in each document. The document ranking weights are then estimated by normalizing these sums and are further integrated into the original confidence measures through linear interpolation. Experiments on three standard STD tasks demonstrate the effectiveness of the proposed method.
The rest of this paper is organized as follows. Section 2 describes related work. Section 3 presents the proposed algorithm for confidence measure calculation. Sections 4 and 5 give the experimental setup and the results on three standard STD tasks. Finally, Section 6 concludes the paper.
2 Related Work
Several other works have attempted to utilize long-term contexts for STD. Term detection performance has been improved by exploiting word burstiness in spoken conversational corpora. More recently, [22, 17] took advantage of word repetition to improve spoken term detection, having observed the phenomenon of word repetition within single documents. They leveraged the burstiness of keywords by taking the most confident keyword hypothesis in each document and interpolating it with lower-scoring hits. Although they designed an effective method to determine the interpolation coefficients in their experiments, they focused on intra-document term repetition and did not consider inter-document contexts, e.g., the document ranking information used in this paper. Another closely related work also gives high priority to the candidate segments that are included in highly ranked documents. However, it calculates position-dependent document weights recursively. This paper calculates document ranking weights in a simpler way and considers inter-document ranking information. Specifically, we rank all documents in the speech database according to their relevance to a specific query term and incorporate this document ranking information into the calculation of confidence measures.
3 Proposed Method
Algorithm 1: Estimation of document ranking weights for a query term
Input: The set of one-pass retrieval candidates for a given query term.
Output: The document ranking weights for all documents in the database.
Document clustering: Group the hypothesized occurrences of the term by document and sum the confidence measures of all occurrences in each document. This sum can be viewed as the occurrence possibility of the term in that document. The maximum sum for the term is obtained by traversing all the documents.
Document ranking: Calculate the ranking weight of each document using the “relative-to-max” method, i.e., by dividing the sum of the document by the maximum sum.
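One possible formalization of the two steps of Algorithm 1 is sketched below. The notation is assumed here for illustration and is not necessarily the paper's own: q is the query term, d a document, I(q, d) the set of hypothesized occurrences of q located in d, and c_base(q, i, d) the baseline confidence measure of the i-th occurrence.

```latex
% Assumed notation (illustrative, not taken from the original equations):
%   q = query term, d = document, i = occurrence index,
%   I(q,d) = occurrences of q located in d,
%   c_base(q,i,d) = baseline confidence of the i-th occurrence of q in d.
\begin{align}
S(q, d) &= \sum_{i \in I(q, d)} c_{\mathrm{base}}(q, i, d) \\
S_{\max}(q) &= \max_{d} \, S(q, d) \\
w(q, d) &= \frac{S(q, d)}{S_{\max}(q)}
\end{align}
```

Under this reading, and assuming positive confidence measures, the top-ranked document for a term receives weight 1, every other document with at least one candidate receives a weight between 0 and 1, and a document without any candidate receives weight 0.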
For an input query term, a set of one-pass retrieval candidates in the speech database is first generated following the conventional STD approach. Each detected term occurrence typically carries location information and a confidence measure; the location information usually includes the name (or ID) of the document in which the occurrence is located, its start time and its duration. Each detection occurrence of a term is thus indexed by the term and by its position in the candidate list, and its location information determines the document it belongs to. The confidence measure of an occurrence is accordingly indexed by the term, the occurrence and the document. We use the subscript “base” to emphasize that this measure is obtained from the one-pass retrieval candidate set. The confidence measure is designed to describe the reliability of a detected term occurrence, i.e., a correct query hit is expected to have a high confidence measure. However, when the ASR subsystem performs poorly, there may be many false alarms with high confidence measures as well as correct candidates with low confidence measures.
Based on the idea described in the introduction, we propose to use document ranking information to improve the calculation of confidence measures. The procedure for estimating the document ranking weights for an input term is described in Algorithm 1. After the document ranking weights have been calculated, we re-estimate the confidence measure of each occurrence by combining the original measure with the ranking weight of the document it belongs to. In this work, linear interpolation is adopted: the re-estimated confidence measure is a weighted combination of the baseline confidence measure and the document ranking weight. The interpolation coefficient is the same for all query terms and can be tuned on a development set. In short, the confidence re-estimation algorithm can be divided into three steps: document clustering, document ranking and confidence re-estimation.
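The three steps (document clustering, document ranking and confidence re-estimation) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the iSTD implementation: the `Candidate` record, the function names and the example scores are hypothetical, and the coefficient `alpha` is assumed to multiply the document weight while `1 - alpha` multiplies the baseline confidence (the opposite convention is equally possible).

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Candidate(NamedTuple):
    """One hypothesized occurrence of a query term (hypothetical record layout)."""
    doc_id: str
    start: float     # start time in seconds
    duration: float  # duration in seconds
    cm_base: float   # baseline confidence from the one-pass retrieval


def document_ranking_weights(candidates: List[Candidate]) -> Dict[str, float]:
    """Sum baseline confidences per document, then divide by the largest sum."""
    sums: Dict[str, float] = defaultdict(float)
    for cand in candidates:
        sums[cand.doc_id] += cand.cm_base
    if not sums:
        return {}
    max_sum = max(sums.values())
    if max_sum <= 0.0:
        # Degenerate case: no positive evidence in any document.
        return {doc: 0.0 for doc in sums}
    return {doc: s / max_sum for doc, s in sums.items()}


def reestimate(candidates: List[Candidate], alpha: float) -> List[float]:
    """Linearly interpolate each baseline confidence with its document's weight.

    Assumption: alpha weights the document ranking weight and (1 - alpha)
    weights the baseline confidence.
    """
    weights = document_ranking_weights(candidates)
    return [(1.0 - alpha) * c.cm_base + alpha * weights[c.doc_id]
            for c in candidates]


# Made-up candidates for a single query term.
hits = [
    Candidate("doc_a", 1.0, 0.4, 0.9),
    Candidate("doc_a", 7.5, 0.4, 0.6),
    Candidate("doc_b", 3.2, 0.5, 0.5),
]
weights = document_ranking_weights(hits)
new_scores = reestimate(hits, alpha=0.1)
```

In this toy example, `doc_a` accumulates the largest sum and therefore gets weight 1, so both of its candidates are boosted more than the single candidate in `doc_b`.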
4 Experimental Setup
4.1 Data Set and Evaluation Condition
The experiments were conducted on three standard spoken term detection tasks: the STD 2006 English conversational telephone speech (CTS) evaluation set, and the OpenKWS 2013 Vietnamese and OpenKWS 2014 Tamil development sets (http://www.nist.gov/itl/iad/mig/openkws.cfm). The English CTS evaluation set included about 3 hours of speech, and the keyword set consisted of 411 keywords. The Vietnamese and Tamil development sets each included about 10 hours of speech. The evaluation keyword set for Vietnamese consisted of 4065 keywords, of which the 901 keywords appearing in the development set were used in our experiments. For the Tamil task, we used the kwlist3 keyword set supplied by IBM, which consisted of 2375 keywords. The intention of using three tasks was to evaluate the proposed algorithm on three very different languages, with different ASR accuracies, different amounts of training data and different keyword set sizes. The evaluation criterion used in the experiments was the Actual Term Weighted Value (ATWV) defined by NIST, which is based on a cost function of the false alarm probability P(FA) and the miss probability P(Miss), averaged over a set of queries (http://www.itl.nist.gov/iad/mig/tests/std/2006/docs/std06-evalplan-v10.pdf).
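To make the evaluation criterion concrete, the following sketch computes a term-weighted value from per-term counts, following the definition in the NIST STD 2006 evaluation plan: one minus the average over terms of P(Miss) + β·P(FA), with β = 999.9 and the number of non-target trials taken as the duration of the speech in seconds minus the number of true occurrences. The function name, the input layout and the counts in the example are hypothetical; ATWV is this value computed at the system's actual decision threshold.

```python
from typing import Dict, Tuple

BETA = 999.9  # cost/value trade-off used in the NIST STD 2006 evaluation plan


def term_weighted_value(counts: Dict[str, Tuple[int, int, int]],
                        speech_seconds: float,
                        beta: float = BETA) -> float:
    """Compute TWV from per-term counts.

    counts maps each term to (n_true, n_correct, n_false_alarm), i.e. the
    number of reference occurrences, correct detections and spurious
    detections after the system's hard decisions.
    """
    total_cost = 0.0
    n_terms = 0
    for n_true, n_correct, n_false_alarm in counts.values():
        if n_true == 0:
            continue  # terms without reference occurrences are not scored
        p_miss = 1.0 - n_correct / n_true
        p_fa = n_false_alarm / (speech_seconds - n_true)
        total_cost += p_miss + beta * p_fa
        n_terms += 1
    if n_terms == 0:
        raise ValueError("no term has a reference occurrence")
    return 1.0 - total_cost / n_terms


# Made-up counts for two terms in 10000 seconds of speech.
example_counts = {"term_a": (10, 8, 1), "term_b": (4, 4, 0)}
score = term_weighted_value(example_counts, speech_seconds=10000.0)
```

Because β is large, a single false alarm costs a term roughly as much as missing one tenth of its occurrences in this example, which is why reliable confidence measures matter for this metric.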
4.2 Automatic Speech Recognizer
Our ASR engines were built using DNN-HMM based acoustic modeling, which is the state-of-the-art approach for speech recognition.
For the English task, 309 hours of Switchboard speech were used to train the acoustic model, and the transcriptions of these speech files were used to train a 3-gram language model. The cross entropy criterion was used to train the DNN models. The word accuracy (ACC) of the ASR system on the evaluation set was 77.67%.
For the Vietnamese recognizer, two approaches were adopted to prevent over-fitting in DNN training, since the training corpus contains only about 70 hours of speech. The first approach was cross-lingual training, where a DNN model trained on 1000 hours of Chinese CTS data was used to initialize the Vietnamese DNN parameters. The second was to replace the sigmoid function in the DNN model with the rectified linear unit (ReLU) activation function. The transcripts of the Vietnamese training files were then used to train a 2-gram language model. A word ACC of 45.76% was achieved on the development set. The strategy employed for the Tamil ASR engine was similar to that used for Vietnamese. The only difference was that the sequence training algorithm was applied in the DNN training for Tamil. A word ACC of 31.03% was achieved on the development set.
4.3 STD Indexer and Keyword Searcher
We designed a toolkit named iSTD to build our keyword search subsystem for STD. We followed the work in [12, 13] to construct the inverted index based on confusion networks. The term occurrence candidates were then found by keyword searching on the inverted index. The confidence re-estimation algorithm proposed in this paper was also integrated into this toolkit.
5 Experimental Results
5.1 Effectiveness of Document Ranking
To validate the rationale for applying document ranking information to STD tasks, we examined the relationship between term detection performance and document ranking positions. Here, the document ranking positions were derived by sorting all documents in descending order of the weights calculated following Algorithm 1. Figure 1 shows the correlation curves for the aforementioned Vietnamese STD task. The results were obtained by averaging over the 901 query keywords. The correlation curves reveal that documents with high ranking weights usually have high precision and recall of term detection.
In addition, we calculated the Spearman rank correlation coefficient between the two performance measures of term detection (precision and recall) and the document ranking weights on the three STD tasks. The results are given in Table 1 and show the existence of high correlations. All these results indicate that the document ranking information is strongly correlated with STD performance and that it is reasonable to integrate it into the calculation of confidence measures for term detection.
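Table 1: Spearman rank correlation between term detection performance and document ranking weights (columns: STD Task, Spearman Correlation).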
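For readers unfamiliar with the statistic, the Spearman rank correlation is the Pearson correlation of the rank-transformed values. The sketch below implements it in plain Python with average ranks for ties; the function names and the numbers in the example are made up for illustration and are not the values behind Table 1.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of the rank-transformed inputs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Made-up document ranking weights and per-document precision values.
doc_weights = [1.0, 0.8, 0.5, 0.2, 0.1]
doc_precision = [0.9, 0.7, 0.75, 0.3, 0.1]
rho = spearman(doc_weights, doc_precision)
```

Only the ordering of the values matters, so the coefficient is insensitive to the scale of the confidence sums used to build the ranking weights.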
5.2 Results of Tuning Interpolation Coefficients
The interpolation coefficient in Eq. (4) controls the balance between the document weights and the baseline confidence measures for a specific query term. To explore its practical effects, the ATWVs on the development set of the Vietnamese STD task for different interpolation coefficients are depicted in Fig. 2. We can see that a reasonable choice for the coefficient lies within the range 0.05 to 0.4. In the next section, experimental results are presented for the different tasks, where the coefficient was tuned on the development set and set to 0.05, 0.1 and 0.15 for Tamil, Vietnamese and English, respectively.
5.3 Results of STD Tasks
We compared the proposed confidence measure re-estimation algorithm with the baseline system on the three STD tasks. The baseline system directly adopted the ASR posterior score as the confidence measure for each query term. A keyword-specific threshold was applied in all systems as the final decision method. Experimental results are listed in Table 2. We can see that the proposed confidence re-estimation approach achieves consistent improvements on all three speech retrieval tasks. Considering the amounts of training data available in these three tasks, the results in Table 2 also indicate that the proposed confidence re-estimation method is neither language-dependent nor sensitive to the amount of training resources.
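The exact thresholding rule is not spelled out above. A commonly used keyword-specific threshold for TWV-style metrics, assumed here purely for illustration, estimates the number of true occurrences of a keyword as the sum of the confidences of its candidates and accepts a candidate only if doing so increases the expected TWV. The function name and the example values are hypothetical.

```python
def keyword_specific_threshold(posterior_sum: float,
                               speech_seconds: float,
                               beta: float = 999.9) -> float:
    """Confidence above which accepting a candidate raises the expected TWV.

    A candidate with confidence p is worth p / n_true in expected hits and
    costs beta * (1 - p) / (speech_seconds - n_true) in expected false alarms.
    Setting the two equal and solving for p, with n_true estimated by
    posterior_sum, gives the threshold below.
    """
    if posterior_sum <= 0.0:
        raise ValueError("posterior_sum must be positive")
    n_true_est = posterior_sum
    return beta * n_true_est / (speech_seconds + (beta - 1.0) * n_true_est)


# Made-up example: candidates of one keyword sum to 2.0 in 10 hours of speech.
threshold = keyword_specific_threshold(posterior_sum=2.0, speech_seconds=36000.0)
```

Under this rule, rare keywords get a low threshold and frequent keywords a higher one, so any change to the confidence measures, such as the interpolation with document ranking weights, also changes which candidates survive the final decision.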
6 Conclusion
This paper has presented an algorithm to improve the calculation of confidence measures for spoken term detection. Inspired by the PageRank algorithm and by the application of language models in text information retrieval, we propose to integrate document ranking information into the calculation of confidence measures for term occurrences. The document ranking information indicates the topic relevance between each document and the query term, and topic-related documents are expected to contain more correct hits. Experiments on three standard STD tasks demonstrate that introducing document ranking information in this way is effective.
References
[1] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1):107–117, 1998.
[2] B. Chen. Latent topic modelling of word co-occurence information for spoken document retrieval. In Proc. ICASSP, pages 3961–3964. IEEE, 2009.
[3] J. Chiu and A. I. Rudnicky. Using conversational word bursts in spoken term detection. In Proc. INTERSPEECH, pages 2247–2251, 2013.
[4] J. G. Fiscus, J. Ajot, J. S. Garofolo, and G. Doddington. Results of the 2006 spoken term detection evaluation. In Proc. SIGIR, volume 7, pages 51–57, 2007.
[5] H. Jiang. Confidence measures for speech recognition: A survey. Speech Communication, 45(4):455–470, 2005.
[6] J. Kohler, M. Larson, F. de Jong, W. Kraaij, and R. Ordelman. Spoken content retrieval: Searching spontaneous conversational speech. In ACM SIGIR Forum, volume 42, pages 66–75. ACM, 2008.
[7] K. Konno, Y. Itoh, K. Kojima, M. Ishigame, K. Tanaka, and S.-w. Lee. High priority in highly ranked documents in spoken term detection. In Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2013 Asia-Pacific, pages 1–4. IEEE, 2013.
[8] H.-y. Lee, P.-w. Chou, and L.-s. Lee. Improved open-vocabulary spoken content retrieval with word and subword lattices using acoustic feature similarity. Computer Speech & Language, 2014.
[9] H.-y. Lee, T.-w. Tu, C.-P. Chen, C.-y. Huang, and L.-s. Lee. Improved spoken term detection using support vector machines based on lattice context consistency. In Proc. ICASSP, pages 5648–5651, 2011.
[10] H. Li, J. Han, T. Zheng, and G. Zheng. A novel confidence measure based on context consistency for spoken term detection. In Proc. INTERSPEECH, 2012.
[11] J. Mamou, J. Cui, X. Cui, M. J. F. Gales, B. Kingsbury, K. Knill, L. Mangu, D. Nolden, M. Picheny, B. Ramabhadran, R. Schlüter, A. Sethy, and P. C. Woodland. System combination and score normalization for spoken term detection. In Proc. ICASSP, pages 8272–8276, 2013.
[12] J. Mamou, B. Ramabhadran, and O. Siohan. Vocabulary independent spoken term detection. In Proc. SIGIR, pages 615–622. ACM, 2007.
[13] L. Mangu, B. Kingsbury, H. Soltau, H.-K. Kuo, and M. Picheny. Efficient spoken term detection using confusion networks. In Proc. ICASSP, pages 7844–7848, 2014.
[14] V. T. Pham, H. Xu, N. F. Chen, S. Sivadas, B. P. Lim, E. S. Chng, and H. Li. Discriminative score normalization for keyword search decision. In Proc. ICASSP, pages 7078–7082, 2014.
[15] J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In Proc. SIGIR, pages 275–281. ACM, 1998.
[16] Y. Wang and F. Metze. An in-depth comparison of keyword specific thresholding and sum-to-one score normalization. In Proc. INTERSPEECH, 2014.
[17] J. Richards, M. Ma, and A. Rosenberg. Using word burst analysis to rescore keyword search candidates on low-resource languages. In Proc. ICASSP, pages 7824–7828, 2014.
[18] F. Seide, G. Li, and D. Yu. Conversational speech transcription using context-dependent deep neural networks. In Proc. INTERSPEECH, pages 437–440, 2011.
[19] V. Soto, L. Mangu, A. Rosenberg, and J. Hirschberg. A comparison of multiple methods for rescoring keyword search lists for low resource languages. In Proc. INTERSPEECH, 2014.
[20] J. van Hout, L. Ferrer, D. Vergyri, N. Scheffer, Y. Lei, V. Mitra, and S. Wegmann. Calibration and multiple system fusion for spoken term detection using linear logistic regression. In Proc. ICASSP, pages 7188–7192, 2014.
[21] X. Wei and W. B. Croft. LDA-based document models for ad-hoc retrieval. In Proc. SIGIR, pages 178–185. ACM, 2006.
[22] J. Wintrode and S. Khudanpur. Can you repeat that? Using word repetition to improve spoken term detection. In Proc. ACL, pages 1316–1325. Association for Computational Linguistics, 2014.
[23] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems (TOIS), 22(2):179–214, 2004.