Automatic speech recognition (ASR) offers a more natural form of human-machine interaction (HMI). Point of interest (POI) search by voice is a typical HMI scenario. Suppose you are driving and do not know how to reach your destination: you can give a voice command to a map app to set the destination and start navigation. However, although deep learning techniques have recently improved speech recognition accuracy by a large margin, recognizing local POI names remains challenging.
This paper focuses on Chinese POI recognition. In China, dialects vary from one geographical region to another. Although different dialects share some similarities, they differ markedly at the phonological level. As a result, an ASR system trained on many dialects simultaneously may fail to generalize well to each individual dialect.
The number of POI names in China is enormous. The total number of POI names to be searched in a navigation system usually exceeds 1,200 million. Since POI names follow a long-tailed distribution, a general language model (LM) models infrequent POI names poorly. Another difficulty for POI recognition is that many POI names are homophones, a problem that is especially serious in China.
In order to alleviate the above problems in POI recognition, geographical location information is used for both acoustic modeling and language modeling in this paper. The main contributions of this paper are as follows:
This paper proposes a geographical acoustic model (Geo-AM) to deal with the multi-dialect problem. Dialects are usually specific to geographical regions or social groups. Therefore, the Geo-AM encodes users’ geographical location into a dialect-specific vector as an additional input feature. Furthermore, the Geo-AM introduces multiple dialect-specific top layers, each of which corresponds to a dialect region. With dialect-specific top layers, the proposed Geo-AM can efficiently exploit geographical information while keeping flexibility for further optimization.
Generally, users are only interested in nearby POI names. Within a specific region, the total number of POI names is much smaller and there are fewer homophone POI names. Building on this observation, we build a group of geo-specific language models (Geo-LMs) and integrate them into the ASR system to improve the recognition accuracy of long-tailed and homophone POI names. During decoding, a specific Geo-LM is selected on demand according to the geographic location information attached to the user query. To further improve recognition accuracy, n-best rescoring is performed with a neural network LM combined with another group of Geo-LMs built for rescoring.
The rest of this paper is organized as follows. In Section 2, we discuss some related works on multi-dialect speech recognition and POI recognition. Section 3 describes the details of the baseline ASR system used in this paper. Geo-AM and Geo-LMs are described in Section 4 and Section 5 respectively. Section 6 shows the experimental results and analysis. Finally, Section 7 concludes this paper.
2 Related Works
Recently, there have been several attempts to solve the multi-dialect problem in speech recognition, which mainly fall into two categories: “multi-model” and “single-model”. In multi-model approaches, an individual AM is trained for each dialect when enough data is available for that dialect. When dialect-specific data is scarce, Huang et al. [7] and Chen et al. [3] jointly train a universal AM, which is then fine-tuned with dialect-specific data to obtain dialect-specific AMs. In single-model approaches, a single AM is trained to handle all dialects. Some of these approaches feed dialect-related features, such as i-vectors [5] or dialect information [11, 8, 17], into the AM. Others introduce multi-task learning into multi-dialect speech recognition; for example, Yang et al. [16] adopt dialect classification as the secondary learning task. Compared with multi-model approaches, single-model approaches are usually more efficient but less flexible.
In this paper, we first construct a “single-model” baseline like [11]. So that the component for a specific dialect can be further optimized when additional data becomes available, dialect-specific top layers are then introduced into the proposed model.
An effective way to improve the recognition accuracy of POI names is to use geo-location-dependent LMs [6, 14, 2, 15]. For each user, Stent et al. [14] train a Geo-LM dynamically using nearby POI names and combine it with a baseline LM before or during decoding. In [15], a class-based Geo-LM is constructed dynamically for each user according to the user’s geographic location, within a difference-LM-based weighted finite state transducer (WFST) system. All of the above approaches construct LMs or WFSTs on the fly according to users’ geographical locations, which is time consuming and makes it hard to incorporate a large number of POI names into a Geo-LM. Moreover, class-based Geo-LMs can only handle pre-defined grammars.
In this paper, a Geo-LM is pre-trained for each pre-defined region, and the Geo-LMs are dynamically combined with a baseline LM during decoding. In addition, prior work mainly integrates Geo-LMs into first-pass decoding. To further improve recognition accuracy, we integrate Geo-LMs into both first-pass decoding and n-best rescoring.
3 Baseline System
The baseline AM is trained with the lattice-free maximum mutual information (LF-MMI) [13] criterion. It consists of two CNN layers and three TDNN-OPGRU [4] blocks, which interleave TDNN and output-gate PGRU (OPGRU) layers. In addition, the SpecAugment [12] algorithm is used to improve the robustness of the AM.
The baseline LM is a word-level Kneser–Ney smoothed 5-gram model. For further improvement, a character-level Kneser–Ney smoothed 5-gram LM and a QRNN [1] model are used to rescore the n-best lists produced by first-pass decoding.
4 Geographical Acoustic Model
4.1 Dialect-specific Input Feature
From a linguistic perspective, China can be divided into several dialect regions. People from adjacent provinces usually have similar acoustic characteristics. Therefore, such provinces can be clustered into one dialect region.
The dialect-specific input feature is transformed by an affine layer before being added to the output of the TDNN or OPGRU layers. In this way, the proposed Geo-AM can use the additional dialect information during both training and inference.
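The mechanism described above can be sketched as follows. This is a minimal illustration only: the one-hot encoding of the region, the number of regions (`NUM_REGIONS`), the hidden size, and all function names are assumptions made for the example, not details taken from the paper.

```python
import random

NUM_REGIONS = 10  # assumed number of dialect regions
HIDDEN_DIM = 4    # assumed hidden size, kept small for illustration

random.seed(0)
# Hypothetical affine layer that projects the region vector to the hidden size.
W_GEO = [[random.uniform(-0.1, 0.1) for _ in range(HIDDEN_DIM)]
         for _ in range(NUM_REGIONS)]
B_GEO = [0.0] * HIDDEN_DIM


def one_hot_region(region_id):
    """Encode a 0-based dialect region index as a one-hot vector (assumed encoding)."""
    vec = [0.0] * NUM_REGIONS
    vec[region_id] = 1.0
    return vec


def affine(vec, weight, bias):
    """Return vec @ weight + bias for a row vector."""
    return [
        sum(v * weight[i][j] for i, v in enumerate(vec)) + bias[j]
        for j in range(len(bias))
    ]


def add_geo_feature(layer_out, region_id):
    """Add the affine-transformed region vector to every frame of a layer output."""
    geo = affine(one_hot_region(region_id), W_GEO, B_GEO)
    return [[x + g for x, g in zip(frame, geo)] for frame in layer_out]


# Stand-in for the output of a TDNN or OPGRU layer: 3 frames of zeros.
frames = [[0.0] * HIDDEN_DIM for _ in range(3)]
shifted = add_geo_feature(frames, region_id=2)
```

Because the same region vector is added to every frame, the feature acts as a region-dependent bias on the layer output.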
4.2 Dialect-specific Top Layer
The model described in Subsection 4.1 generally achieves gains over the baseline model owing to the additional supervised information. However, it is hard to improve the accuracy on a specific dialect while maintaining the performance on other dialects. This may be because dialect-specific and dialect-independent information are coupled together in that model. To make the Geo-AM more flexible, we introduce dialect-specific layers into it.
As found in [7, 3], the top layer can capture dialect information. We therefore introduce dialect-specific top layers into the proposed Geo-AM, as depicted in Figure 1. Each dialect has its own top layer, which is adapted from the model in Subsection 4.1. During adaptation training, only the top layer’s parameters are updated while all other parameters are kept fixed, which simplifies deployment.
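The idea of adapting only a region's top layer can be illustrated with a toy model. Everything below is an assumption made for the sketch: the class name `ToyGeoAM`, the layer sizes, the fixed weights, and the squared-error update, which stands in for the actual LF-MMI training.

```python
FEAT_DIM, NUM_OUTPUTS, NUM_REGIONS = 4, 3, 10  # illustrative sizes (assumed)


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class ToyGeoAM:
    """One shared layer followed by one top layer per dialect region."""

    def __init__(self):
        # Shared (dialect-independent) layer with fixed illustrative weights.
        self.shared = [[0.1 * (i + j + 1) for j in range(FEAT_DIM)]
                       for i in range(FEAT_DIM)]
        base_top = [[0.01 * (k - j) for j in range(FEAT_DIM)]
                    for k in range(NUM_OUTPUTS)]
        # Each region starts from its own copy of the same top layer.
        self.top = {r: [row[:] for row in base_top] for r in range(NUM_REGIONS)}

    def hidden(self, x):
        return [max(0.0, h) for h in matvec(self.shared, x)]  # ReLU

    def forward(self, x, region):
        return matvec(self.top[region], self.hidden(x))

    def adapt_top_layer(self, x, target, region, lr=0.1):
        """One squared-error gradient step on this region's top layer only."""
        hid = self.hidden(x)
        out = matvec(self.top[region], hid)
        for k, row in enumerate(self.top[region]):
            err = out[k] - target[k]
            for j in range(FEAT_DIM):
                row[j] -= lr * err * hid[j]


model = ToyGeoAM()
x = [1.0, 0.5, 0.25, 0.8]
other_before = model.forward(x, region=1)
own_before = model.forward(x, region=0)
shared_before = [row[:] for row in model.shared]
model.adapt_top_layer(x, target=[1.0, 0.0, 0.0], region=0)
```

After the update, the output for region 0 changes, while the shared layer and the outputs for every other region stay exactly as they were.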
5 Geographical Language Model
For simplicity, this paper divides China into 34 province-level regions. For each region, a word-level Geo-LM and a character-level Geo-LM are trained with the local POI names of that region.
5.1 Geo-LMs in first-pass decoding
The underlying ASR system uses a WFST-based decoder that follows the difference-LM principle. The decoding graph (Eq. (1)) is built from a transducer containing the HMM definitions, a transducer representing the context dependency, the lexicon, and a small LM consisting only of the unigrams and bigrams of the baseline 5-gram LM; this graph is composed on the fly with a difference LM. The difference LM (Eq. (2)) is formed from the score-negated version of the small LM, the 5-gram baseline LM, and a Geo-LM.
For each query, we first obtain the province in which the user is located through location-based services (LBS). The corresponding Geo-LM is then selected for on-the-fly composition according to Eq. (1) and Eq. (2). As a result, the probability of a word given its context in first-pass decoding (Eq. (3)) combines the probability from the baseline LM with the probability from the selected Geo-LM, with a scalar controlling the contribution of each LM.
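The province-based selection and LM combination can be sketched at the probability level. This is a simplified illustration, not the WFST implementation: the linear interpolation, the weight `lam`, the floor for unseen events, the fallback to the baseline LM, and all probabilities and POI names are assumptions invented for the example.

```python
# Toy conditional probabilities P(word | context); all values are made up.
BASELINE_LM = {("navigate to", "renmin park"): 0.020}
GEO_LMS = {
    "Guangdong": {("navigate to", "renmin park"): 0.001},
    "Beijing": {("navigate to", "renmin park"): 0.200},
}
FLOOR = 1e-8  # assumed floor probability for unseen events


def lm_prob(lm, context, word):
    """Look up P(word | context) in a toy LM, with a floor for unseen events."""
    return lm.get((context, word), FLOOR)


def first_pass_prob(word, context, province, lam=0.5):
    """Linearly interpolate the baseline LM with the province's Geo-LM.

    Falls back to the baseline LM alone when the province has no Geo-LM.
    """
    p_base = lm_prob(BASELINE_LM, context, word)
    geo_lm = GEO_LMS.get(province)
    if geo_lm is None:
        return p_base
    return (1.0 - lam) * p_base + lam * lm_prob(geo_lm, context, word)


p_beijing = first_pass_prob("renmin park", "navigate to", "Beijing")
p_guangdong = first_pass_prob("renmin park", "navigate to", "Guangdong")
```

In this toy setup the same POI name receives a higher probability for a user located in the province where it is locally frequent.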
5.2 Geo-LMs in n-best rescoring
To further improve the recognition accuracy of local POI names, a QRNN [1] model is used to rescore the n-best lists produced by first-pass decoding. A single neural model usually fails to model long-tailed POI names. Moreover, it is impractical to train a geographical neural model for each region because of data sparsity. Therefore, we also incorporate a group of character-level Geo-LMs into n-best rescoring. As in Eq. (3), the probability of a word in second-pass decoding (Eq. (4)) combines the probability from the character-level baseline LM, the probability from the QRNN model, and the probability from a character-level Geo-LM, with weights controlling the contribution of each LM.
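For orientation, the first-pass and second-pass combinations could take the form of linear interpolations. This is only one plausible reading: the symbols ($P_b$, $P_g$, $P_{cb}$, $P_q$, $P_{cg}$, $\lambda$, $\alpha$, $\beta$) and the linear form itself are assumptions introduced here for illustration.

```latex
\begin{align*}
P_{\text{first}}(w \mid h)  &= (1-\lambda)\,P_{b}(w \mid h) + \lambda\,P_{g}(w \mid h), \\
P_{\text{second}}(w \mid h) &= (1-\alpha-\beta)\,P_{cb}(w \mid h) + \alpha\,P_{q}(w \mid h) + \beta\,P_{cg}(w \mid h).
\end{align*}
```

Here $h$ is the context, $P_b$ and $P_g$ are the word-level baseline LM and Geo-LM, and $P_{cb}$, $P_q$, and $P_{cg}$ are the character-level baseline LM, the QRNN model, and the character-level Geo-LM. With $0 \le \lambda \le 1$, $\alpha, \beta \ge 0$, and $\alpha + \beta \le 1$, both combinations remain valid probability distributions.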
After the first-pass and second-pass probabilities have been obtained, the final probability of the word (Eq. (5)) is computed from the two together with a constant.
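A hypothesis-level sketch of n-best rescoring is given below. The log-linear combination, the constant `c`, the `lm_scale` factor, the `Hypothesis` fields, and all scores are assumptions chosen for illustration; they are not the paper's exact formula.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    am_logp: float           # acoustic log-score
    first_pass_logp: float   # LM log-probability from first-pass decoding
    second_pass_logp: float  # LM log-probability from the rescoring LMs


def final_lm_logp(hyp, c=0.5):
    """Combine first-pass and second-pass LM scores with a constant weight c."""
    return c * hyp.first_pass_logp + (1.0 - c) * hyp.second_pass_logp


def rescore(nbest, c=0.5, lm_scale=1.0):
    """Return the n-best list sorted by combined score, best hypothesis first."""
    return sorted(
        nbest,
        key=lambda h: h.am_logp + lm_scale * final_lm_logp(h, c),
        reverse=True,
    )


# Two homophone candidates with identical acoustic scores (made-up numbers).
nbest = [
    Hypothesis("poi name A", am_logp=-10.0, first_pass_logp=-5.0, second_pass_logp=-9.0),
    Hypothesis("poi name B", am_logp=-10.0, first_pass_logp=-6.0, second_pass_logp=-4.0),
]
best_first_pass = max(nbest, key=lambda h: h.am_logp + h.first_pass_logp)
best_rescored = rescore(nbest)[0]
```

In this example the first pass prefers hypothesis A, while the rescoring LMs favor hypothesis B strongly enough to move it to the top of the list.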
6 Experiments
We train AMs on hand-transcribed, anonymized utterances from our production services, including Tencent Map. The training data are collected from all regions of China and amount to about 20K hours. Only one fifth of the training data have region information. In all experiments, 40-dimensional PNCC [10] features are used as acoustic features.
Both the word-level baseline LM and the character-level baseline LM are trained with 1,200M POI names collected from Tencent Map. To limit the model size, the baseline LMs are trained with large cutoffs of 0-3-5-10-15. As a result, many long-tailed or infrequent POI names are excluded from the baseline LMs. For each province, a word-level Geo-LM and a character-level Geo-LM are trained with the local POI names of that province, collected from Tencent Map and the Internet. The amount of training data for the Geo-LMs varies from 30K to 12.6M POI names. The Geo-LMs are trained with small (standard) cutoffs of 0-2-2-2-2, which keeps more long-tailed POI names in the Geo-LMs. In addition, the QRNN model adopts an adaptive softmax output layer [9] to reduce computational complexity.
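The effect of count cutoffs on long-tailed names can be illustrated with a small sketch. It assumes that an n-gram is dropped when its count is at or below the cutoff for its order, which is one common convention; toolkits differ in the exact rule. The POI names and counts are invented, and only the first three orders of each cutoff setting are used.

```python
from collections import Counter


def count_ngrams(names, max_order):
    """Count all n-grams up to max_order over tokenized POI names."""
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for tokens in names:
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts


def apply_cutoffs(counts, cutoffs):
    """Keep n-grams whose count is strictly greater than the cutoff for their order."""
    return {
        n: {gram: c for gram, c in counts[n].items() if c > cutoffs[n - 1]}
        for n in counts
    }


# A frequent name seen 6 times and a long-tailed name seen 3 times (made up).
names = [["people", "park", "station"]] * 6 + [["willow", "lane", "clinic"]] * 3
counts = count_ngrams(names, max_order=3)

large = apply_cutoffs(counts, cutoffs=(0, 3, 5))  # first three orders of 0-3-5-10-15
small = apply_cutoffs(counts, cutoffs=(0, 2, 2))  # first three orders of 0-2-2-2-2
```

With the large cutoffs, the higher-order n-grams of the rare name are pruned and only its unigrams survive; with the small cutoffs they are all retained.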
| Region | Provinces | Training data (hours) |
|---|---|---|
| 2 | Sichuan, Chongqing, Guizhou | 345 |
| 4 | Heilongjiang, Jilin, Liaoning | 372 |
| 6 | Shanxi, Gansu, Shaanxi | 247 |
| 7 | Hunan, Hubei, Anhui | 396 |
| 8 | Yunnan, Guangxi, Fujian | 301 |
| 9 | Beijing, Tianjin, Hebei | 429 |
Both the development set and the test set are collected from our POI voice search product, Tencent Map. The development set consists of 13,350 utterances collected from users across the whole of China. The test set contains 15,205 utterances collected from users in the 10 provinces with the most traffic. The detailed data distribution of the development set and the test set is presented in Table 2.
The baseline AM (A0) is trained with all corpora and is geo-location independent. We obtain a Geo-AM, A1, by attaching the dialect-specific input feature to the baseline AM and fine-tuning it using the corpora with dialect information. We also tried directly fine-tuning A0 using the same corpora as A1, but this achieved no improvement. Table 3 shows that our model benefits from the dialect-specific input feature, with an overall CER reduction of 4.3%. Only a slight improvement is found in some regions (e.g., Region 5 and Region 6). We attribute this to the division of dialect regions and the distribution of the training data.
For further improvement, a stronger Geo-AM, A2, is obtained by introducing dialect-specific top layers into the model A1. The model A2 is initialized from A1. The dialect corpora are used only to train the dialect-specific top layers, while all other parameters are frozen. The results in Table 3 show that introducing dialect-specific top layers yields further gains. This indicates that the Geo-vector alone is not powerful enough to encode dialect information. In addition, dialect-specific top layers make it easy to improve the performance on a particular dialect, because each top layer can be trained individually.
To show A2’s advantage, we increase the amount of training data for dialect region 1 from 522 hours to 892 hours and train another two Geo-AMs, A1+ and A2+. A1+ is fine-tuned from A0 and A2+ is fine-tuned from A1. The results in Table 4 show that both A1+ and A2+ achieve better performance on dialect region 1 than A1 and A2. However, A1+ performs worse than A1 on several other dialect regions, whereas A2+ maintains the performance on the other dialect regions thanks to its dialect-specific top layers. As argued above, A2 gives us more flexibility, which is important in a real production service.
To further verify the relationship between the Geo-AM and the multi-dialect problem, we divide the development set into 4 subsets according to the level of accent and report the results in Table 5. The results suggest that the Geo-AM performs better on heavily accented utterances.
The results in Section 6.1 show that the proposed Geo-AM can alleviate the multi-dialect problem to some extent. However, it cannot deal with long-tailed and homophone POI names. Therefore, we adopt Geo-LMs in first-pass decoding as described in Section 5. Detailed results on the development set are presented in Table 6. To evaluate whether Geo-LMs are effective in all provinces, Table 6 provides the overall results as well as the results for the top-5 provinces with the most traffic (Guangdong, Henan, Shandong, Jiangsu, Zhejiang) and the tail-5 provinces with the least traffic (Gansu, Hainan, Ningxia, Xizang, Qinghai). The results show that Geo-LMs significantly improve the recognition accuracy of local POI names in both top and tail provinces.
Rescoring the n-best lists produced by first-pass decoding with a neural LM generally provides further improvements. We use a QRNN model and a character-level 5-gram model to rescore the n-best lists as in Eq. (4) but without Geo-LMs. The results are shown in the fourth column of Table 6. Rescoring the n-best lists reduces the CER on 4 top provinces and 1 tail province but increases the CER on 1 top province and 3 tail provinces. This is probably because neither the QRNN model nor the character-level 5-gram model uses geographical information. Therefore, we also integrate a group of Geo-LMs into n-best rescoring as in Eq. (4). The results are shown in the last column of Table 6. They show that using Geo-LMs in second-pass decoding further improves the recognition accuracy of local POI names in both top and tail provinces.
Finally, we evaluate the proposed Geo-AMs and Geo-LMs on the test set. The results are consistent with those on the development set, and details are shown in Table 7. Together, the proposed Geo-AM and Geo-LMs achieve an 18.7% relative CER reduction.
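For readers unfamiliar with the metric, the character error rate (CER) and a relative CER reduction can be computed as in the sketch below. The function names and example strings are illustrative assumptions; the numbers are not taken from the paper's tables.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def cer(refs, hyps):
    """Total character errors divided by total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return errors / chars


def relative_reduction(baseline_cer, improved_cer):
    """Relative CER reduction of an improved system over a baseline."""
    return (baseline_cer - improved_cer) / baseline_cer


example_cer = cer(["abcd"], ["abxd"])          # one substitution in four characters
example_gain = relative_reduction(0.10, 0.08)  # CER drops from 10% to 8%
```

A drop in CER from 10% to 8% is a 2% absolute reduction but a 20% relative reduction, which is why the two figures should not be compared directly.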
7 Conclusion
Speech recognition of local POI names remains a challenging task owing to the multi-dialect problem and the massive number of long-tailed POI names. This paper proposes a Geo-AM that addresses the multi-dialect problem by combining a dialect-specific input feature with dialect-specific top layers. To improve the recognition accuracy of long-tailed POI names, a group of Geo-LMs is integrated into both first-pass decoding and n-best rescoring. Experiments show that the proposed Geo-AM can indeed alleviate the accent problem and achieves a 6.5% to 10.1% relative CER reduction on test sets of different accent levels. Together, the proposed Geo-AM and Geo-LMs achieve an 18.7% relative CER reduction on the POI voice search task of Tencent Map. In addition, introducing Geo-LMs into n-best rescoring yields much better results.
References
[1] (2016) Quasi-recurrent neural networks. arXiv preprint arXiv:1611.01576.
[2] (2015) Geo-location for voice search language modeling. In Sixteenth Annual Conference of the International Speech Communication Association.
[3] (2015) Improving deep neural networks based multi-accent Mandarin speech recognition using i-vectors and accent-specific top layer. In Sixteenth Annual Conference of the International Speech Communication Association.
[4] (2018) Output-gate projected gated recurrent unit for speech recognition. In Interspeech, pp. 1793–1797.
[5] (2010) Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing 19 (4), pp. 788–798.
[6] (2009) Language modeling for what-with-where on GOOG-411. In Tenth Annual Conference of the International Speech Communication Association.
[7] (2014) Multi-accent deep neural network acoustic model with accent-specific top layer using the KLD-regularized model adaptation. In Fifteenth Annual Conference of the International Speech Communication Association.
[8] (2018) Improved accented speech recognition using accent embeddings and multi-task learning. In Interspeech, pp. 2454–2458.
[9] (2017) Efficient softmax approximation for GPUs. In International Conference on Machine Learning, pp. 1302–1310.
[10] (2016) Power-normalized cepstral coefficients (PNCC) for robust speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (7), pp. 1315–1329.
[11] (2018) Multi-dialect speech recognition with a single sequence-to-sequence model. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4749–4753.
[12] (2019) SpecAugment: a simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779.
[13] (2016) Purely sequence-trained neural networks for ASR based on lattice-free MMI. In Interspeech, pp. 2751–2755.
[14] (2009) Geo-centric language models for local business voice search. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 389–396.
[15] (2018) Geographic language models for automatic speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6124–6128.
[16] (2018) Joint modeling of accents and acoustics for multi-accent speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5.
[17] (2019) A highly adaptive acoustic model for accurate multi-dialect speech recognition. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5716–5720.