Multilingual speech recognition is an important feature of modern speech recognition systems, allowing users to speak in more than a single, preset language. In Google's multilingual speech recognition service [Gonzalez2015Multilingual], users can select two or more languages simultaneously as prior information (Fig. 1). When the microphone is enabled, the system runs several speech recognizers in parallel, along with an acoustic language identification (LangID) module [wan2019]. After the system decides the language of the utterance, the recognition result of the corresponding language is used, and the language decision can be propagated to downstream systems (e.g., a text-to-speech module). In our previous work [Gonzalez2015Multilingual], the final language identification decision is predominantly taken by the acoustic LangID module, which generates a probability score for each language based on the audio. Such an approach, however, ignores other potentially useful information returned by the individual speech recognizers, such as the accumulated language or acoustic model score.
In this work, we explore a few alternative approaches to improve the language identification accuracy by using signals that can be easily computed by most speech recognition systems, including the confidence, acoustic and language model scores computed by the recognizers (Table 1).
Without signal combination, an acoustic LangID model provides a baseline with error rate. Using a lattice-based ensemble model [Fard2016], we were able to reduce the classification error rate to , a relative reduction. We continued exploring new methods and found that a deep neural network model outperforms the lattice-based model: the error rate was further reduced to , which is a relative reduction from the original baseline.
Section 2 reviews related work. Section 3 presents our method, with a lattice-based ensemble model and a deep neural network-based improvement; we also explore deep learning methods that work well for the signal combination problem. Section 4 shows the experimental results. Finally, Section 5 concludes the paper.
2 Related Work
Early work on spoken language identification traces back to 1996 [Zissman1996Comparison]. Later approaches include [Yang2007Tokenization, Campbell2008covariance, Ma2007Discriminative, Campbell2007Lattices, Zhu2008Discriminative] and random-forest-based models [Wang2008forest]. More recently, a LangID system is typically an N-class acoustic classifier that generates a score for each language [fer2015multilingual, snyder2018spoken, martinez2011language, wan2019]. So far, the most successful systems are neural-network-based [snyder2018spoken, wan2019]; they have been shown to surpass the performance of the more traditional i-vector embeddings [martinez2011language].
Combining signals from other systems in addition to acoustics, such as text-based features and language model features [Sefara2016Text, Wang2019CNN], is a good way to further improve LangID accuracy. In this work, we tried both lattice-based and neural network methods to combine text-based semantic features and acoustic features to improve the accuracy of language identification. In our experiments, the best performance is achieved by our carefully designed neural network model.
|Signal Name|Signal Description|
| |The probability of the utterance among 79 different target languages [wan2019]. The inputs are log-mel-filterbank features of the speech sequence, and the output is the likelihood of each language.|
| |The cost of the acoustic model from the recognizer.|
| |The cost of the language model from the recognizer.|
| |The confidence score of the recognition result.|
| |The Levenshtein distance among every pair of hypotheses in the N-best list (For ). The distance is computed at the character level using normalized hypotheses.|
| |The language identification [zhang-etal-2018-fast-compact] score based on the recognized text.|
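As an illustration of the N-best Levenshtein signal in Table 1, the following is a minimal sketch of a character-level edit distance computed over every pair of hypotheses. How the pairwise distances are aggregated into a single value is not specified in the table; taking the mean is an assumption made here, the function names are hypothetical, and the hypotheses are assumed to have been text-normalized beforehand.

```python
from itertools import combinations
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance; insert, delete, and substitute each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def mean_pairwise_distance(n_best: Sequence[str]) -> float:
    """Mean edit distance over all hypothesis pairs (assumed aggregation)."""
    pairs = list(combinations(n_best, 2))
    if not pairs:
        return 0.0
    return sum(levenshtein(x, y) for x, y in pairs) / len(pairs)
```

Intuitively, an N-best list whose hypotheses differ a lot from each other suggests that the recognizer of that language is unsure about the utterance.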
We reduce the signal combination problem to binary classification by pairing languages. This section discusses the problem transformation, the data representation, and the classification models.
3.1 Problem Formulation
For each of the language candidates, the upstream systems generate six signals, as shown in Table 1 (one from the acoustic LangID model and five generated from recognizer results). We pair different languages and concatenate the signals of the two languages in each pair, which turns the signal combination problem into a binary classification problem with 12 inputs. The goal is to build a classifier that outputs the probability that the first language of the pair is preferred over the second.
The classifier satisfies a symmetry constraint: it should always generate opposite scores when the two languages are exchanged. When there are more than two language candidates, we pair all of them with each other and assign a score to each language. The language with the maximum score is selected as the outcome. While there are other ways to rank all of the languages, turning the problem into binary classification allows us to focus on the two-language scenario, as there are more bilingual users than multilingual speakers.
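The pairing-and-ranking step can be sketched as follows. This is only an illustration under stated assumptions: the per-language score is taken to be the sum of the pairwise probabilities in which that language appears first, and `rank_languages` and `toy_classifier` are hypothetical names, with the toy classifier standing in for a trained model.

```python
import math
from itertools import permutations


def rank_languages(signals, pairwise_prob):
    """signals: dict mapping language -> list of six signal values.

    pairwise_prob: callable taking the 12 concatenated inputs and returning
    the probability that the first language is preferred over the second.
    """
    scores = {lang: 0.0 for lang in signals}
    for a, b in permutations(signals, 2):
        features = list(signals[a]) + list(signals[b])  # 6 + 6 = 12 inputs
        scores[a] += pairwise_prob(features)            # assumed aggregation: sum
    best = max(scores, key=scores.get)
    return best, scores


def toy_classifier(features):
    """Stand-in model: compares only the first signal of each side."""
    return 1.0 / (1.0 + math.exp(-(features[0] - features[6])))


candidates = {
    "en": [0.7, 0.0, 0.0, 0.0, 0.0, 0.0],
    "es": [0.2, 0.0, 0.0, 0.0, 0.0, 0.0],
    "fr": [0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
}
best, scores = rank_languages(candidates, toy_classifier)
```

Because the toy classifier is symmetric, each unordered pair contributes exactly 1 to the total, so the scores of the three candidates sum to 3.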
We generated the training and testing datasets by randomly picking anonymized queries from Google Voice Search in 20 different languages. Training and evaluation data was collected from monolingual traffic, from which we can infer the language labels in an unsupervised fashion. We pair the signals from the correct language with signals from other languages, and we set the label to match the correct language. Our dataset for the experiments contains samples, which we divided into training and testing sets at ratio; we split the dataset by the original utterance to prevent crosstalk between training and testing datasets.
Following the setting of our previous work [wan2018], we use the average pairwise accuracy among all the language pairs to measure the system performance. To reduce the computational cost of running evaluations on all the possible language pairs, in this paper we considered the performance metric to be the average pairwise accuracy among the top 15 language pairs as determined by the volume of traffic.
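A minimal sketch of the metric, assuming it is a macro average in which every language pair counts equally regardless of its number of samples. The record layout and the name `average_pairwise_accuracy` are hypothetical; passing `pairs` restricts the average to a chosen subset, such as the highest-traffic pairs.

```python
from collections import defaultdict


def average_pairwise_accuracy(records, pairs=None):
    """records: iterable of (lang_a, lang_b, predicted_lang, true_lang)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang_a, lang_b, predicted, truth in records:
        pair = frozenset((lang_a, lang_b))  # unordered: (en, es) equals (es, en)
        total[pair] += 1
        correct[pair] += int(predicted == truth)
    selected = [frozenset(p) for p in pairs] if pairs is not None else list(total)
    accuracies = [correct[p] / total[p] for p in selected if total[p] > 0]
    if not accuracies:
        raise ValueError("no samples found for the selected language pairs")
    return sum(accuracies) / len(accuracies)


records = [
    ("en", "es", "en", "en"),
    ("es", "en", "es", "es"),
    ("en", "es", "es", "en"),  # wrong prediction
    ("en", "es", "en", "en"),
    ("en", "fr", "fr", "fr"),
    ("en", "fr", "en", "fr"),  # wrong prediction
]
overall = average_pairwise_accuracy(records)
en_es_only = average_pairwise_accuracy(records, pairs=[("en", "es")])
```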
Fig. 2 illustrates the dataset preparation process. We run these samples through the acoustic LangID model (described in detail in [wan2019]), as well as through the recognizers of the different languages, to generate all the necessary signals.
Missing Features Multiple speech recognizers run in parallel along with the acoustic LangID model. Not all the language recognizers finish before the deadline for making a decision, so the dataset contains missing features. Some samples have all of the signals from both recognizers; we refer to these samples as “both” data. Some samples have all of the signals from one of the recognizers but only the acoustic LangID scores on the other side; we refer to these samples as “either” data. Samples with only acoustic LangID scores on both sides are referred to as “neither” data. Among the entire dataset, are “both” data, are “either” data, and are “neither” data.
For the “either” dataset, we found a label bias towards the language with the recognizer outputs. Among the “either” dataset, about of training samples have a label that matches the side with recognizer outputs, which is off from the expectation. While whether the recognizer finishes in time is an important indication for classification, we decided to model the bias in other parts of the pipeline. We removed the bias from the training data for signal combination by assigning higher weights to the samples whose label matches the side without recognizer outputs, and lower weights to the others.
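The re-weighting can be illustrated with a small sketch. It assumes that an unbiased dataset would have half of its labels on each side, which is why `target_fraction` defaults to 0.5; that default, like the function name, is a hypothetical choice for illustration and not a value taken from our experiments.

```python
def debias_weights(label_matches_recognizer_side, target_fraction=0.5):
    """Per-sample weights that move the weighted share of matching labels to the target.

    label_matches_recognizer_side: list of booleans, one per "either" sample.
    """
    n = len(label_matches_recognizer_side)
    n_match = sum(label_matches_recognizer_side)
    if n_match == 0 or n_match == n:
        raise ValueError("both groups must be present to re-balance")
    observed = n_match / n
    # Below 1 when the matching group is over-represented, above 1 otherwise.
    w_match = target_fraction / observed
    # The complementary group gets the complementary correction.
    w_other = (1.0 - target_fraction) / (1.0 - observed)
    return [w_match if m else w_other for m in label_matches_recognizer_side]


flags = [True] * 8 + [False] * 2
weights = debias_weights(flags)
```

With eight matching and two non-matching samples, the matching samples receive weight 0.625 and the others 2.5, so both groups carry the same total weight and the sum of the weights equals the number of samples.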
As for the “neither” data, there is little we can do other than simply compare the LangID scores. The accuracy is the classification accuracy of the acoustic LangID model, and it is about as of now. The next sections discuss accurate predictions on the “either” and “both” data.
3.3 Lattice Regression-based Ensemble Model
Lattice regression with monotonic calibrations is a fast-to-evaluate and interpretable tool to combine signals to implement regression and classification [Gupta2016Monotonic]. We implemented a lattice-based model and fine-tuned its parameters to combine signals for LangID.
The lattice regression model is an ensemble of sub-models. The input features are calibrated using monotonic functions with linear inequality constraints. Then, a sub-model only uses eight randomly-selected features from the calibrated inputs. Our fine-tuning experiments conclude that slightly more emphasis on the acoustic model’s scores () generates a higher accuracy — we apply three lattices on , and all the other features have two lattices. Therefore, each sub-model is one of , , or lattice functions, depending on whether from the two sides are used.
Guaranteeing Symmetry One shortcoming of the lattice-based model is its lack of symmetry. The lattice-based model represents the probability that one language is preferred over the other, but when the two languages are exchanged it does not, in general, generate the opposite probability. Although the lattice model was trained with symmetric samples, the symmetry is not guaranteed during inference. As a workaround, we compute a symmetrized score in prediction.
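One plausible form of such a workaround is sketched below. The averaging of the forward probability with the complement of the reversed probability is an assumption chosen for illustration, and `symmetrized_probability` and `asymmetric_model` are hypothetical names; the toy model is deliberately asymmetric.

```python
def symmetrized_probability(model, signals_a, signals_b):
    """Average the forward score with the complement of the reversed score."""
    forward = model(list(signals_a) + list(signals_b))   # P(a preferred over b)
    backward = model(list(signals_b) + list(signals_a))  # P(b preferred over a)
    return 0.5 * (forward + (1.0 - backward))


def asymmetric_model(features):
    """Toy stand-in whose outputs for (a, b) and (b, a) do not sum to 1."""
    return 0.3 + 0.5 * features[0] - 0.1 * features[1]


p_ab = symmetrized_probability(asymmetric_model, [0.9], [0.3])
p_ba = symmetrized_probability(asymmetric_model, [0.3], [0.9])
```

Here the raw model gives 0.72 and 0.36 for the two orderings, which sum to 1.08, whereas the symmetrized values 0.68 and 0.32 sum to 1. Note that a correction applied only at prediction time is not seen during training.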
3.4 Deep Neural Network Models
While the lattice-based model is relatively usable, we ran out of options to further improve its accuracy, as the model has only a few parameters to tune. To further improve the accuracy of multi-language identification, we experimented with a deep neural network.
Symmetric Score Function As discussed in the previous section, one shortcoming of the lattice-based model is its lack of symmetry. It has a workaround, but the workaround introduces an inconsistency between training and inference. With the additional flexibility of a deep neural network (DNN), we designed an architecture that guarantees symmetry without treating training and prediction differently.
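To make the idea of symmetry by construction concrete, here is a minimal NumPy sketch. It is not our exact model: it passes the input and the flipped input through the same layers, but combines the two outputs with a sigmoid of their difference rather than with the scaled cosine similarity that we found to work best. The ReLU activation, the layer widths, and all names are assumptions made for the demonstration.

```python
import numpy as np


def mlp(x, weights, biases):
    """Shared fully-connected tower; ReLU on hidden layers (assumed activation)."""
    h = x
    for w, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(h @ w + b, 0.0)
    return h @ weights[-1] + biases[-1]


def symmetric_probability(x_a, x_b, weights, biases):
    """P(a preferred over b), with P(a, b) + P(b, a) = 1 for any parameters."""
    forward = mlp(np.concatenate([x_a, x_b]), weights, biases)
    flipped = mlp(np.concatenate([x_b, x_a]), weights, biases)
    logit = (forward - flipped).item()  # changes sign when a and b are exchanged
    return 1.0 / (1.0 + np.exp(-logit))


rng = np.random.default_rng(0)
sizes = [12, 16, 8, 1]  # hypothetical layer widths for the demo
weights = [0.1 * rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [0.1 * rng.normal(size=n) for n in sizes[1:]]
x_a, x_b = rng.normal(size=6), rng.normal(size=6)
p_ab = symmetric_probability(x_a, x_b, weights, biases)
p_ba = symmetric_probability(x_b, x_a, weights, biases)
```

Since the same function is used in training and in prediction, no post-hoc correction is needed at inference time.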
Fig. 3 shows the general architecture of our DNN model. The input passes through a few fully-connected layers and generates a first score; then, the flipped input, with the two languages exchanged, passes through the same layers and generates a second score. The loss function is defined as follows (when , otherwise ):
A combining function defines how we combine the two scores. Experimentally, we find that the scaled cosine similarity function [wan2018] gives the best performance:
Network Architecture Fig. 3 also shows the elements in each of the fully connected layers: activation, dropout, and normalization. We tested several different activation functions. With only a few features in the network, dropout improves model performance. Normalization is another powerful tool to regularize the model. We tested BatchNorm [BatchNormalization] and LayerNorm [ba2016layer] to stabilize the network performance. We also tested different numbers of layers and numbers of units per layer, including “128-64-32-16-8” and “64-64-64-64-64”, as well as residual connections [he2015deep] between them.
Further Improving the Accuracy There are a few tools in deep neural networks that have the potential to improve accuracy, to name a few: voting (model ensembles), parameter smoothing, and data augmentation.
An ensemble model is a good way to “boost” the accuracy of existing models. We trained several models using the same architecture, and we tested two methods to combine their results. Majority voting gives the same weight to all the models; averaging the probability scores of the models, on the other hand, generates an even higher accuracy.
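The two combination rules can be contrasted with a short sketch. The array layout (one row per model, one column per sample) and the 0.5 decision threshold are assumptions of this example, and the probabilities are made up to show a case where the two rules disagree.

```python
import numpy as np


def majority_vote(probabilities):
    """True where more than half of the models prefer the first language."""
    votes = np.asarray(probabilities) > 0.5
    return votes.sum(axis=0) * 2 > votes.shape[0]  # a tie counts as False


def average_probability(probabilities):
    """True where the mean probability across models exceeds 0.5."""
    return np.asarray(probabilities).mean(axis=0) > 0.5


# Rows: three models. Columns: two samples.
probs = np.array([[0.55, 0.9],
                  [0.55, 0.8],
                  [0.10, 0.3]])
by_vote = majority_vote(probs)
by_mean = average_probability(probs)
```

For the first sample, two weakly confident models outvote one strongly confident model under majority voting, while averaging lets the confident model dominate (mean 0.4). Averaging therefore uses the confidence of each model, which majority voting discards.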
Parameter smoothing prevents the evaluation metrics from oscillating frequently. We applied an exponential moving average to all trainable parameters in the model, but we found that the accuracy slightly decreased after applying parameter smoothing.
Our dataset contains missing features (Section 3.2). During our experiments with missing features (the “either” dataset), we found that training on both the “either” and “both” datasets increases the testing accuracy on the “both” dataset. This leads to the hypothesis that missing features increase the generalizability of the model. When testing only on the “both” dataset, we found that training with data augmented by masking either side of the “both” training set yields higher accuracy than training without augmented data.
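The masking augmentation can be sketched as follows. Several details are assumptions made for illustration: each sample is laid out as two blocks of six signals with the acoustic LangID score first in each block, a missing signal is represented by a fill value of 0.0, and the side to mask is chosen uniformly at random. The name `mask_one_side` is hypothetical.

```python
import numpy as np


def mask_one_side(features, rng, n_per_side=6, acoustic_index=0, fill_value=0.0):
    """Copy "both" samples and remove the recognizer signals of one side per row."""
    augmented = np.array(features, dtype=float, copy=True)
    sides = rng.integers(0, 2, size=augmented.shape[0])  # 0: first side, 1: second
    for row, side in enumerate(sides):
        start = side * n_per_side
        keep = start + acoustic_index
        kept_value = augmented[row, keep]
        augmented[row, start:start + n_per_side] = fill_value
        augmented[row, keep] = kept_value  # acoustic LangID score stays available
    return augmented


rng = np.random.default_rng(0)
both = np.arange(1, 25, dtype=float).reshape(2, 12)  # two toy "both" samples
masked = mask_one_side(both, rng)
training_set = np.concatenate([both, masked], axis=0)
```

The masked copies mimic the “either” data, so the model sees samples with one-sided recognizer outputs even when it is trained only from “both” samples.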
4 Experimental Results
We compared multiple model configurations to find the best model. The lattice-regression model has a similar implementation to [tflattice]. We implemented the deep neural network models with TensorFlow. This section presents the performance of these models.
Best DNN Model Configuration The inputs are re-scaled to the range with a simple transformation: we applied on features in the range of ; for the features with a larger span, like the model loss, we took on the features and normalized them to the range. We used a nabla-shaped architecture: the numbers of hidden units in the fully connected layers are “12(input)-128-64-32-16-8.” All of the layers use activation, dropout, and layer normalization [ba2016layer]. In the loss function, is initialized with the identity matrix and is initialized to . All in the fully-connected layers are initialized using He initialization [he2015delving]; are initialized with . We trained models using the same configuration and averaged their probability outputs before making a prediction. The batch size in training is ; we used the Adam optimizer [kingma2014adam] with an initial base learning rate of . We decayed the base learning rate to , , , at 10M, 20M, 30M, and 40M steps respectively. We stopped at 50M steps, when the performance no longer changed.
Table 2 shows the performance of the DNN model and the lattice-regression model compared with the baseline setting that uses the acoustic LangID scores only. For the “both” data, the language identification error rate is reduced by ; there is error rate reduction for “either” data; and the error rate is unchanged for “neither” data, since there is no additional information for these data. The identification error rate is reduced by on the whole testing dataset.
The models are further tested on the multilingual speech recognition system [Gonzalez2015Multilingual]. The system uses the model proposed in this paper to determine the spoken language and turns the utterance into text with the speech recognizer of the corresponding language when the languages are known to the system. The average word error rate is . With the DNN model, we reduce the identification error rate to and achieve word error rate (WER), which is better than the lattice model.
In this paper, we show that a deep neural network is a powerful tool for signal combination in the multi-language identification problem. Experimental results show reduction in error rate. The deep neural network model is easier to train and has more flexibility to fine-tune, and it outperforms our fine-tuned lattice-based model.
We believe there is more potential in using deep neural networks for signal combination than in using lattice regression. Our future work is to further improve the accuracy of signal combination and to better support multi-language speech recognition.