Large vocabulary continuous speech recognition (LVCSR) systems are becoming increasingly relevant for industry, tracking the technological trend toward increased human interaction using voice-operated devices. However, even for accurate LVCSR systems, certain obstacles can diminish the user experience. For multilingual speakers, one such obstacle is the monolingual character of LVCSR systems, meaning that users are limited to speaking a single, preset language. One way to solve this problem is to use a spoken Language Identification (LID) system to select the most likely language of the spoken utterance. In some cases, multilingual LVCSR systems can also take advantage of the fact that most multilingual speakers only speak a limited number of languages. For example, the user can be asked to pre-select candidate languages from the super-set of all available languages. The LID system can then condition its decision on this tuple of candidate languages, for example by discarding other, potentially misleading languages. We refer to this problem as the Tuple-Conditioned Multi-Class Classification (TCMCC) problem. We argue that most LID systems optimized using the softmax cross-entropy loss (which we refer to as the softmax loss for simplicity) may be sub-optimal for the TCMCC problem, as the softmax loss only optimizes the evaluation metric for the case in which the tuple contains all available languages. To solve this problem, we extend the idea of conditioning the LID system to the training stage by introducing a new loss that directly optimizes the TCMCC problem. We refer to this loss as the tuplemax loss. Moreover, we show that as the tuple size grows, the tuplemax and the softmax losses converge asymptotically, which is consistent with the provable property that the tuplemax loss can be regarded as a generalization of the softmax loss.
Most modern LID systems are composed of two processing stages. First, a fixed-length vector representation of the utterance (i.e., an embedding) is used to encode the language information of the input sequence. Then, a discriminative $N$-class classifier is trained on the space of the embeddings to generate scores for each language. So far, the most successful types of embeddings are those computed using neural networks, which have been shown to match or surpass the performance of the more traditional i-vector embeddings, particularly on short utterances. In our LID model, which is based on a multi-layer deep Long Short-Term Memory (LSTM) architecture, the embedding vector is given by the last activation vector of the sequence. This temporal pooling operation also allows the system to provide scores with minimal latency and without requiring any right context or padding. The second stage is usually governed by a discriminative Gaussian classifier trained on the space of the embeddings, potentially followed by an additional calibration system. In our model, we choose to directly use the probabilities produced by the last layer, an approach that provides a simpler end-to-end optimization. However, we believe that the tuplemax loss could also be beneficial for some of the systems based on loosely connected classifiers, particularly in cases where the network computing the embeddings is optimized using the cross-entropy loss of a softmax layer.
2 Tuplemax Loss
Each training utterance is first processed to generate a feature sequence $x$ of 40-dimensional log-mel-filterbank features and a label $y$, where $y \in \{1, \dots, N\}$ and $N$ is the number of target languages. Let $z = f(x; w)$ be the $N$-dimensional output of the last layer of the network. $z$ represents the unnormalized distribution of the input $x$ over the $N$ target languages, where $w$ represents the parameters of the network. Let $z_k$ be the predicted unnormalized probability of the $k$-th language.
The general language identification problem is formulated as a standard multi-class classification problem, which maximizes the probability of the correct label and minimizes that of all the others, equally. This is represented as Eq. (1), where $L_S$ is the softmax cross-entropy loss:
$$L_S(y, z) = -\log \frac{\exp(z_y)}{\sum_{k=1}^{N} \exp(z_k)} \tag{1}$$
However, at inference time, the system must choose from only two or three languages (a tuple of size $n = 2$ or $3$) which the user has picked in their settings beforehand. There is an exponential number of possible language combinations that users may choose, so it is impractical to train an individual model for each possible language set. Therefore, we must train a single model that outputs scores for each of the supported languages and have the inference system pick the best language from the user-selected subset. This discrepancy between $N$, the number of supported languages, and $n$, the size of the user-selected subset, means that the standard softmax loss might not be optimal for our problem. The majority of multilingual speakers speak two languages, which makes pairwise classification particularly interesting.
Practical Example Consider two distributions with the first language as ground truth: and . The softmax cross-entropy loss for both outputs is . However, the second output is significantly better than the first one, since the system will always return the correct language for all the language pairs that contain the true language and any other one based on the second output. In contrast, the first output will have one incorrect pair.
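To make the argument concrete, the following sketch uses hypothetical output values. They are illustrative assumptions, not the values of the example above. The two outputs assign the same probability to the true language, and hence have the same softmax loss, but they differ in how many language pairs they classify correctly. The helper names `softmax_loss` and `wrong_pairs` are introduced here for illustration only.

```python
import math


def softmax_loss(logits, label):
    """Softmax cross-entropy loss of one example (natural logarithm)."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[label]


def wrong_pairs(logits, label):
    """Languages k for which the pair {label, k} would be misclassified."""
    return [k for k, z in enumerate(logits) if k != label and z >= logits[label]]


# Hypothetical outputs, written as log-probabilities; language 0 is the truth.
output_a = [math.log(p) for p in (0.40, 0.55, 0.05)]
output_b = [math.log(p) for p in (0.40, 0.30, 0.30)]

loss_a = softmax_loss(output_a, 0)
loss_b = softmax_loss(output_b, 0)

print(loss_a, loss_b)            # both equal -log(0.4)
print(wrong_pairs(output_a, 0))  # the pair with language 1 is misclassified
print(wrong_pairs(output_b, 0))  # every pair is classified correctly
```

Under these assumed values the softmax loss cannot tell the two outputs apart, although only the second one is correct for every language pair that contains the true language.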
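In order to model this pairwise relationship, we define the pairwise loss in Eq. (2):
$$L_T(y, z) = -\mathbb{E}_{k \neq y}\left[\log \frac{\exp(z_y)}{\exp(z_y) + \exp(z_k)}\right] = -\frac{1}{N-1} \sum_{k \neq y} \log \frac{\exp(z_y)}{\exp(z_y) + \exp(z_k)} \tag{2}$$
where $\mathbb{E}_{s}$ denotes the expectation over a set $s$, $z_k$ is the network output for the $k$-th language, $y$ is the correct label and $N$ is the number of target languages. More generally, let $S_y^n$ be the collection of all the tuples with $n$ elements that include the correct label $y$, with $1 < n \le N$. The loss becomes:
$$L^{n}(y, z) = -\mathbb{E}_{s \in S_y^n}\left[\log \frac{\exp(z_y)}{\sum_{k \in s} \exp(z_k)}\right] \tag{3}$$
There are $\binom{N-1}{n-1}$ tuples in $S_y^n$. In particular, the pairwise loss and the softmax cross-entropy loss of multi-class classification are two special cases: for $n = 2$, $L^{2}$ is the pairwise loss of Eq. (2), and for $n = N$, $L^{N}$ is the softmax loss of Eq. (1).
Eq. (3) defines a loss for the case in which we want to produce a label from a subset of $n$ labels. The total loss is the sum of the losses for all tuple sizes, each weighted by the probability of that size. We call it the tuplemax loss:
$$L(y, z) = \sum_{n=2}^{N} p_n L^{n}(y, z) \tag{4}$$
where $p_n$ is the probability of tuples of size $n$ and $L^{n}$ is the loss associated with it.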
Eq. (4) says that if we have prior knowledge of the subset size distribution $p_n$, the optimal loss function for training the neural network is the tuplemax loss weighted by that distribution. If the subset is always the set of all $N$ labels, we should use the softmax loss. In practice, most users specify only two languages. In our experiments we therefore train and evaluate our model with a focus on the pairwise relationship, which is only a special case of the tuplemax loss defined in Eq. (4).
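The relationship between the tuple losses, the pairwise loss and the softmax loss can be checked numerically. The sketch below is a minimal reference implementation under stated assumptions: it enumerates every tuple explicitly, the logits are hypothetical, and the function names (`tuple_loss`, `tuplemax_loss`) are ours. It is meant to illustrate the definitions, not to reproduce the training code.

```python
import math
from itertools import combinations


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def tuple_loss(logits, label, n):
    """Mean restricted-softmax loss over all size-n tuples containing `label`."""
    if not 2 <= n <= len(logits):
        raise ValueError("tuple size must satisfy 2 <= n <= N")
    others = [k for k in range(len(logits)) if k != label]
    losses = []
    for rest in combinations(others, n - 1):
        members = [logits[label]] + [logits[k] for k in rest]
        losses.append(log_sum_exp(members) - logits[label])
    return sum(losses) / len(losses)


def tuplemax_loss(logits, label, size_probs):
    """Weighted sum of tuple losses; size_probs maps tuple size n to p_n."""
    return sum(p * tuple_loss(logits, label, n) for n, p in size_probs.items())


logits = [1.2, 0.3, -0.5, 2.0]  # hypothetical network output, N = 4
label = 0

softmax = log_sum_exp(logits) - logits[label]
pairwise = sum(
    math.log(1.0 + math.exp(z - logits[label]))
    for k, z in enumerate(logits) if k != label
) / (len(logits) - 1)

print(tuple_loss(logits, label, 4), softmax)   # n = N recovers the softmax loss
print(tuple_loss(logits, label, 2), pairwise)  # n = 2 recovers the pairwise loss
print(tuplemax_loss(logits, label, {2: 0.5, 4: 0.5}))
```

Explicit enumeration grows combinatorially with the tuple size, so it is only practical for small $n$ or small $N$. For the pairwise case there are just $N - 1$ tuples per example.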
3 Model Description
Our LID model is trained to distinguish among 79 different target languages. To speed up the training process, we truncate utterances that are longer than 4 seconds. This should cover most of the utterances in full, as our VoiceSearch query logs show that the average utterance length is 4.3 seconds. After truncation, our input sequences are at most 400 frames long. We then compute unnormalized logits over the target languages through a stack of LSTM networks. The loss function is defined over these unnormalized logits for the correct label against all other wrong labels, as depicted in Figure 1.
3.1 Network Architecture
The first layer of the network is a concatenation layer, which concatenates every two neighboring frames, thus doubling the frame dimension and halving the sequence length. Although this concatenation layer results in more parameters in the bottom LSTM layer, it significantly speeds up training, since we only have to unroll the LSTM for 200 steps instead of 400.
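The effect of the concatenation layer on the sequence shape can be sketched in a few lines. This is an illustrative stand-in using plain Python lists. The function name `concatenate_frames` and the choice to drop a trailing unpaired frame are assumptions, since the handling of odd-length sequences is not specified here.

```python
def concatenate_frames(frames):
    """Stack each pair of neighboring frames into one frame of twice the size."""
    # Assumption: a trailing frame without a partner is dropped.
    usable = len(frames) - len(frames) % 2
    return [frames[i] + frames[i + 1] for i in range(0, usable, 2)]


# 400 frames of 40-dimensional features (dummy values).
frames = [[float(t)] * 40 for t in range(400)]
stacked = concatenate_frames(frames)

print(len(stacked), len(stacked[0]))  # 200 80
```

The sequence length is halved to 200 steps while the per-step dimension doubles to 80, which is what makes the bottom LSTM layer larger but the unrolling shorter.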
Following the concatenation layer is a stack of 4 LSTM layers. Each LSTM layer, except the last one, is followed by a projection layer of 256 units. The number of cells in each LSTM layer can be found in Table 1. For example, “LSTM(1024, 256)” denotes an LSTM layer with 1024 memory cells and a projection layer with 256 units. Our LSTM layers also have a pyramid-like shape, where the bottom layer has 1024 memory cells and the top layer has only 256 memory cells. Experimentally, we find that both adding projection layers and reducing the size of the layers further up in the network significantly speed up training and inference without hurting performance. After the LSTM layers comes a temporal pooling layer that simply takes the last activation of the LSTM outputs, followed by a ReLU and a linear projection to the number of languages.
|Index||Input Output Size||Layer Specification|
|0||Concatenation & subsampling|
|5||Last frame output|
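At inference time, the user has specified a subset of languages $s$ within which our decision must be made. Our prediction $\hat{y}$ of the spoken language can be represented as:
$$\hat{y} = \arg\max_{k \in s} z_k$$
where $z_k$ is the network output for the $k$-th language.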
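Restricting the decision to the user-selected subset amounts to taking the highest-scoring language among the candidates only. A minimal sketch follows. The function name `predict_language` and the scores are hypothetical.

```python
def predict_language(scores, candidates):
    """Return the candidate language with the highest network score."""
    return max(candidates, key=lambda language: scores[language])


# Hypothetical unnormalized scores for four supported languages.
scores = {"en-US": 2.1, "es-US": 1.7, "fr-FR": 2.6, "de-DE": -0.3}

unrestricted = predict_language(scores, list(scores))
restricted = predict_language(scores, ["en-US", "es-US"])

print(unrestricted)  # fr-FR
print(restricted)    # en-US
```

In this made-up case the highest-scoring language overall is not among the user's candidates, so conditioning on the tuple changes the decision.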
At inference time, the utterance length can vary. Following the standard practice in speaker recognition, we divide long utterances into overlapping windows of fixed length. The final output is the average of the network responses at the end of each fixed-length window (shown in Figure 2):
$$z = \frac{1}{T} \sum_{t=1}^{T} f(x_t; w)$$
where $T$ is the number of windows, $x_t$ is the input segment for the $t$-th sliding window and $f(x_t; w)$ is the corresponding network response. This sliding-window approach makes the system more robust to long utterances, for which the state vector of the LSTM network may diverge if the LSTM is unrolled in full.
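The windowed averaging can be sketched as follows. The window length of 400 frames matches the training truncation, but the hop of 200 frames, the dropping of frames beyond the last full window, and the `toy_network` stand-in are all assumptions made for illustration. The actual overlap is not specified here.

```python
def sliding_windows(frames, window, hop):
    """Split a frame sequence into overlapping fixed-length windows."""
    if len(frames) <= window:
        return [frames]
    # Simplification: frames beyond the last full window are ignored.
    starts = range(0, len(frames) - window + 1, hop)
    return [frames[s:s + window] for s in starts]


def average_response(frames, network, window=400, hop=200):
    """Average the network response over all sliding windows."""
    responses = [network(w) for w in sliding_windows(frames, window, hop)]
    size = len(responses[0])
    return [sum(r[k] for r in responses) / len(responses) for k in range(size)]


def toy_network(window_frames):
    """Stand-in for the LID network: returns a two-dimensional 'response'."""
    return [float(window_frames[-1]), float(len(window_frames))]


frames = list(range(1000))  # 1000 dummy scalar frames
windows = sliding_windows(frames, window=400, hop=200)
averaged = average_response(frames, toy_network)

print(len(windows))  # 4
print(averaged)      # [699.0, 400.0]
```

Each window is processed independently, so the recurrent state is never unrolled for more than one window length.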
4.1 Training and Evaluation Setup
Our model is trained to distinguish between $N = 79$ different languages. Each language has a training set of between 1M and 60M utterances and an evaluation set of 20K utterances. All evaluation utterances and part of the training utterances come from a supervised set transcribed by human transcribers. The other training utterances are collected from anonymized user voice queries, where the language label is obtained from either the user settings or a previous LID system. Before training, we remove the non-speech parts from the training utterances and add procedurally generated noise to them. Both training and evaluation data are mostly collected from monolingual speakers.
On the evaluation data, we compare the performance of the traditional softmax loss and the proposed tuplemax loss. For the tuplemax loss we use a tuple size of two (the pairwise loss, $n = 2$), since most of our multilingual traffic has two languages.
As the evaluation metric, we use the averaged accuracy/error for pairwise classification: we compute the accuracy/error for many two-language binary classification tasks, and report the average value over the two-language pairs. Specifically, we report the averaged errors both on all language pairs and on the most frequent language pairs. For example, “es-US vs. en-US” is one of the top language pairs, meaning that a significant number of our users in the US speak both Spanish and English.
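One plausible way to compute such an averaged pairwise error is sketched below. The exact averaging protocol is an assumption: here each ordered pair (true language, competing language) contributes one error rate, and pairs without any utterance are skipped. The function name and the toy data are hypothetical.

```python
from itertools import permutations


def average_pairwise_error(examples, num_languages):
    """Mean error over ordered pairs (i, j), using utterances whose label is i.

    `examples` is a list of (label, logits) tuples.
    """
    errors = []
    for i, j in permutations(range(num_languages), 2):
        subset = [logits for label, logits in examples if label == i]
        if not subset:
            continue  # no utterance with this ground-truth language
        wrong = sum(1 for logits in subset if logits[j] >= logits[i])
        errors.append(wrong / len(subset))
    return sum(errors) / len(errors)


# Toy evaluation set with three languages.
examples = [
    (0, [2.0, 1.0, 0.0]),
    (0, [0.5, 1.0, 0.0]),
    (1, [0.0, 3.0, 1.0]),
    (2, [1.0, 0.0, 0.5]),
]

error = average_pairwise_error(examples, num_languages=3)
print(error)  # 0.25
```

Restricting the loop to a list of frequent pairs would give the corresponding figure for the top language pairs.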
4.2 Pairwise Confusion Matrices
In Fig. 3 (a) and (b), we present the language classification confusion matrices of the softmax model and the tuplemax model, respectively. Each element $a_{ij}$ of the confusion matrix represents the classification accuracy on the subset of utterances that have ground-truth label $i$ and for which the user has selected both language $i$ and language $j$ as preferred languages. In other words, $a_{ij}$ is the fraction of utterances with ground-truth label $i$ and user preference $\{i, j\}$ that are correctly classified as language $i$. The diagonal elements are set to a constant value, because a pair with $i = j$ is meaningless.
In Fig. 3 (a), we can see several big “yellow” clusters on the left half of the image. This implies that the softmax loss, despite being optimal for $N$-class classification, leads to unbalanced classification results when the tuple size is smaller than $N$. This is because pairwise accuracy is not modeled by the softmax loss function, as in the practical example given in Section 2. These “yellow” clusters do not appear in Fig. 3 (b), which indicates that the classification accuracy is well balanced for the tuplemax loss.
4.3 Evaluation Results
Fig. 4 shows that training the softmax model on the standard top-1 $N$-class classification task ($n = N$) produces steady improvements in top-1 classification error over time. However, there is a lack of correlation between the top-1 classification error and the evaluation error. From Fig. 5, we can see that the tuplemax error is consistent with the evaluation error.
In Fig. 6, we compare the performance of the tuplemax and softmax losses at different numbers of training steps. The network architecture is kept fixed for both models, while the learning rate is optimized to achieve the best possible training accuracy in each case. Results show that the tuplemax loss provides steady improvements over time. In contrast, the results of the softmax loss keep oscillating after convergence. This oscillation of the pairwise error rate of the softmax model can make the process of training LID models difficult, and requires constant evaluation during training. In our evaluation results, we saw that the error rate of the softmax model can increase by as much as a factor of 2 between checkpoints that are close in time.
In Table 2 we also compare the tuplemax model performance with the softmax model performance using multiple strategies for model checkpoint selection. Even at its best, the softmax model still falls behind the tuplemax model, both on all language pairs and on the top language pairs.
In this paper, we propose a novel loss function named tuplemax loss, designed for the Tuple-Conditioned Multi-Class Classification problem, a type of classification task where the decision is restricted to a known subset of candidates at inference time. Specifically, we focus on the language identification application, where in most multilingual scenarios the user speaks two languages. Our experiments show that the tuplemax loss is preferable to the softmax loss: it produces better and more balanced results, and its convergence is significantly more stable.
|Loss||Checkpoint||All language pairs||Top language pairs|
Best on test¹
¹ Picking the best checkpoint based on test data is not a fair protocol and does not reflect the actual model performance. Averaging over all checkpoints gives a fairer estimate of model performance. The best checkpoint on test data is reported for softmax only to illustrate that tuplemax is better than all the checkpoints produced by softmax.
-  Johan Schalkwyk, Doug Beeferman, Françoise Beaufays, Bill Byrne, Ciprian Chelba, Mike Cohen, Maryam Kamvar, and Brian Strope, ““your word is my command”: Google search by voice: A case study,” in Advances in Speech Recognition, pp. 61–90. Springer, 2010.
-  Javier Gonzalez-Dominguez, David Eustis, Ignacio Lopez-Moreno, Andrew Senior, Françoise Beaufays, and Pedro J Moreno, “A real-time end-to-end multilingual speech recognition architecture,” IEEE Journal of Selected Topics in Signal Processing, vol. 9, no. 4, pp. 749–759, 2015.
-  Javier Gonzalez-Dominguez, Ignacio Lopez-Moreno, Haşim Sak, Joaquin Gonzalez-Rodriguez, and Pedro J Moreno, “Automatic language identification using long short-term memory recurrent neural networks,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
-  Javier Gonzalez-Dominguez, Ignacio Lopez-Moreno, Pedro J Moreno, and Joaquin Gonzalez-Rodriguez, “Frame-by-frame language identification in short utterances using deep neural networks,” Neural Networks, vol. 64, pp. 49–58, 2015.
-  Radek Fér, Pavel Matějka, František Grézl, Oldřich Plchot, and Jan Černockỳ, “Multilingual bottleneck features for language recognition,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
-  David Martinez, Oldřich Plchot, Lukáš Burget, Ondřej Glembek, and Pavel Matějka, “Language recognition in ivectors space,” in Twelfth Annual Conference of the International Speech Communication Association, 2011.
-  David Snyder, Daniel Garcia-Romero, Alan McCree, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, “Spoken language recognition using x-vectors,” in Odyssey: The Speaker and Language Recognition Workshop, Les Sables d’Olonne, 2018.
-  Ignacio Lopez-Moreno, Javier Gonzalez-Dominguez, Oldrich Plchot, David Martinez, Joaquin Gonzalez-Rodriguez, and Pedro Moreno, “Automatic language identification using deep neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 5337–5341.
-  Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez-Moreno, “Generalized end-to-end loss for speaker verification,” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4879–4883, 2018.