Attention based end to end Speech Recognition for Voice Search in Hindi and English

by Raviraj Joshi, et al.

We describe here our work with automatic speech recognition (ASR) in the context of voice search functionality on the Flipkart e-Commerce platform. Starting with the deep learning architecture of Listen-Attend-Spell (LAS), we build upon and expand the model design and attention mechanisms to incorporate innovative approaches including multi-objective training, multi-pass training, and external rescoring using language models and phoneme based losses. We report a relative WER improvement of 15.7% over the base LAS model using these modifications. Overall, we report an improvement of 36.9% over the phoneme-CTC system. The paper also provides an overview of different components that can be tuned in a LAS-based system.




1. Introduction

Voice search on mobile or smart devices has gained a lot of attention in the recent past and the rapid progress in the domain of speech recognition is one of the primary reasons for that. Tech giants have invested heavily in voice-based assistants such as the Google Assistant, Apple Siri, Amazon Alexa, and Microsoft Cortana (Hoy, 2018). Voice communication is also a more natural form of interaction and engagement as compared to a keyboard-based interface and requires minimal familiarity with the written form of the language.

Further, in India, keyboard-based interactions have added problems. The native languages use non-Latin scripts and, moreover, informal conversations tend to have a multi-lingual character. The voice interface is significantly more important for e-commerce platforms like Flipkart as it eases the product discovery process, something that has a direct impact on revenue. Voice search in the local language can help people who are not comfortable in English to confidently search and order products online. Our work is concerned with voice search in Hindi and English, and primarily describes the Automatic Speech Recognition (ASR) component of the voice search pipeline.

An ASR system performs transcription of the spoken language. It can be seen as a function mapping spoken audio to a sequence of words. While it is always desirable to perform this conversion directly from speech to words, the amount of training data was a limiting factor in learning such an extensive mapping. Therefore, instead of modeling words as output, much smaller units called phonemes were used to learn a shared representation across words. Phonemes are elementary linguistic units of sound that distinguish one word from another in a language. The model trained to convert speech (acoustic features) to phonemes is termed the acoustic model. A separate pronunciation model was used to map the sequence of phonemes to words. A language model is further employed to select the most likely sequence of words. As the amount of training data has increased, end-to-end models have been used to directly convert speech to characters (graphemes), words, or sub-words.

Traditionally, ASR systems were built using a combination of Hidden Markov Models (HMM) and Gaussian Mixture Models (GMM) (Bourlard and Morgan, 2012). With the popularity of Deep Neural Networks (DNNs), these systems transitioned to a hybrid DNN-HMM approach (Mohamed et al., 2011). The recent trend is towards building an all-neural approach to ASR (Prabhavalkar et al., 2017). These end-to-end systems for ASR fuse individual acoustic, pronunciation, and language models into a single neural network model. These models directly map speech signals to grapheme (i.e., the basic unit of a written language), word, or sub-word sequences. The approaches used in these models include connectionist temporal classification (CTC), recurrent neural network (RNN) transducer, and attention-based encoder-decoder (Graves et al., 2013; Graves, 2012; Chan et al., 2015).

In this work, we focus on end-to-end models based on attention. We explore the Listen-Attend-Spell (LAS) model for the task of ASR in the e-commerce voice search domain (Chan et al., 2015). LAS is an attention-based encoder-decoder architecture. The acoustic encoder in this model is termed the Listener and the decoder is termed the Speller. The speller directly emits output symbols in the form of a character or word sequence (depending on the set of subwords that it is trained on). We explore a series of modifications on top of the basic LAS model. These modifications concern the training process and aim to build models that are more robust in a noisy environment. We employ different attention mechanisms, emphasize the importance of dropout and layer-norm, evaluate multi-pass and multi-objective training, and use external re-scoring. Although results have been reported on LSTM-based LAS, the modifications can also be applied to Transformer-based models (Dong et al., 2018; Karita et al., 2019; Synnaeve et al., 2019; Gulati et al., 2020).

2. Related Work

In this section, we review literature related to ASR and Indian languages. There has been limited research in the area of Hindi speech recognition due to its low-resource nature. Initial works in isolated Hindi speech recognition used HMM-based systems trained using the HTK toolkit on very limited datasets (Kumar and Aggarwal, 2011; Kumar et al., 2012; Choudhary et al., 2013). HMM-based systems were also built for continuous speech recognition in Hindi (Kumar et al., 2014, 2004).

More recent works have been biased towards building a single multi-lingual system employing end-to-end speech recognition models. These aim to exploit the usage of multiple related languages to overcome resource constraints. A single multi-lingual LAS-based model was trained on 10 Indian languages in (Toshniwal et al., 2018). The resources used in this work were proprietary and consisted of around 3 lakh (300,000) utterances for each language. The resources for Telugu, Tamil, and Gujarati languages were introduced in the INTERSPEECH 2018 Low Resource Speech Recognition Challenge for Indian languages (Srivastava et al., 2018). The multi-lingual system submissions to this shared task included GMM-HMM, Time Delay Neural Networks (TDNN), TDNN-LSTM, and RNN-CTC based approaches (Fathima et al., 2018; Vydana et al., 2018; Billa, 2018). Finally, transfer learning-based approaches from resource-rich to low-resource languages using an RNN-T based system were studied in (Joshi et al., 2020).

Our work is specifically concerned with the Hindi and English languages and does not use any of the publicly available resources. The English text was transliterated to Hindi using our in-house transliteration APIs. Hence, the output text is in the Devanagari script for both Hindi and English utterances. We build upon the recent advancements in English ASR systems (Prabhavalkar et al., 2017) and apply them to the voice search task for Hindi and English.

Figure 1. LAS Architecture

Basic LAS architecture used in our work

3. Listen-Attend-Spell

LAS is based on the encoder-decoder family of sequence-to-sequence architectures (Sutskever et al., 2014). These seq2seq models accept a variable-length sequence as input and also produce a variable-length sequence as output. Some of the classic applications of seq2seq models are machine translation, document summarization, speech to text, and text to speech (Jean et al., 2014; Nallapati et al., 2016; Ren et al., 2019). The encoder and decoder are typically a stack of recurrent neural networks (RNN/LSTM), although recently Transformers have replaced RNN-type units in most of the tasks. The addition of an attention mechanism in the encoder-decoder framework has significantly improved the performance of these models (Bahdanau et al., 2014). The encoder encodes the input sequence into a sequence of latent representations. The decoder learns to attend to the relevant segment of the entire encoder representation at each output time step by means of an attention mechanism. In other words, the attention mechanism creates a context vector at each output time step (representing the attended information of the encoder sequence) to be utilized by the decoder for generating the next output token.

LAS is also an encoder-decoder architecture with attention for speech-to-text conversion. More formally, it converts an input sequence $x = (x_1, \dots, x_T)$ into a sequence of output tokens $y = (y_1, \dots, y_S)$. The input is a sequence of spectral features, typically the log mel-spectrogram of the audio calculated over moving overlapping windows of the utterance. The output is a sequence of graphemes (or other sub-word units) in the corresponding script (Devanagari in this case). Two special symbols, the start of sequence <sos> and end of sequence <eos>, are added to the start and end of the output sequence respectively. The output at each time step is modeled as a conditional distribution over previous tokens and the input signal, $P(y_i \mid x, y_{<i})$. The probability of the entire sequence is modeled as:

$$P(y \mid x) = \prod_{i} P(y_i \mid x, y_{<i}) \tag{1}$$

The encoder or listener is a bi-directional LSTM (Long Short Term Memory) that encodes the input spectra $x$ into a hidden representation $h = (h_1, \dots, h_U)$ with $U \le T$:

$$h = \mathrm{Listen}(x) \tag{2}$$

As the input sequence can be considerably long, the listener uses a pyramidal structure to reduce the overall computational complexity. This pyramidal structure involves a multi-layer Bi-LSTM that, at any layer besides the initial one, concatenates two consecutive time-steps of the previous layer to halve the number of timesteps. There are four Bi-LSTM layers and the output is compressed after the second, third, and fourth layer to achieve a compression of 8 times. The following equations describe the normal Bi-LSTM layer and the pyramidal Bi-LSTM, denoted as pBi-LSTM. These demonstrate the computation of the output for the $i$-th encoder timestep and $j$-th layer.

$$h_i^j = \mathrm{BiLSTM}(h_{i-1}^j, h_i^{j-1}) \tag{3}$$

$$h_i^j = \mathrm{pBiLSTM}(h_{i-1}^j, [h_{2i}^{j-1}, h_{2i+1}^{j-1}]) \tag{4}$$

The attend and spell module consumes this hidden representation and the previous output tokens to generate a probability distribution over the set of output tokens for each output step.

$$P(y_i \mid x, y_{<i}) = \mathrm{AttendAndSpell}(h, y_{<i}) \tag{5}$$


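The speller is a 2-layer uni-directional LSTM and the attend function uses the Bahdanau attention mechanism (Bahdanau et al., 2014). It uses the speller (decoder) hidden state and context vector to generate a distribution over the next token. The attention context is computed as

$$c_i = \sum_{u} \alpha_{i,u} h_u \tag{6}$$

$$\alpha_{i,u} = \frac{\exp(e_{i,u})}{\sum_{u'} \exp(e_{i,u'})} \tag{7}$$

$$e_{i,u} = v^\top \tanh(W s_i + V h_u + b) \tag{8}$$

where $u$ runs over the encoder steps, $s_i$ is the speller state at output step $i$, and $h_u$ is the encoder representation at step $u$; $W$, $V$, $v$, and $b$ are the learnable parameters of the attention mechanism. The attention mechanism computes the relevance of each encoder time-step using the current decoder state. The context vector is the expected value of the encoder representations under the probability distribution computed by the attention mechanism. The context vector is used by the speller to compute the distribution over the next token as follows

$$s_i = \mathrm{LSTM}(s_{i-1}, y_{i-1}, c_{i-1}) \tag{9}$$

$$P(y_i \mid x, y_{<i}) = \mathrm{softmax}(\mathrm{Dense}([s_i; c_i])) \tag{10}$$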
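As a rough illustration of the content-based attention step, the following NumPy sketch computes the energies, the softmax weights, and the context vector for a single decoder step. The function name, the parameter names (`W`, `V`, `v`, `b`), and all shapes are assumptions made for this example and are not taken from the authors' implementation.

```python
import numpy as np


def additive_attention(s, h, W, V, v, b):
    """Return (context, weights) for one decoder step.

    s: decoder state, shape (d_s,)
    h: encoder outputs, shape (U, d_h)
    W: (d_a, d_s), V: (d_a, d_h), v: (d_a,), b: (d_a,)
    """
    # Additive (Bahdanau-style) energy for every encoder step.
    energies = np.tanh(h @ V.T + W @ s + b) @ v      # shape (U,)
    energies -= energies.max()                        # numerical stability
    weights = np.exp(energies) / np.exp(energies).sum()
    # Context vector: attention-weighted average of the encoder outputs.
    context = weights @ h                             # shape (d_h,)
    return context, weights


# Hypothetical sizes: 6 encoder steps, 4-dim encoder features, 3-dim decoder state.
rng = np.random.default_rng(0)
h = rng.normal(size=(6, 4))
s = rng.normal(size=3)
W, V = rng.normal(size=(5, 3)), rng.normal(size=(5, 4))
v, b = rng.normal(size=5), np.zeros(5)
context, weights = additive_attention(s, h, W, V, v, b)
```

Because the weights are non-negative and sum to one, the context vector is a convex combination of the encoder representations, which is the "expected value" reading given in the text.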


An abstract representation of the basic LAS architecture used in this work is shown in Fig. 1. This is slightly different from the original LAS description in terms of the number of layers and pyramidal compression. We use 5 Bi-LSTM encoder layers with pyramidal compression after every two layers. This gives us 4x compression in time dimension as opposed to 8x used in the original work. This was done because the voice search queries are comparatively shorter in length (a mean of 5.5 sec) and 4x compression was sufficient.
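The time reduction performed by the pyramidal layers can be sketched as follows. This is a minimal NumPy illustration of the frame-pair concatenation only; the Bi-LSTM layers that would sit between the two reductions are omitted, and the function name and array sizes are hypothetical.

```python
import numpy as np


def pyramid_reduce(frames: np.ndarray) -> np.ndarray:
    """Concatenate each pair of consecutive time steps, halving the length.

    frames has shape (time, features); an odd trailing frame is dropped.
    """
    usable = frames.shape[0] - frames.shape[0] % 2
    return frames[:usable].reshape(usable // 2, 2 * frames.shape[1])


# Hypothetical encoder activations: 100 time steps of 8-dimensional features.
hidden = np.arange(100 * 8, dtype=np.float32).reshape(100, 8)
once = pyramid_reduce(hidden)    # 50 time steps
twice = pyramid_reduce(once)     # 25 time steps, i.e. 4x compression
```

Two such reductions give the 4x compression in the time dimension used in this work; a third reduction would give the 8x compression of the original LAS description.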

Some structural and optimization improvements over the base LAS were suggested in (Chiu et al., 2018), resulting in state-of-the-art results on the Google Voice Search task. We have incorporated these changes in the base LAS model and considered it as the new baseline. This set of changes, along with their corresponding hyperparameters, is described below. Some of these modifications did not work for us directly, so we also highlight the design choices we made to tackle them.

  • Wordpiece models: The base LAS uses graphemes or character units as the output token at each step. The output units can also be entire words or phonemes. Systems using words as output units suffer from out-of-vocabulary (OOV) problems, while the usage of phonemes requires additional grapheme-to-phoneme conversion systems. The middle ground is to use sub-word units that are longer than character units and do not suffer from the OOV problem. Sub-word units allow us to model a strong decoder having lower perplexity as compared to character units and also require fewer timesteps, thus reducing memory and time overhead. Given the superior performance, all the models described in this work output sub-word units. We use the Google SentencePiece library to train a unigram sub-word model and use it for output tokenization (Kudo and Richardson, 2018). The dataset used to train this model is the text side of the Voice Search speech-to-text parallel corpus. The sub-word vocabulary size used is 5k.

  • Multi-headed attention: The original implementation of the attention mechanism employed a single head to create a single context vector. Multi-headed attention uses multiple attention heads to create multiple context vectors. These context vectors are concatenated to create the final context vector. The importance of scaled dot product multi-headed attention was first shown in (Vaswani et al., 2017) for the task of neural machine translation. We have used multi-headed Bahdanau attention instead of scaled dot product attention as it provided better performance for our ASR task. In general, multi-headed attention allows the model to focus on different regions of the audio in a better way as compared to single-headed attention. Equations 7 and 8 are modified as follows to incorporate independent attention heads indexed by $k$, where $s_i$ is the decoder state at output step $i$ and $h_u$ is the encoder representation at step $u$.

    $$\alpha^{k}_{i,u} = \frac{\exp(e^{k}_{i,u})}{\sum_{u'} \exp(e^{k}_{i,u'})} \tag{11}$$

    $$e^{k}_{i,u} = v_k^\top \tanh(W_k s_i + V_k h_u + b_k) \tag{12}$$

    $$c_i = f([c^{1}_i; \dots; c^{K}_i]), \quad c^{k}_i = \sum_{u} \alpha^{k}_{i,u} h_u \tag{13}$$

    Each attention head has independent learnable parameters; the size of each head is 128 units. The function $f$ in the above equation is a dense layer having weights and bias as parameters.

  • Scheduled Sampling: The decoder of seq2seq models uses the previously predicted token and context vector to predict the next token. During training, instead of using the last predicted token, the correct last ground-truth token is passed to the model. This method is known as teacher forcing and it is known to aid model training during the initial epochs. However, this also results in a disparity between training and inference: at inference time we have to rely on the last predicted token. To bridge this gap, scheduled sampling was introduced in (Bengio et al., 2015) for training RNN-based sequence-to-sequence models. In this approach, at each decoder step, the last predicted token is passed as input instead of the ground-truth token with a given sampling probability.

    The exact process as described in (Chiu et al., 2018) used teacher forcing for the first 1 million training steps, after which scheduled sampling is enabled with a probability of 0.4. However, we found this abrupt introduction of scheduled sampling problematic in our case, leading to convergence issues. We, therefore, introduce scheduled sampling gradually.

    We have used a maximum scheduled sampling rate of 0.3 and it is enabled after 20 epochs. The sampling rate is linearly increased at a rate of 0.02 per epoch from 0 to 0.3 starting from the 20th epoch. The scheduled sampling parameters (the rate of increase and the starting epoch) were empirically determined for our dataset, ensuring that the model is sufficiently trained before enabling scheduled sampling.

  • Minimum Word Error Rate Training: The cross-entropy (CE) loss function is used in the training of LAS. The CE loss at each decoder time step between the predicted token and ground truth token is summed over the entire sequence. However, there is a disparity between the final metric, word error rate (WER), and the loss function used to optimize the network. In order to bridge this gap, a minimum word error rate (MWER) loss function is used along with the CE loss function. The MWER loss minimizes the expected value of word errors across all the possible sequences or hypotheses.

    $$\mathcal{L}_{\mathrm{MWER}} = \mathbb{E}_{P(y \mid x)}\left[W(y, y^*)\right] = \sum_{y} P(y \mid x)\, W(y, y^*) \tag{14}$$

    The function $W(y, y^*)$ gives the number of word errors between a hypothesis $y$ and the ground truth $y^*$. Equation 14 is intractable as we need to sum over all the possible candidate hypotheses. To solve this problem, an N-best list approximation is used as described in (Povey, 2005). We experiment with this N-best approximation.

  • Label Smoothing: Label smoothing is a model regularization technique aimed at making model predictions less confident (Szegedy et al., 2016). The idea is to make the output labels noisy by giving a very small non-zero probability to the non-ground-truth labels. This is shown to have a regularizing effect, thereby reducing overfitting. The new label distribution $q'$ for $K$ classes in terms of the original (one-hot) distribution $q$ is given by

    $$q'(k) = (1 - \epsilon)\, q(k) + \frac{\epsilon}{K} \tag{15}$$

    The smoothing parameter $\epsilon$ is set to 0.1 in our experiments.

  • Second pass rescoring: Second pass rescoring refers to a shallow fusion of the model with an external language model (LM). Although the LAS architecture involves an implicit LM in the form of its contextualized decoder, it is trained on a limited amount of transcripts. We can leverage a large amount of unlabelled text data to build a stronger LM. The usage of this external LM for rescoring the beams from the model has been shown to give reasonable improvements. We train an n-gram language model using KenLM (Heafield, 2011) for external rescoring.
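Three of the training-time modifications listed above reduce to short formulas, sketched here in Python. The function names and default arguments are hypothetical. The scheduled sampling ramp assumes the rate is exactly 0 at epoch 20 and grows by 0.02 per epoch afterwards, and the MWER function assumes hypothesis probabilities are renormalised over the N-best list; neither detail is spelled out in the text.

```python
import numpy as np


def scheduled_sampling_rate(epoch, start_epoch=20, step=0.02, max_rate=0.3):
    """Linear ramp: 0 before start_epoch, then +step per epoch up to max_rate."""
    if epoch < start_epoch:
        return 0.0
    return min(max_rate, step * (epoch - start_epoch))


def smooth_labels(one_hot, epsilon=0.1):
    """Mix a one-hot target with the uniform distribution over K classes."""
    num_classes = one_hot.shape[-1]
    return (1.0 - epsilon) * one_hot + epsilon / num_classes


def mwer_nbest_loss(log_probs, word_errors):
    """Expected number of word errors over an N-best list.

    log_probs: model log-probabilities of the N hypotheses.
    word_errors: word errors of each hypothesis against the reference.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    word_errors = np.asarray(word_errors, dtype=float)
    probs = np.exp(log_probs - log_probs.max())
    probs /= probs.sum()          # renormalise over the N-best list (assumption)
    return float(np.sum(probs * word_errors))


target = smooth_labels(np.array([0.0, 0.0, 1.0, 0.0]))
rates = [scheduled_sampling_rate(e) for e in (10, 20, 25, 35, 60)]
```

Under these assumptions the sampling rate reaches its ceiling of 0.3 fifteen epochs after it is enabled and stays there for the rest of training.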

4. Connectionist Temporal Classification (CTC)

In this section, we provide a brief introduction to CTC-based models. The CTC (encoder only) model monotonically maps an input sequence $x = (x_1, \dots, x_T)$ to an output sequence $y$ of shorter length. CTC is alignment-free and does not require alignment information between audio and text. Each input frame is mapped to a target character (token). Because the number of input frames is typically much higher than the number of target tokens (each input frame would correspond to about 20 ms of audio), CTC uses a special blank token and repeated tokens when the spoken segment corresponding to a token spans multiple frames. This leads to multiple possible alignments for the same output sequence. The CTC criterion sums the probabilities over all possible alignments $\pi$ to get the log-likelihood of the target sequence $y$. Specifically,

$$\log P(y \mid x) = \log \sum_{\pi \in \mathcal{B}^{-1}(y)} \prod_{t=1}^{T} P(\pi_t \mid x) \tag{16}$$

where $\mathcal{B}$ is a mapping operation that removes the repeating tokens and the blank token to get the final sequence. The probability can be computed efficiently using a forward-backward algorithm. To get the frame-wise character distribution, a simple LSTM network can be used. Therefore, a CTC network can be visualized as a network of stacked LSTM layers trained using the CTC criterion.
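To make the collapsing operation and the sum over alignments concrete, here is a small brute-force sketch. It enumerates every frame-level path, which is only feasible for toy inputs; practical systems use the forward-backward algorithm instead. The blank index and the function names are assumptions of this example.

```python
from itertools import groupby, product

import numpy as np

BLANK = 0  # assumed index of the blank token


def ctc_collapse(path):
    """Merge repeated tokens, then drop blanks."""
    merged = [token for token, _ in groupby(path)]
    return tuple(token for token in merged if token != BLANK)


def ctc_likelihood(frame_probs, target):
    """Brute-force sum over every alignment that collapses to target.

    frame_probs has shape (frames, vocab); only practical for tiny inputs.
    """
    num_frames, vocab = frame_probs.shape
    total = 0.0
    for path in product(range(vocab), repeat=num_frames):
        if ctc_collapse(path) == tuple(target):
            total += float(np.prod(frame_probs[np.arange(num_frames), list(path)]))
    return total


# Toy example: 3 frames, vocabulary {blank, 1, 2}, uniform frame distributions.
uniform = np.full((3, 3), 1.0 / 3.0)
likelihood = ctc_likelihood(uniform, (1,))
```

In this toy case six of the 27 possible paths collapse to the single-token target, which shows how many alignments can map to the same transcript.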

5. Methods

This section describes the proposed methodology to further enhance the LAS model.

5.1. Dropout and Layer normalization

The dropout and layer norm are standard techniques used in deep learning to reduce overfitting and to improve the training time respectively (Ba et al., 2016; Srivastava et al., 2014). However, these basic techniques provide us with considerable improvements over the baseline LAS and hence we discuss them separately. The domain of voice search for e-commerce is quite large. There are millions of products that cannot be completely captured in the training data. Moreover, product terms will suffer from under-representation and over-representation in the dataset. The problem of overfitting is quite prevalent with LAS models in such circumstances. The usage of such regularization techniques, therefore, helps us move towards better generalization on unseen data. We apply a block of layer normalization followed by dropout after each Bi-LSTM layer in the encoder and on the output of the two uni-LSTM layers in the decoder before the dense layer. A dropout rate of 0.3 is found to be optimal.
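The regularization block described above (layer normalization followed by dropout on a layer's output) can be sketched in NumPy as below. The learnable gain and bias of layer normalization are left out for brevity, and the function names are hypothetical; only the dropout rate of 0.3 comes from the text.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalise each feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero units with probability rate, rescale the rest."""
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)


def regularize(layer_output, rng, rate=0.3, training=True):
    """Layer norm followed by dropout, applied to one layer's output."""
    return dropout(layer_norm(layer_output), rate, rng, training)


rng = np.random.default_rng(0)
activations = rng.normal(size=(4, 16))   # hypothetical (time, features) output
train_out = regularize(activations, rng)
eval_out = regularize(activations, rng, training=False)
```

At inference time dropout is disabled, so the block reduces to layer normalization alone.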

5.2. Multi-pass training

The voice search data used for training is a noisy dataset. The noise can either take the form of a background sound or a secondary speaker. This motivated us to explore training using comparatively clean data sets. These datasets will be described in later sections. We, therefore, perform multi-pass training to train the LAS model. In the two-pass approach, the model is first trained using comparatively cleaner datasets followed by the training on target voice search data sets. We show that such a training scheme works comparatively better than voice search only training or mixed training. We also explore three pass training where the model is first trained using clean datasets, followed by a mix of clean and voice search data and final training on only voice search data. In this pre-training approach, we use a fixed number of epochs to perform initial training. This fixed number of epochs is set to 150 and (150 + 50) epochs for the two-pass and three-pass approaches respectively.

5.3. Multi-objective training

The LAS model is trained using the cross-entropy criterion. The advantage of LAS over the CTC-based model is that it explicitly exploits conditional dependence using the decoder architecture. The CTC-based models predict the output at each encoder timestep independently. However, CTC-based models enforce a monotonic alignment between audio and text, which is not explicitly captured in LAS models (Kim et al., 2017). We explore joint multi-objective training using both the CTC and LAS CE criteria. The encoder layers are shared between the CTC and LAS-based models. For the CTC model, a dense layer is added on top of the encoder output, followed by a softmax which generates the label distribution over sub-words for each encoder time-step. These models are jointly trained, and the loss function is defined as a weighted sum of the LAS cross-entropy loss and the CTC loss.

The value of the interpolation weight is set to 0.8. During decoding, we only use the LAS network. This is also a form of multi-task learning wherein the model is also trained on auxiliary tasks to improve the generalization of the model (Ruder, 2017). In this case, the auxiliary task and parent task are the same except that they utilize different objective functions. Another form of multi-objective training, using the MWER objective function, was described earlier.
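A minimal sketch of the joint objective is given below. It shows the CTC head (a dense layer and softmax on the shared encoder output) and the interpolation of the two losses. Which of the two losses receives the weight of 0.8 is not recoverable from the text, so applying it to the cross-entropy term is an assumption of this example, as are the function names and shapes.

```python
import numpy as np


def ctc_head(encoder_out, weights, bias):
    """Dense layer plus softmax: one label distribution per encoder step."""
    logits = encoder_out @ weights + bias
    logits = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)


def joint_loss(ce_loss, ctc_loss, weight=0.8):
    """Interpolate the two losses; weighting the CE term is an assumption."""
    return weight * ce_loss + (1.0 - weight) * ctc_loss


# Hypothetical shapes: 10 encoder steps, 16 features, 6 output labels.
rng = np.random.default_rng(0)
encoder_out = rng.normal(size=(10, 16))
label_probs = ctc_head(encoder_out, rng.normal(size=(16, 6)), np.zeros(6))
total = joint_loss(ce_loss=1.0, ctc_loss=2.0)
```

Since the CTC head is only an auxiliary branch on the shared encoder, it can be discarded at decoding time, which matches the statement that only the LAS network is used for inference.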

5.4. Location Aware Attention

The Bahdanau attention described in equations 6, 7, and 8 can also be termed content-based attention. The attention vector thus computed depends upon the values of the key and query vectors. A complementary attention mechanism that takes the location of the previous attention into consideration was described in (Chorowski et al., 2015). This form of attention is more suitable for ASR tasks because of the monotonic alignment between speech and text. The location-aware attention is formulated as follows:

$$f_i = F * \alpha_{i-1}$$

$$e_{i,u} = v^\top \tanh(W s_i + V h_u + U f_{i,u} + b)$$

Here $\alpha_{i-1}$ denotes the previous alignment vector. The vector is convolved with the matrix $F$ to get features $f_{i,u}$ corresponding to each encoder time step $u$. The features are then passed to the energy function along with the query vector $s_i$ and key vector $h_u$. The variables $v$, $W$, $V$, $U$, $F$, and $b$ are the parameters of the model. This attention mechanism is also used in a multi-headed setup as described earlier.
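The location-aware variant can be sketched by adding convolutional features of the previous alignment to the additive energy. This is an illustrative single-head NumPy version; the parameter names, the filter count, and the filter width are assumptions, and the filter width must not exceed the number of encoder steps for the shapes to line up.

```python
import numpy as np


def location_aware_attention(s, h, prev_align, F, W, V, U, v, b):
    """One decoder step of location-aware additive attention.

    s: (d_s,), h: (T, d_h), prev_align: (T,)
    F: (n_filters, width) 1-D convolution filters, width <= T
    W: (d_a, d_s), V: (d_a, d_h), U: (d_a, n_filters), v: (d_a,), b: (d_a,)
    """
    # One location feature vector per encoder step from the previous alignment.
    loc = np.stack([np.convolve(prev_align, f, mode="same") for f in F], axis=1)
    energies = np.tanh(h @ V.T + loc @ U.T + W @ s + b) @ v
    weights = np.exp(energies - energies.max())
    weights /= weights.sum()
    return weights @ h, weights


# Hypothetical sizes: 7 encoder steps, 2 filters of width 3.
rng = np.random.default_rng(0)
h = rng.normal(size=(7, 4))
s = rng.normal(size=3)
prev_align = np.full(7, 1.0 / 7.0)
F = rng.normal(size=(2, 3))
W, V, U = rng.normal(size=(5, 3)), rng.normal(size=(5, 4)), rng.normal(size=(5, 2))
v, b = rng.normal(size=5), np.zeros(5)
context, weights = location_aware_attention(s, h, prev_align, F, W, V, U, v, b)
```

The returned weights would be fed back as `prev_align` at the next decoder step, which is how the mechanism tracks where it attended previously.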

5.5. Phoneme-CTC based rescoring

The output of LAS-based models is not explicitly dependent on the length of the audio. The model relies on the generation of the <eos> token to mark the end of the sentence. This might cause the LAS model to generate extra undesirable tokens or to miss some tokens. To overcome this problem, we use a CTC-based model trained to predict phoneme units to re-score the beams generated by the LAS model. The phoneme-based model is expected to penalize the addition of unwanted characters or the deletion of required characters. It consists of a 5-layer LSTM architecture (700 units) followed by a dense layer. This is used in conjunction with the external language model for final rescoring. The final score for each beam element is computed as a weighted combination of the LAS score, the external language model score, and the phoneme-CTC score.

The values of the two combination weights are determined using grid search on a separate validation set.
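The rescoring and weight search can be sketched as follows. The log-linear form of the combined score, the dictionary keys, and the use of sentence-level errors in place of WER during the grid search are all simplifying assumptions of this example; the scores in the toy beam list are invented for illustration.

```python
from itertools import product


def rescore(beams, lm_weight, ctc_weight):
    """Return the transcript with the highest combined log score."""
    def total(beam):
        return beam["las"] + lm_weight * beam["lm"] + ctc_weight * beam["ctc"]
    return max(beams, key=total)["text"]


def grid_search(validation_set, weights):
    """Pick the (lm_weight, ctc_weight) pair with the fewest sentence errors."""
    def errors(pair):
        return sum(rescore(beams, *pair) != reference
                   for beams, reference in validation_set)
    return min(product(weights, repeat=2), key=errors)


# Hypothetical beam list: LAS alone prefers a hypothesis with a repeated word.
beams = [
    {"text": "red shoes", "las": -1.0, "lm": -2.0, "ctc": -1.0},
    {"text": "red shoes shoes", "las": -0.8, "lm": -3.0, "ctc": -4.0},
]
best_weights = grid_search([(beams, "red shoes")], weights=[0.0, 0.5])
best_text = rescore(beams, *best_weights)
```

In this toy case the phoneme-CTC score penalises the hypothesis with the extra token, so a non-zero weight on it recovers the correct transcript.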

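| Model | Rescoring | Clean test | Noisy test | Clean test + noisy test |
|---|---|---|---|---|
| Phoneme CTC system | | 12.77 | 16.9 | 14.87 |
| Base LAS (single pass) | no | 10.94 | 13.03 | 12.0 |
| Base LAS (single pass) | yes | 9.95 | 12.25 | 11.12 |
| Base LAS (single pass) dropout + layernorm | no | 9.49 | 11.56 | 10.54 |
| Base LAS (single pass) dropout + layernorm | yes | 9.12 | 11.16 | 10.15 |
| LAS + two pass + dropout + layernorm (LASR) | no | 8.87 | 10.58 | 9.74 |
| LAS + two pass + dropout + layernorm (LASR) | yes | 8.63 | 10.37 | 9.51 |
| LASR + Location Attention | | 8.43 | 10.41 | 9.44 |
| LASR + Joint CTC objective | | 8.42 | 10.5 | 9.47 |
| LAS + three pass + dropout + layernorm | | 8.5 | 10.22 | 9.37 |

Table 1. Word Error Rate (WER) for different model variations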

6. Experiments

6.1. Dataset Details

Two datasets are used in this work, viz. general domain speech data and Flipkart voice search (VS) app data. Flipkart is a primary e-commerce player in India and serves products in electronics, fashion, beauty, grocery, lifestyle, and other categories. The voice search app data consists of audio samples collected from the Flipkart application and manually transcribed by the operations team. It consists of approximately 3 million samples with an equal distribution of Hindi and English sentences. The general domain data is a comparatively clean dataset and has around 4 million crowdsourced samples. Taken together, we have approximately 4000 hours of VS data and 6500 hours of general domain data. For experiments involving two-pass training, the initial training is done on general data followed by noisy VS app data. In the three-pass approach, general data training is followed by training on (general data + VS app data) and finally on VS app data only.

The test and validation data consists of 9593 and 10000 samples respectively from the voice search domain. The test data is further divided into two categories depending upon the presence of noise or secondary speaker in a test sample. The first category is termed as ’clean test’ and consists of only intelligible primary speaker speech. The second set, termed ’noisy test’, consists of a primary speaker and secondary speaker but the secondary speech is not intelligible. The first set consists of 4675 examples and the second has 4918 examples. We do not consider the cases where secondary speech is intelligible.

6.2. Preprocessing

The input features to the model are standard log-Mel spectrograms. The spectrogram is computed using a window size of 20 ms and an overlap of 10 ms. The number of mel filterbank coefficients used is 80. We augment the input features by applying time and frequency masking as described in (Park et al., 2019). A random range of consecutive mel-frequency channels and a random range of consecutive time steps are masked using this process; the maximum width of each mask is a parameter of the augmentation. Features from three consecutive audio frames are concatenated to give a single encoder timestep feature of size 240.
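The masking and frame-stacking steps can be sketched in NumPy as below. The maximum mask widths (15 mel channels and 50 frames), the use of a single mask of each type, masking with zeros, and non-overlapping groups of three frames are assumptions of this example; the text only fixes the 80 mel channels and the resulting feature size of 240.

```python
import numpy as np


def mask_spectrogram(spec, max_freq_width, max_time_width, rng):
    """Zero one random band of mel channels and one random span of frames.

    spec has shape (frames, mel_bins); a masked copy is returned.
    """
    masked = spec.copy()
    frames, bins = spec.shape
    f = rng.integers(0, max_freq_width + 1)      # width of the frequency mask
    f0 = rng.integers(0, bins - f + 1)           # first masked mel channel
    masked[:, f0:f0 + f] = 0.0
    t = rng.integers(0, max_time_width + 1)      # width of the time mask
    t0 = rng.integers(0, frames - t + 1)         # first masked frame
    masked[t0:t0 + t, :] = 0.0
    return masked


def stack_frames(spec, group=3):
    """Concatenate non-overlapping groups of consecutive frames."""
    usable = spec.shape[0] - spec.shape[0] % group
    return spec[:usable].reshape(usable // group, group * spec.shape[1])


rng = np.random.default_rng(0)
spec = np.ones((300, 80))                        # hypothetical 300-frame utterance
masked = mask_spectrogram(spec, max_freq_width=15, max_time_width=50, rng=rng)
features = stack_frames(masked)                  # 100 encoder inputs of size 240
```

Stacking three 80-dimensional frames gives the 240-dimensional encoder input mentioned in the text and also shortens the sequence by a factor of three before it reaches the listener.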

Figure 2. Sample output from the best LAS and phoneme-CTC system. GT Text is the ground truth text. A marker in the GT text indicates the presence of background noise. The utterances for which the LAS model performs better are placed at the top, whereas the utterances transcribed correctly by the phoneme-CTC system are placed at the bottom. The red color indicates an incorrect token (modification or addition) and the orange color indicates its correct counterpart.

6.3. Results and Discussion

Table 1 describes the results of the experiments discussed in Sections 3 and 5. The word error rate (WER) on test data is used to compare these model variations. The pyramidal LAS model along with modifications like scheduled sampling, label smoothing, MHA, and sub-word unit outputs is termed Base LAS. The baseline system uses the phoneme-CTC model and its WER numbers are reported for comparison. To ensure a fair comparison, these systems were trained on the same data with the same preprocessing and post-processing. Inference was done on GPU for all the systems, which had comparable latency requirements. We discuss the relative improvements with the model modifications on the entire test set. During decoding, the standard beam-search algorithm was used with a beam size of 10.

The Base LAS system reports a 25.2% relative performance improvement over the phoneme-CTC system. We see that the addition of dropout and layer norm to the Base LAS system improves the performance by 8.7%. The addition of two-pass training utilizing transcription data further improves the results by 6.3%. The multi-objective training and location-aware attention were independently added on top of this two-pass model. These additions provide a small improvement of around 0.7%. Finally, the three-pass training approach provides an improvement of 1.5% as compared to the two-pass approach.

We also see that external rescoring provides an improvement of 7.3% on the Base LAS system and 2.3% on the two-pass system. The phoneme and LM-based rescoring contribute equally to these improvements.

Overall, the best configuration reports an improvement of 36.9% over the phoneme-CTC system and 15.7% over the Base LAS model. Also, note that the relative improvements on clean data are 33.4% (relative to the phoneme-CTC system) and 14.57% (relative to Base LAS), as opposed to 39.5% and 16.5% on the noisy data. This shows that LAS-based models and the modifications are more robust to noisy data. Some sample transcriptions from the LAS model and phoneme-CTC system are shown in Fig. 2.
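The relative improvements quoted in this section follow from the WER values in Table 1. The short helper below, whose name is hypothetical, shows the arithmetic using the combined-test-set numbers from the table.

```python
def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction of new_wer over baseline_wer, in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Combined test set WERs taken from Table 1.
over_phoneme_ctc = relative_improvement(14.87, 9.37)   # best model vs phoneme-CTC
over_base_las = relative_improvement(11.12, 9.37)      # best model vs Base LAS
rescoring_gain = relative_improvement(12.0, 11.12)     # Base LAS, with vs without rescoring
```

The comparison with Base LAS uses its rescored WER of 11.12. The overall figure evaluates to roughly 36.99%, which the text reports as 36.9%.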

7. Conclusion

In conclusion, we demonstrate the superior performance of attention-based speech recognition models on the Flipkart Voice Search Task. We incorporate a series of enhancements on top of the latest Listen-Attend-Spell model in the form of regularization, multi-pass training, multi-objective training, and external rescoring. We show that these simple yet effective structural cum data-related modifications give us a significant boost in performance.


  • J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §5.1.
  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §3, §3.
  • S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer (2015) Scheduled sampling for sequence prediction with recurrent neural networks. arXiv preprint arXiv:1506.03099. Cited by: 3rd item.
  • J. Billa (2018) ISI asr system for the low resource speech recognition challenge for indian languages.. In INTERSPEECH, pp. 3207–3211. Cited by: §2.
  • H. A. Bourlard and N. Morgan (2012) Connectionist speech recognition: a hybrid approach. Vol. 247, Springer Science & Business Media. Cited by: §1.
  • W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals (2015) Listen, attend and spell. arXiv preprint arXiv:1508.01211. Cited by: §1, §1.
  • C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, et al. (2018) State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4774–4778. Cited by: 3rd item, §3.
  • J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio (2015) Attention-based models for speech recognition. arXiv preprint arXiv:1506.07503. Cited by: §5.4.
  • A. Choudhary, M. Chauhan, and M. G. Gupta (2013) Automatic speech recognition system for isolated and connected words of hindi language by using hidden markov model toolkit (htk). Association of computer electronics and electrical engineers (ACEEE). Cited by: §2.
  • L. Dong, S. Xu, and B. Xu (2018) Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5884–5888. Cited by: §1.
  • N. Fathima, T. Patel, C. Mahima, and A. Iyengar (2018) TDNN-based multilingual speech recognition system for low resource indian languages. In INTERSPEECH, pp. 3197–3201. Cited by: §2.
  • A. Graves, A. Mohamed, and G. Hinton (2013) Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 6645–6649. Cited by: §1.
  • A. Graves (2012) Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711. Cited by: §1.
  • A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al. (2020) Conformer: convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100. Cited by: §1.
  • K. Heafield (2011) KenLM: faster and smaller language model queries. In Proceedings of the sixth workshop on statistical machine translation, pp. 187–197. Cited by: 6th item.
  • M. B. Hoy (2018) Alexa, siri, cortana, and more: an introduction to voice assistants. Medical reference services quarterly 37 (1), pp. 81–88. Cited by: §1.
  • S. Jean, K. Cho, R. Memisevic, and Y. Bengio (2014) On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007. Cited by: §3.
  • V. Joshi, R. Zhao, R. R. Mehta, K. Kumar, and J. Li (2020) Transfer learning approaches for streaming end-to-end speech recognition system. arXiv preprint arXiv:2008.05086. Cited by: §2.
  • S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y. Soplin, R. Yamamoto, X. Wang, et al. (2019) A comparative study on transformer vs rnn in speech applications. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 449–456. Cited by: §1.
  • S. Kim, T. Hori, and S. Watanabe (2017) Joint ctc-attention based end-to-end speech recognition using multi-task learning. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4835–4839. Cited by: §5.3.
  • T. Kudo and J. Richardson (2018) Sentencepiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226. Cited by: 1st item.
  • A. Kumar, M. Dua, and T. Choudhary (2014) Continuous hindi speech recognition using monophone based acoustic modeling. International Journal of Computer Applications 24. Cited by: §2.
  • K. Kumar, R. Aggarwal, and A. Jain (2012) A hindi speech recognition system for connected words using htk. International Journal of Computational Systems Engineering 1 (1), pp. 25–32. Cited by: §2.
  • K. Kumar and R. Aggarwal (2011) Hindi speech recognition system using htk. International Journal of Computing and Business Research 2 (2), pp. 2229–6166. Cited by: §2.
  • M. Kumar, N. Rajput, and A. Verma (2004) A large-vocabulary continuous speech recognition system for hindi. IBM journal of research and development 48 (5.6), pp. 703–715. Cited by: §2.
  • A. Mohamed, G. E. Dahl, and G. Hinton (2011) Acoustic modeling using deep belief networks. IEEE transactions on audio, speech, and language processing 20 (1), pp. 14–22. Cited by: §1.
  • R. Nallapati, B. Zhou, C. Gulcehre, B. Xiang, et al. (2016) Abstractive text summarization using sequence-to-sequence rnns and beyond. arXiv preprint arXiv:1602.06023. Cited by: §3.
  • D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le (2019) Specaugment: a simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779. Cited by: §6.2.
  • D. Povey (2005) Discriminative training for large vocabulary speech recognition. Ph.D. Thesis, University of Cambridge. Cited by: 4th item.
  • R. Prabhavalkar, K. Rao, T. N. Sainath, B. Li, L. Johnson, and N. Jaitly (2017) A comparison of sequence-to-sequence models for speech recognition. In Interspeech, pp. 939–943. Cited by: §1, §2.
  • Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T. Liu (2019) Fastspeech: fast, robust and controllable text to speech. arXiv preprint arXiv:1905.09263. Cited by: §3.
  • S. Ruder (2017) An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098. Cited by: §5.3.
  • B. M. L. Srivastava, S. Sitaram, R. K. Mehta, K. D. Mohan, P. Matani, S. Satpal, K. Bali, R. Srikanth, and N. Nayak (2018) Interspeech 2018 low resource automatic speech recognition challenge for indian languages. In SLTU, pp. 11–14. Cited by: §2.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §5.1.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215. Cited by: §3.
  • G. Synnaeve, Q. Xu, J. Kahn, T. Likhomanenko, E. Grave, V. Pratap, A. Sriram, V. Liptchinsky, and R. Collobert (2019) End-to-end asr: from supervised to semi-supervised learning with modern architectures. arXiv preprint arXiv:1911.08460. Cited by: §1.
  • C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: 5th item.
  • S. Toshniwal, T. N. Sainath, R. J. Weiss, B. Li, P. Moreno, E. Weinstein, and K. Rao (2018) Multilingual speech recognition with a single end-to-end model. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4904–4908. Cited by: §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. arXiv preprint arXiv:1706.03762. Cited by: 2nd item.
  • H. K. Vydana, K. Gurugubelli, V. V. R. Vegesna, and A. K. Vuppala (2018) An exploration towards joint acoustic modeling for indian languages: iiit-h submission for low resource speech recognition challenge for indian languages, interspeech 2018. In INTERSPEECH, pp. 3192–3196. Cited by: §2.