model was applied to the machine translation task. The model contains an encoder neural network with recurrent layers, which encodes the entire input sequence (i.e. the message in the source language) into an internal fixed-length vector representation. This vector is the input to the decoder, another set of recurrent layers with a final softmax layer, which, in each recurrent iteration, predicts the probabilities for the next symbol of the output sequence (i.e. the message in the target language). This work deals with the task of automatic speech recognition (ASR), where the seq2seq model is used to map a sequence of speech features into a sequence of characters. In particular, we use an attention based seq2seq model, where the encoder encodes an input sequence into another internal sequence of the same length. The attention mechanism then focuses on the relevant portion of the internal sequence in order to predict each next output symbol using the decoder. Seq2seq models are typically trained to maximize the conditional likelihood (or minimize the cross-entropy) of the correct output symbols. For predicting the current character, the previous character (e.g. its one-hot encoding) from the ground truth sequence is typically fed as an auxiliary input to the decoder during training. This so-called teacher forcing helps the decoder to learn an internal language model (LM) for the output sequences. During normal decoding, the last predicted character is fed back instead of the unavailable ground truth. Using such a training strategy, the attention based seq2seq model has been shown to absorb and jointly learn all the components of a traditional ASR system (i.e. acoustic model, lexicon and language model). Two major drawbacks have, however, been identified with the training strategy described above:
Exposure bias: The seq2seq training uses teacher forcing, where each output character is conditioned on the previous true character. During testing, however, the model needs to rely on its own previous predictions. This mismatch between training and testing leads to poor generalization and is referred to as exposure bias [5, 6].
Error criterion mismatch: Another potential issue is the mismatch in error criterion between training and testing [7, 8]. ASR uses character error rate (CER) or word error rate (WER) to evaluate the decoded output, whereas the training objective is conditional maximum likelihood (cross-entropy), which maximizes the probability of the correct sequence.
In this work, we first experiment with training objectives that better match the CER or WER metric, namely minimum Bayes risk (MBR) and softmax margin. We show that such a choice of training objective makes the teacher-forcing strategy unnecessary and therefore effectively addresses both of the aforementioned problems.
Both the MBR and softmax margin objectives need to consider alternative sequences (hypotheses) besides the ground truth sequence. Unfortunately, the seq2seq model does not make Markov assumptions, and the alternative sequences cannot be efficiently represented with a lattice. Instead, we perform beam search to generate an (approximate) N-best list of alternative sequences. However, with the limited capacity of the N-best representation, some of the important hypotheses (i.e. sequences with a low error rate) can easily be pruned out by the beam search, which might result in less effective training. To address this problem, we propose a new training strategy, which we call promising accurate prefix boosting (PAPB). The beam search keeps a list of promising prefixes (partial sequences) of the output sequence, which get extended by one character at each decoding iteration. In each iteration, we update the parameters of the seq2seq model to boost the probabilities of those promising prefixes that are also accurate (i.e. partial sequences with a low edit distance to the partial ground truth). This is accomplished by using the softmax margin objective (and updates) not only for the whole sequences, but also for all the partial sequences obtained during decoding.
There are existing works addressing the exposure bias or the error criterion mismatch problem with seq2seq models applied to natural language processing (NLP) problems. For example, scheduled sampling and SEARN handle the exposure bias by choosing either the model predictions or the true labels as the feedback to the decoder. The error criterion mismatch is handled using task loss minimization based on an edit distance, RNN transducer based expected error rate minimization, and a minimum risk criterion based recurrent neural aligner. Few works consider both problems simultaneously, learning from characters sampled from the model using reinforcement learning and the actor-critic algorithm. Our work is mostly inspired by beam search optimization (BSO), where a max-margin loss is used as the sequence-level objective for the machine translation task. All the mentioned works were applied to NLP problems, while the focus of this work is ASR. Also, none of these works considered the prefixes (partial sequences) during training. A recent work on seq2seq based ASR trained the model with an MBR objective using N-best hypotheses obtained from a beam search. However, this work also did not consider the prefixes. Finally, the optimal completion distillation technique focuses on prefix learning, but it uses complex learning methods such as policy distillation and imitation learning.
With the attention based encoder-decoder architecture, the encoder neural network provides an internal representation of an input sequence, with one vector for each frame of the utterance. In this work, the encoder is a recurrent network with bi-directional long short-term memory (BLSTM) layers [17, 18]. To predict each output symbol, the attention component takes the internal sequence and the previous hidden state of the decoder as the input and produces per-frame attention weights. In this work, we use location aware attention. The attention weights are expected to have high values for the frames that we need to pay attention to for predicting the current output and are typically normalized to sum up to one over frames. Using such weights, the weighted average of the internal sequence serves as an attention summary vector.
The decoder is a recurrent network with LSTM layers, which receives the attention summary vector along with the previously predicted output character (e.g. its one-hot encoding) as the input and estimates the hidden state vector (3). This vector is further subject to an affine transformation (LinB) (4) and a Softmax non-linearity (5) to obtain the probabilities for the current output symbol.
The probability of a whole sequence (6) is the product of these per-symbol probabilities over all decoding iterations.
To decode the output sequence, a simple greedy search can be performed, where the most likely symbol is chosen according to (5) in each decoding iteration until the dedicated end-of-sentence symbol is decoded. This procedure, however, does not guarantee finding the most likely sequence according to (6). Exploring multiple hypotheses with beam search usually provides better results. Note, however, that each partial hypothesis in the beam search has its own hidden state (3), as it depends on the previously predicted symbols in that hypothesis.
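As a rough illustration of one decoding iteration, the following NumPy sketch normalizes per-frame scores into attention weights, forms the attention summary vector as a weighted average of the encoder outputs, and applies an affine transformation followed by softmax. The function names, the random toy values and the `tanh` stand-in for the LSTM decoder are assumptions made for this example; location aware attention itself is not implemented here.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def attention_summary(h, frame_scores):
    """Turn per-frame scores into weights that sum to one over frames
    and return the weighted average of the encoder outputs."""
    a = softmax(frame_scores)   # shape (T,)
    c = a @ h                   # shape (D,), attention summary vector
    return a, c

def output_distribution(q, W, b):
    """Affine transformation (the LinB step) followed by softmax."""
    s = W @ q + b               # pre-softmax scores, one per output symbol
    return s, softmax(s)

# Toy sizes (hypothetical): T frames, D-dimensional states, V output symbols.
rng = np.random.default_rng(0)
T, D, V = 5, 4, 3
h = rng.normal(size=(T, D))            # encoder outputs
frame_scores = rng.normal(size=T)      # would come from the attention network
a, c = attention_summary(h, frame_scores)

q = np.tanh(c)                         # stand-in for the LSTM decoder state
W, b = rng.normal(size=(V, D)), np.zeros(V)
s, p = output_distribution(q, W, b)

greedy_symbol = int(np.argmax(p))      # greedy choice for this iteration
```

In a real decoder the scores would depend on the previous decoder state and the previously predicted character, which is why every partial hypothesis in a beam carries its own state.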
During training, model parameters are typically updated to minimize the cross-entropy (CE) loss for the correct output sequence, i.e. the negative log-probability of the ground truth sequence according to (6).
This is particularly easy with teacher forcing, where the symbol from the ground truth sequence is always used in (3) as the previously predicted symbol and, therefore, no alternative hypotheses need to be considered.
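Written out under teacher forcing, the CE loss decomposes into a sum over output positions. The notation below is assumed for illustration and is not taken from the original: y* is the ground truth sequence with L symbols, y*_n its n-th symbol, and X the input feature sequence.

```latex
\mathcal{L}_{\mathrm{CE}}
  = -\log p\left(y^{*} \mid X\right)
  = -\sum_{n=1}^{L} \log p\left(y^{*}_{n} \mid y^{*}_{1:n-1},\, X\right)
```

Each term conditions only on ground truth history, so the loss can be computed in a single pass without any search over alternative hypotheses.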
3 Training criterion
We compare our proposed PAPB approach with two other objective functions that serve as our baselines. Namely, we use the minimum Bayes risk criterion and the softmax margin loss, which both perform sequence level discriminative training. Both objectives need to estimate the character error rate (CER) for alternative hypotheses, which are explored using beam search. In the following equations, the error of a hypothesis is measured by the edit distance between the ground truth sequence and the hypothesized sequence.
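For reference, the edit distance used by both objectives can be computed with the standard dynamic program. The sketch below is a generic Levenshtein implementation with unit costs; the function names `edit_distance` and `cer` are chosen for this example and are not taken from the original.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (insertion, deletion and substitution each cost 1)."""
    prev = list(range(len(hyp) + 1))          # distances for an empty ref
    for i, r in enumerate(ref, start=1):
        curr = [i]                            # hyp prefix of length 0
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)
```

Because the function accepts any sequence, it can be applied to characters (CER) or to lists of words (WER) without modification.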
3.1 Minimum Bayes risk (MBR)
The MBR objective minimizes the expected edit distance to the ground truth over the hypotheses, weighted by their probabilities under the model. In practice, the total set of hypotheses generated is reduced to the N-best hypotheses for computational efficiency. The MBR training objective effectively performs sequence level discriminative training in ASR and provides substantial gains when used as a secondary objective after cross-entropy based optimization.
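A minimal sketch of the N-best approximation follows. It assumes that the hypothesis probabilities are renormalized over the N-best list, which is a common choice but is not stated in the text above; `nbest_expected_risk` and the toy values are hypothetical.

```python
import math

def nbest_expected_risk(log_probs, errors):
    """Expected error over an N-best list.

    log_probs: model log-probability of each hypothesis.
    errors:    edit distance of each hypothesis to the ground truth.
    Probabilities are renormalized to sum to one within the list.
    """
    m = max(log_probs)                             # for numerical stability
    weights = [math.exp(lp - m) for lp in log_probs]
    total = sum(weights)
    return sum(w / total * e for w, e in zip(weights, errors))

# Toy 3-best list: the most probable hypothesis has the most errors.
log_probs = [math.log(0.5), math.log(0.3), math.log(0.2)]
errors = [2.0, 0.0, 1.0]
risk = nbest_expected_risk(log_probs, errors)
```

Lowering this quantity shifts probability mass from hypotheses with many errors towards hypotheses with few errors, without singling out one target sequence.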
3.2 Softmax margin loss (SM)
Here, the loss uses a tunable margin factor, and the un-normalized score of a chosen sequence (10) is the sum of the pre-softmax outputs from (4). Note the dependence of the scores on the chosen hypothesis through the predictions fed back to the decoder in (3), which is not explicitly denoted in our notation. The loss aims to boost the score of the true sequence so that it stays above the scores of the other hypotheses by a margin defined by the CER of the alternative hypotheses.
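The sketch below shows one common form of the softmax margin loss, in which the cost of each hypothesis is added to its score inside a log-sum-exp. The exact form used in the original is not recoverable from the text, so the function `softmax_margin_loss`, its argument names and the default `alpha` are assumptions for illustration.

```python
import math

def softmax_margin_loss(true_score, hyp_scores, hyp_costs, alpha=1.0):
    """-s(y*) + log sum_y exp(s(y) + alpha * cost(y, y*)).

    true_score: un-normalized score of the true sequence.
    hyp_scores: un-normalized scores of the hypotheses in the N-best list.
    hyp_costs:  error (e.g. edit distance or CER) of each hypothesis.
    alpha:      margin factor scaling the cost.
    """
    augmented = [s + alpha * c for s, c in zip(hyp_scores, hyp_costs)]
    m = max(augmented)                             # stable log-sum-exp
    log_sum = m + math.log(sum(math.exp(a - m) for a in augmented))
    return -true_score + log_sum

# Toy example: the true sequence is in the list with cost 0,
# and one competitor has a lower score but one error.
loss = softmax_margin_loss(2.0, [2.0, 1.0], [0.0, 1.0], alpha=1.0)
```

A hypothesis with many errors is penalized more strongly, so the true sequence has to outscore it by a larger margin before the loss becomes small.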
4 Promising accurate prefix boosting (PAPB)
In PAPB, we perform training at the prefix level in a similar fashion to decoding, by incorporating an appropriate training objective into beam search. The primary motivation for prefix level training is that seq2seq models predict a sequence character by character. MBR aims to improve the score of a completed hypothesis with fewer errors, but such a hypothesis might get pruned out during the beam search. In our approach, however, the model is exposed to all prefixes obtained from the N-best list generated by beam search and is optimized to maximize the score of the true hypothesis. In brief, we consider not only the fully completed hypotheses, but also their prefixes, so that the promising prefixes with low error keep scoring high and are therefore likely to survive the pruning. The loss (11) is computed for each prefix by modifying the softmax margin loss.
Here, the alternative hypotheses are taken from the N-best set obtained using beam search. In equation (11), the prefix scores of the predicted hypotheses and of the true hypothesis are computed by summing the scores given by (4) from the first character up to the current prefix length, whereas in the standard sequence objective the summation is performed only across a whole sequence, as noted in (10). The contributions of our proposed approach are as follows:
The output scores (as in (4)) are computed for each character conditioned on the previous character from the corresponding explored hypothesis (i.e. no teacher forcing is used).
In our experiments, we select the hypothesis from the N-best list that obtains the lowest CER as the pseudo-true hypothesis to compute the true score, instead of using the true hypothesis. This is to avoid harmful effects during model training caused by abruptly including the true sequence in the beam, where it might have a very small score. Replacing the true hypothesis with a pseudo-true hypothesis makes our objective analogous to the MBR criterion, where very unlikely hypotheses do not affect the model parameter updates.
The error term is calculated using the edit distance between the hypothesis prefix and the ground truth prefix. Here, the number of characters is kept equal between the true prefix and the prefix hypothesis, which, according to our assumption, should contribute to a reduction of insertion and deletion errors.
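The three points above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: it takes prefixes of the final N-best hypotheses rather than the prefixes kept in the beam at each iteration, it reuses the whole hypothesis once a prefix length exceeds its length, and the names `prefix_boosting_loss`, `step_scores` and `alpha` are hypothetical.

```python
import math

def edit_distance(ref, hyp):
    """Levenshtein distance with unit costs."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (0 if r == h else 1)))
        prev = curr
    return prev[-1]

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def prefix_boosting_loss(hyps, step_scores, truth, alpha=1.0):
    """Softmax margin loss accumulated over all prefix lengths.

    hyps:        N-best hypotheses (sequences of characters).
    step_scores: for each hypothesis, the pre-softmax score of each of its
                 characters, computed without teacher forcing.
    truth:       ground truth sequence, used only to measure errors.
    """
    # Pseudo-true hypothesis: the N-best entry with the lowest edit distance.
    best = min(range(len(hyps)), key=lambda i: edit_distance(truth, hyps[i]))
    loss = 0.0
    for n in range(1, max(len(h) for h in hyps) + 1):
        augmented = []
        for hyp, scores in zip(hyps, step_scores):
            k = min(n, len(hyp))                       # finished hyps stay whole
            prefix_score = sum(scores[:k])
            cost = edit_distance(truth[:n], hyp[:k])   # equal-length prefixes
            augmented.append(prefix_score + alpha * cost)
        k_best = min(n, len(hyps[best]))
        true_prefix_score = sum(step_scores[best][:k_best])
        loss += -true_prefix_score + logsumexp(augmented)
    return loss

hyps = ["abc", "abd"]
step_scores = [[1.0, 0.5, 0.2], [1.0, 0.5, 0.9]]
loss = prefix_boosting_loss(hyps, step_scores, truth="abc")
```

In this toy case both hypotheses share the error-free prefix "ab", so only the last prefix length penalizes "abd", which has the higher raw score but one character error.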
5 Experimental setup
Database details: The Voxforge-Italian and Wall Street Journal (WSJ) corpora were used for our experimental analysis. Voxforge-Italian is a broadband speech corpus (16 hours) and is split 80%, 10% and 10% into training, development and evaluation sets, ensuring that no sentence is repeated across the sets. WSJ, with 284 speakers and 37,416 utterances (82 hours of speech), is used for training, and the eval92 test set is used for decoding.
Training: 83-dimensional filter-bank features (80 Mel filter-bank coefficients plus 3 pitch features) are used as input. In this work, the encoder-decoder model is aligned and trained using the attention based approach. Location aware attention is used in our experiments. For the WSJ experiments, the encoder comprises 3 bi-directional LSTM layers [18, 17], each with 1024 units, and the decoder comprises 2 (uni-directional) LSTM layers with 1024 units. For the VoxForge experiments, the encoder comprises 3 bi-directional LSTM layers with 320 units and the decoder contains one LSTM layer with 320 units. The CE training is optimized using the AdaDelta optimizer with an initial learning rate set to
. The training batch size is 30 and the number of training epochs is 20. The learning rate decay is based on the validation performance computed using the character error rate (minimum edit distance). ESPnet is used to implement and execute all our experiments. The MBR, softmax margin and prefix training configurations use an initial learning rate , the number of training epochs is set to 10 and the batch size to 10. The beam size for training and testing is set to 10. The model weights are initialized from the pre-trained CE model. The rest of the configuration is kept the same as for CE training. In our experiments, we use a modified MBR objective,
which is a weighted combination of the original MBR objective (8) and the CE objective (12). Adding the CE objective is analogous to f-smoothing and provides gains when applied to seq2seq models. Similarly, we also use a modified prefix boosting objective,
where the CE objective weight is empirically set to for both MBR (also noted in ) and prefix training experiments. Altering this weight did not show much difference in performance.
External language model for WSJ: Besides the internal language model learned by the decoder, we have experimented with an external RNN language model (RNNLM), which is trained on the text used for seq2seq training along with additional text data amounting to 1.6 million utterances from the WSJ corpus. Both character-level and word-level language models are used in our experiments. The vocabulary size is 50 for the character LM and 65k for the word LM. The word-level RNNLM is trained using 1 recurrent layer of 1000 units, a batch size of 300 and the SGD optimizer. The character-level RNNLM contains 2 recurrent layers of 650 units and is trained with a batch size of 1024 and the Adam optimizer.
6 Results and discussion
We started our investigation with the Voxforge-Italian dataset and later tested our method on WSJ.
6.1 Comparison with scheduled sampling (SS)
Our best performing CE baseline model uses 50% SS (i.e. 50% true labels), as shown in the first row of Table 1. The SS-50% model is compared with SS-0% (0% true labels) to investigate the impact of feeding only model predictions. The second row shows that the WER of SS-0% degrades by 8.2% on the test set and by 5% on the dev set compared to the SS-50% model.
| Model | Test WER (%) | Dev WER (%) |
|---|---|---|
| SS-50% (from random init.) | 50.9 | 52.3 |
| SS-0% (from random init.) | 59.1 | 57.3 |
| SS-0% (fine-tuned from SS-50%) | 52.9 | 52.5 |
The SS-0% model re-trained with weights initialized from the SS-50% model (which acts as a prior) still resulted in performance degradation, but the gap was reduced to 2.0% on the test set and 2.2% on the dev set. These results highlight the limitation of using scheduled sampling with 0% true labels (or 100% model predictions), as it leads to a loss in recognition performance. Thus, a specific objective that can train only with model predictions is needed, which justifies the focus of this paper.
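The SS-50% and SS-0% settings differ only in how often the ground truth character is fed back to the decoder. A minimal sketch of that choice is given below; `choose_feedback` and `true_label_prob` are names assumed for this example, with `true_label_prob=0.5` corresponding to SS-50% and `true_label_prob=0.0` to SS-0%.

```python
import random

def choose_feedback(true_prev, predicted_prev, true_label_prob, rng):
    """Pick the character fed back to the decoder for the next step.

    With probability true_label_prob the ground truth character is used
    (teacher forcing); otherwise the model's own prediction is used.
    """
    if rng.random() < true_label_prob:
        return true_prev
    return predicted_prev

rng = random.Random(0)
truth = "ciao"
predictions = "cjao"   # hypothetical model outputs with one error

# SS-50%: roughly half of the fed-back characters come from the ground truth.
fed_back = [choose_feedback(t, p, 0.5, rng) for t, p in zip(truth, predictions)]
```

With `true_label_prob=0.0` the decoder never sees the ground truth history, which matches the test condition but, as the results above show, is hard to train with a CE objective alone.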
6.2 Comparison of PAPB with MBR
Table 1 shows that the MBR and softmax margin loss objectives perform comparably.
While MBR and softmax margin loss provide considerable gains over scheduled sampling, they do not consider the prefixes (partial sequences generated during beam search) for training. In the following experiment, we show that the performance of PAPB justifies our intuition to use prefix information: compared to the MBR objective, it improves the WER from 49.9% to 47.4% on the test set and from 50.8% to 47.7% on the dev set. Compared to softmax margin loss, PAPB shows an improvement of 2.7% WER on both the test and dev sets. Figure 1 shows a similar effect during training, where PAPB attains a better CER than the MBR and softmax margin objectives.
6.3 Effect of varying beam-size during training and testing
Further analysis of the prefix training method is performed to understand the impact of the beam size used during training and testing. The beam size decides the number of hypotheses retained during beam search and is denoted as N-best. The results obtained by varying this hyper-parameter showcase the importance of using multiple hypotheses in the loss objective. Table 2 shows the effect of retaining the N best paths (N = 2, 5 and 10) during training and testing. A noticeable pattern observed in our experiments is that initially increasing the beam size led to a significant improvement in performance, while a further increase in beam size did not provide considerable gains.
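To make the role of the beam size concrete, the toy sketch below runs beam search over a hand-written conditional distribution. The table `TOY`, the `EOS` token and the function names are invented for illustration and have no connection to the experiments; a beam size of 1 corresponds to greedy search.

```python
import math

EOS = "</s>"

# Hypothetical conditional distribution p(next symbol | prefix).
TOY = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"x": 0.9, "y": 0.1},
}

def next_log_probs(prefix):
    """Log-probabilities of the next symbol; unknown prefixes end the sequence."""
    dist = TOY.get(tuple(prefix), {EOS: 1.0})
    return {sym: math.log(p) for sym, p in dist.items()}

def beam_search(beam_size, max_len=10):
    """Return finished hypotheses as (symbols, log-probability), best first."""
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for sym, lp in next_log_probs(prefix).items():
                extended = (prefix + (sym,), score + lp)
                (finished if sym == EOS else candidates).append(extended)
        if not candidates:
            break
        # Keep only the beam_size most probable prefixes (pruning).
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return sorted(finished, key=lambda c: c[1], reverse=True)

greedy_best = beam_search(beam_size=1)[0]
beam_best = beam_search(beam_size=2)[0]
```

Here greedy search commits to "a" and ends with a sequence of probability 0.3, whereas a beam of size 2 keeps the prefix "b" alive and finds the sequence "b x" with probability 0.36. The same pruning is what can remove low-error prefixes from the N-best list during training.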
6.4 Results on WSJ
The results in Table 3 show the benefit of using a character-level or word-level RNNLM over using no RNNLM.
For decoding with an RNNLM, we use the recently introduced look-ahead word LM decoding procedure to integrate the word-based RNNLM, while the character RNNLM is integrated following a previously published procedure. The LM weight is optimized to show the impact of the language model across the CE, MBR and our proposed PAPB models. Table 3 also shows that PAPB and MBR have a complementary effect when combined with both the word and character LMs.
In this paper, we proposed PAPB, a strategy to train on N-best partial sequences generated using beam search. This method suggests that improving the hypotheses at the prefix level yields better model predictions, which in turn refine the feedback used to predict the next character. The softmax margin loss function is adopted in our approach to serve this purpose. The experimental results show the efficacy of the proposed approach compared to the CE and MBR objectives, with consistent gains across two datasets. PAPB also has a drawback in terms of training cost, as it consumes 20% more training time than CE training. This work can be further extended to use complete lattices instead of N-best lists, exploiting the capabilities of GPUs to reduce the computation time. Also, a modified MBR training objective can be used in place of the softmax margin objective to learn prefixes.
[1] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems, pp. 3104–3112, 2014.
[2] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[3] R. J. Williams and D. Zipser, “A learning algorithm for continually running fully recurrent neural networks,” Neural Computation, vol. 1, no. 2, pp. 270–280, 1989.
[4] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio, “End-to-end continuous speech recognition using attention-based recurrent NN: first results,” arXiv preprint arXiv:1412.1602, 2014.
[5] H. Daumé III, J. Langford, and D. Marcu, “Search-based structured prediction,” CoRR, vol. abs/0907.0786, 2009.
[6] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sampling for sequence prediction with recurrent neural networks,” in Advances in Neural Information Processing Systems, pp. 1171–1179, 2015.
[7] H. Sak, M. Shannon, K. Rao, and F. Beaufays, “Recurrent neural aligner: An encoder-decoder neural network model for sequence to sequence mapping,” in Proc. Interspeech, pp. 1298–1302, 2017.
[8] R. Prabhavalkar, T. N. Sainath, Y. Wu, P. Nguyen, Z. Chen, C.-C. Chiu, and A. Kannan, “Minimum word error rate training for attention-based sequence-to-sequence models,” in ICASSP, pp. 4839–4843, IEEE, 2018.
[9] K. Gimpel and N. A. Smith, “Softmax-margin training for structured log-linear models,” 2010.
[10] D. Bahdanau, D. Serdyuk, P. Brakel, N. R. Ke, J. Chorowski, A. Courville, and Y. Bengio, “Task loss estimation for sequence prediction,” arXiv preprint arXiv:1511.06456, 2015.
[11] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in ICML, vol. 14, pp. 1764–1772, 2014.
[12] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba, “Sequence level training with recurrent neural networks,” arXiv preprint arXiv:1511.06732, 2015.
[13] D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio, “An actor-critic algorithm for sequence prediction,” arXiv preprint arXiv:1607.07086, 2016.
[14] S. Wiseman and A. M. Rush, “Sequence-to-sequence learning as beam-search optimization,” arXiv preprint arXiv:1606.02960, 2016.
[15] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, “Large margin methods for structured and interdependent output variables,” Journal of Machine Learning Research, vol. 6, no. Sep, pp. 1453–1484, 2005.
[16] S. Sabour, W. Chan, and M. Norouzi, “Optimal completion distillation for sequence learning,” arXiv preprint arXiv:1810.01398, 2018.
[17] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[18] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[19] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, “End-to-end attention-based large vocabulary speech recognition,” in ICASSP, pp. 4945–4949, IEEE, 2016.
[20] D. Povey and P. C. Woodland, “Minimum phone error and i-smoothing for improved discriminative training,” in ICASSP, vol. 1, pp. I–105, IEEE, 2002.
[21] D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon, and K. Visweswariah, “Boosted MMI for model and feature-space discriminative training,” in ICASSP, pp. 4057–4060, IEEE, 2008.
[22] K. Veselỳ, A. Ghoshal, L. Burget, and D. Povey, “Sequence-discriminative training of deep neural networks,” in Proc. Interspeech, pp. 2345–2349, 2013.
[23] H. Su, G. Li, D. Yu, and F. Seide, “Error back propagation for sequence training of context-dependent deep networks for conversational speech transcription,” in ICASSP, pp. 6664–6668, IEEE, 2013.
[24] C. H. Lampert, “Maximum margin multi-label structured prediction,” in Advances in Neural Information Processing Systems 24 (J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, eds.), pp. 289–297, Curran Associates, Inc., 2011.
[25] Voxforge.org, “Free speech recognition.” http://www.voxforge.org/. Accessed: 2014-06-25.
[26] D. B. Paul and J. M. Baker, “The design for the Wall Street Journal-based CSR corpus,” in Proc. of the Workshop on Speech and Natural Language, pp. 357–362, Association for Computational Linguistics, 1992.
[27] M. D. Zeiler, “ADADELTA: an adaptive learning rate method,” CoRR, vol. abs/1212.5701, 2012.
[28] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen, et al., “ESPnet: End-to-end speech processing toolkit,” arXiv preprint arXiv:1804.00015, 2018.
[29] T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, and S. Khudanpur, “Recurrent neural network based language model,” in Eleventh Annual Conference of the International Speech Communication Association, 2010.
[30] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.
[31] T. Hori, J. Cho, and S. Watanabe, “End-to-end speech recognition with word-based RNN language models,” arXiv preprint arXiv:1808.02608, 2018.
[32] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, “Hybrid CTC/attention architecture for end-to-end speech recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
[33] Y. Zhang, W. Chan, and N. Jaitly, “Very deep convolutional networks for end-to-end speech recognition,” in ICASSP, pp. 4845–4849, IEEE, 2017.
[34] H. Hadian, H. Sameti, D. Povey, and S. Khudanpur, “End-to-end speech recognition using lattice-free MMI,” in Proc. Interspeech, pp. 12–16, 2018.