Improving Readability for Automatic Speech Recognition Transcription

04/09/2020 ∙ by Junwei Liao, et al. ∙ Microsoft

Modern Automatic Speech Recognition (ASR) systems can achieve high performance in terms of recognition accuracy. However, a perfectly accurate transcript can still be challenging to read due to grammatical errors, disfluency, and other errata common in spoken communication. Many downstream tasks and human readers rely on the output of the ASR system; therefore, errors introduced by the speaker and ASR system alike will be propagated to the next task in the pipeline. In this work, we propose a novel NLP task called ASR post-processing for readability (APR) that aims to transform the noisy ASR output into readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker. In addition, we describe a method to address the lack of task-specific data by synthesizing examples for the APR task using the datasets collected for Grammatical Error Correction (GEC) followed by text-to-speech (TTS) and ASR. Furthermore, we propose metrics borrowed from similar tasks to evaluate performance on the APR task. We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method. Our results suggest that fine-tuned models significantly improve performance on the APR task, hinting at the potential benefits of using APR systems. We hope that the read, understand, and rewrite approach of our work can serve as a basis that many NLP tasks and human readers can benefit from.




1 Introduction

With the rapid development of speech-to-text technologies, ASR systems have achieved high recognition accuracy, even beating the performance of professional human transcribers on conversational telephone speech in terms of Word Error Rate (WER) (Xiong et al., 2018).

Automatic speech recognition systems bring convenience to users in many scenarios. However, colloquial speech is fraught with syntactic and grammatical errors, disfluency, informal words, and other noise that make it difficult to understand. While ASR systems do a great job of recognizing which words are said, their verbatim transcriptions create many problems for modern applications that must comprehend the meaning and intent of what is said. Applications such as automatic subtitle generation and meeting minutes generation require automatic speech transcription that is highly readable for humans, while machine translation, dialogue systems, voice search, voice question answering, and many other applications require highly readable transcriptions to generate the best machine response. The existence of these defects in speech transcription will significantly harm the experience of application users if the system cannot handle them well.

Inspired by the latest progress in natural language generation (NLG), grammatical error correction (GEC), machine translation, and transfer learning, we explore the idea of “understanding then rewriting” as a new ASR post-processing concept to provide conversion from raw ASR transcripts to error-free and highly readable text.

We propose ASR post-processing for readability (APR), which aims to transform the ASR output into a readable text for humans and downstream NLP tasks. Readability in this context refers to having proper segmentation, capitalization, fluency, and grammar, as well as properly formatted dates, times, and other numerical entities. Post-processing can be treated as a style transfer, converting informal speech to formal written language.

Due to the lack of relevant data, we constructed a dataset for the APR task using a GEC dataset as seed data. The GEC dataset is composed of pairs of grammatically incorrect sentences and corresponding sentences corrected by a human annotator. First, we used a text-to-speech (TTS) system to convert the ungrammatical sentences to speech. Then, we used an ASR system to transcribe the TTS output. Finally, we used the output of the ASR system and the original grammatical sentences to create the data pairs. By this means, we produced 1.1 million APR samples that are used for training and testing.

We investigated three mainstream Transformer-based sequence-to-sequence neural network architectures for the APR task. Specifically, we investigated MASS (Song et al., 2019), UniLM (Dong et al., 2019), and RoBERTa (Liu et al., 2019), which are pre-trained models used for NLG and/or NLU tasks. We also attempted to leverage the advantages of both RoBERTa and UniLM by adapting the RoBERTa pre-trained model toward generation using a modified UniLM training approach (RoBERTa-UniLM).

We used several metrics to evaluate the four fine-tuned models on our APR dataset: readability-aware WER (RA-WER), BLEU, MaxMatch (M²), and GLEU. The results show that the fine-tuned models significantly outperform the baseline method in terms of readability.

Our main contributions can be summarized as follows:

  • We propose a novel task: ASR post-processing for readability (APR). It aims to solve the shortcomings of the traditional post-processing concept/methods by jointly performing error correction and readability improvements in one step.

  • We describe a method to construct a dataset for the APR task.

  • We experiment using state-of-the-art pre-trained models on the proposed APR dataset and achieve significant improvement on all metrics.

  • We adapt RoBERTa as a generative model trained in the style of UniLM, which shows benefits on some metrics such as M² and BLEU.

2 Related Work

2.1 Automatic Speech Recognition (ASR)

Traditional ASR systems take a pipelined approach (Paulik et al., 2008; Cho et al., 2012, 2015; Batista et al., 2008; Gravano et al., 2009; Škodová et al., 2012), relying on post-processing modules to improve the readability of the output text in two critical ways. First, a more robust language model is used to reduce word recognition errors via a second-pass rescoring of the output lattice or top recognition candidates. Then other sub-processes modify the sentence display format for readability using a series of steps, adding capitalization and punctuation, correcting simple grammatical errors, and formatting dates, times, and other numerical entities. We call these steps inverse text normalization (ITN). Originally, researchers mainly exploited handcrafted rules or statistical methods (Shugrina, 2010; Anantaram et al., 2016; Bohac et al., 2012; Liyanapathirana and Popescu-Belis, 2016; Shivakumar et al., 2019; Cucu et al., 2013; Bassil and Alwani, 2012) for post-processing. Recently, Guo et al. (2019) trained an LSTM-based sequence-to-sequence model to correct spelling errors. Hrinchuk et al. (2019) investigated the use of Transformer-based architectures for the correction of ASR output into grammatically and semantically correct forms.
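The rule-based ITN flavor described above can be illustrated with a toy sketch. The number vocabulary and rules below are invented for this example and cover only simple cases; they are not the production system's rules.

```python
# Toy inverse text normalization (ITN): spelled-out numbers -> digits,
# plus sentence-initial capitalization. Vocabulary and rules are
# illustrative only; real ITN systems are far richer.

NUM = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "twenty": 20, "eighty": 80,
    "hundred": 100, "thousand": 1000,
}

def spelled_to_int(words):
    """Convert a run of number words like 'two thousand and twelve'."""
    total = current = 0
    for w in words:
        if w == "and":
            continue
        v = NUM[w]
        if v in (100, 1000):
            current = max(current, 1) * v
            if v == 1000:
                total, current = total + current, 0
        else:
            current += v
    return total + current

def simple_itn(text):
    """Rewrite maximal runs of number words and capitalize the start."""
    tokens, out, i = text.split(), [], 0
    while i < len(tokens):
        if tokens[i] in NUM:
            run = []
            while i < len(tokens) and (
                tokens[i] in NUM
                or (tokens[i] == "and" and i + 1 < len(tokens)
                    and tokens[i + 1] in NUM)
            ):
                run.append(tokens[i])
                i += 1
            out.append(str(spelled_to_int(run)))
        else:
            out.append(tokens[i])
            i += 1
    sent = " ".join(out)
    return sent[:1].upper() + sent[1:]
```

Note how quickly such rules break down (tokens with attached punctuation, ordinals, dates), which is exactly the brittleness the pipelined approach suffers from.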

Traditional ASR post-processing methods offer improvements in readability; however, they have two important shortcomings. (1) Since the whole process is divided into several sub-processes, mistakes in earlier steps accumulate. For example, in the sentence "Mary had a little lamb. Its fleece was white as snow.", if the punctuation step adds a period '.' after the word "had," the rule-based capitalization will capitalize the word "a." (2) The traditional methods tend to transcribe the speech verbatim while ignoring the readability of the output text. They cannot detect and correct disfluency in spontaneous speech transcripts. For example, in an utterance such as "I want a flight ticket to Boston, uh, I mean to Denver on Friday", the speaker means to communicate "I want a flight ticket to Denver on Friday." The segment "to Boston, uh, I mean" in the transcript is not useful for interpreting the intent of the sentence and hinders human readability and the performance of many downstream tasks. Traditional methods optimized for recognition accuracy will keep these words, increasing the cognitive load of the reader.
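The self-repair in the flight-ticket example is the kind of phenomenon learned APR models must handle. As a contrast, consider a naive rule-based attempt (a hypothetical sketch, not the baseline system): it only covers filler words and the explicit "X, I mean Y" pattern with a matching preposition, illustrating how brittle handcrafted patterns are.

```python
import re

# Naive disfluency removal. Real systems learn this; the toy patterns
# below cover only filler words and one explicit self-repair shape.

FILLERS = re.compile(r"\b(?:uh|um|er)\b[,.]?\s*", re.IGNORECASE)
REPAIR = re.compile(r"\b(to|in|on|at)\s+[\w ]+?,\s*I mean\s+\1\b",
                    re.IGNORECASE)

def remove_disfluency(text):
    text = FILLERS.sub("", text)      # drop fillers first: "uh, "
    text = REPAIR.sub(r"\1", text)    # then drop the repaired segment
    return re.sub(r"\s+", " ", text).strip()
```

Any repair that does not repeat the preposition, or any filler outside the short list, slips through, which motivates treating the whole problem as sequence-to-sequence rewriting.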

2.2 Natural Language Processing (NLP)

In NLP research, the most similar task to ours is automatic post-editing (APE) (Bojar et al., 2016), which has been extensively studied by the machine translation (MT) community (e.g., Pal et al., 2016, 2017; Chatterjee et al., 2017; Hokamp, 2017; Tan et al., 2017). These methods take as input the source-language text, target-language MT output, and target-language post-edits (PE) for training. To the best of our knowledge, there is no similar work in the speech recognition field.

Another similar task is Grammatical Error Correction (GEC). GEC aims to correct different kinds of errors such as spelling, punctuation, grammatical, and word choice errors (Ge et al., 2018; Zhang et al., 2019; Napoles et al., 2019, 2017; Grundkiewicz et al., 2019; Choe et al., 2019). The difference between our task and GEC is that the latter aims to correct written language, while our task aims to correct spoken language that contains noise introduced by ASR errors as well as by the discrepancy between spoken and written forms of natural language. Due to the similarity between GEC and APR, we borrow some ideas from GEC, use GEC corpora as our seed corpus to synthesize our dataset, and use GEC metrics, namely MaxMatch (M²) and GLEU, to evaluate APR performance.

2.3 Unsupervised Learning

Pre-training approaches (Dai and Le, 2015; McCann et al., 2017; Howard and Ruder, 2018) have drawn much attention recently, especially those that employ the Transformer (Vaswani et al., 2017) architecture. The most successful approaches are variants of masked language models, which are denoising autoencoders trained to reconstruct text where a random subset of the words has been masked out. Among them, BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) are single-stack Transformer encoders; GPT(-2) (Radford et al., 2018, 2019) and XLNET (Yang et al., 2019) are single-stack Transformer decoders; UniLM (Dong et al., 2019) is a single-stack Transformer serving both encoder and decoder roles; and MASS (Song et al., 2019), BART (Lewis et al., 2019), and T5 (Raffel et al., 2019) are standard Transformer-based neural machine translation architectures. We use RoBERTa, UniLM, and MASS as our base architectures and use their pre-trained models for the APR task.

Input:  She see Tom is catched by policeman in park at last night.
Output: She saw Tom caught by a policeman in the park last night.
Table 1: A GEC data sample is shown. The input is a sentence with some grammatical errors. The output is a grammatically correct sentence.

3 Proposed Method

Figure 1: The process of data synthesis is shown. The left sentence is from the GEC dataset. The right sentence pair is an APR instance. The source sentence of GEC is processed by the TTS and ASR systems, and the APR sentence pair is obtained. The target sentence of GEC remains unchanged and is used as the target sentence for the APR.

3.1 Dataset

There exists a large amount of data labeled for speech recognition. However, these data have two issues. First, they label the exact words that were spoken, including all disfluencies. This is essential for HMM and hybrid acoustic model training but can hinder readability. Second, no capitalization or punctuation is present, because spontaneous speech does not follow normal grammatical conventions. Similarly, entities, especially numerical entities, appear differently in spoken language than in written form.

Due to these restrictions, we synthesize our data, simulating ASR errors by feeding sentences from a grammatical error correction (GEC) dataset into a text-to-speech (TTS) system and then transcribing this with an ASR model.

The GEC data samples contain grammatically correct and incorrect sentence pairs. A human corrects the grammatically incorrect sentence to obtain the target grammatically correct sentence. An example sentence pair from the seed corpus is shown in Table 1. Inspired by the GEC task, we simulated ASR errors using the GEC source sentences to obtain sentence pairs whose source sentences contain both grammatical errors and ASR errors.

In the next section, we detail how we simulated the APR data. We discuss the simulated data statistics in Section 3.1.2.

3.1.1 Dataset Synthesis

We fed the grammatically incorrect sentences from the seed corpus into a neural-TTS system, which produced audio files simulating human speakers. We used 320 different speaker voices for this simulation and split them into 220 for training, 50 for validation, and 50 for evaluation. Each sentence was randomly assigned one speaker, and all speakers had the same number of input sentences (Deng et al., 2018). These audio files were then fed into the ASR system, which output the corresponding transcripts. The resulting text contains both the grammatical errors found initially in the GEC dataset and the TTS+ASR pipeline errors. We used the original corrected sentences as our target. The whole process is illustrated in Figure 1.
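The synthesis loop above can be sketched as follows. `synthesize_speech` and `transcribe` are placeholder stubs standing in for the neural TTS and ASR systems, which are not publicly specified.

```python
import random

# Sketch of the GEC -> TTS -> ASR synthesis loop described above.
# The two helpers are stubs, not real TTS/ASR implementations.

def synthesize_speech(text, speaker_id):
    return (speaker_id, text)                 # stands in for a waveform

def transcribe(audio):
    return audio[1].lower()                   # stands in for ASR output

def build_apr_pairs(gec_pairs, speakers, seed=0):
    """Each GEC source sentence is spoken by one randomly chosen voice;
    the noisy transcript is paired with the human-corrected target."""
    rng = random.Random(seed)
    pairs = []
    for source, target in gec_pairs:
        audio = synthesize_speech(source, rng.choice(speakers))
        noisy = transcribe(audio)             # grammatical + ASR errors
        pairs.append((noisy, target))
    return pairs
```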

In addition to the simulation method above, we tried using the top-k best outputs of the ASR system to augment our dataset ten-fold. However, we found that the augmented dataset is not beneficial due to the lack of diversity in the resulting sentences, which often differ only in a few characters (Section 4.1).

3.1.2 Dataset Statistics

Table 2 shows the statistics of our dataset.

Split         GEC dataset     Seed corpus sent pairs   Synthetic data sent pairs
training set  FCE             28,350                   1,100,219 (total)
              W&I+LOCNESS     34,308
              Lang-8 Corpus   1,037,561
CoNLL dev     CoNLL-2013      1,381                    1,381
CoNLL test    CoNLL-2014      1,312                    1,312
JFLEG dev     JFLEG dev       754                      754
JFLEG test    JFLEG test      747                      747
Table 2: Dataset statistics are shown. We create synthetic data from the seed corpus using the synthesis process described in Section 3.1.1. The seed corpora FCE, W&I+LOCNESS, and Lang-8 Corpus are used to synthesize the training data. Two datasets are used as evaluation data, which are evaluated by the RA-WER and BLEU metrics. Specifically, following the GEC literature, the CoNLL dataset and JFLEG dataset are evaluated by the MaxMatch and GLEU metrics, respectively.

We used the data from the datasets provided by the restricted track of the BEA 2019 shared task (Bryant et al., 2019) as our seed corpora for training. Specifically, we collected data from FCE (Yannakoudakis et al., 2011), the Lang-8 Corpus of Learner English (Mizumoto et al., 2011; Tajiri et al., 2012), and W&I+LOCNESS (Bryant et al., 2019; Granger, 2014), totaling around 1.1 million training samples.

Furthermore, we utilized the CoNLL-2014 shared task dataset (Ng et al., 2014) and the JFLEG (Napoles et al., 2017) test set as our evaluation seed corpora, to align with the GEC literature (Ge et al., 2018; Zhang et al., 2019; Kiyono et al., 2019). The CoNLL-2014 and JFLEG test sets contain 1,312 and 747 sentences, respectively.

In order to be consistent with the standard evaluation metrics in the GEC literature, we used MaxMatch (M²) (Dahlmeier and Ng, 2012) for CoNLL-2014 and GLEU (Napoles et al., 2015) for JFLEG evaluation. We used the CoNLL-2013 test set and the JFLEG dev set as our development seed corpora for the CoNLL-2014 and JFLEG test sets, respectively.

Finally, through the process described in Section 3.1.1, we obtained the APR dataset illustrated in the right part of Table 2.

3.2 Evaluation Metrics

Since our task is to improve the readability of automatic speech transcription, the word error rate (WER), a conventional metric widely used in speech recognition, is not suitable for our use case. Instead, as part of our work, we investigated the usefulness and consistency of metrics taken directly or adapted from related tasks such as speech recognition, machine translation, and grammatical error correction.

Speech Recognition Metric

First, we extended the conventional WER of speech recognition to readability-aware WER (RA-WER) by removing the text normalization before calculating the Levenshtein distance. We treated all mismatches due to grammatical mistakes and disfluency, as well as improper formatting of capitalization, punctuation, and written numerical entities, as errors. If there were alternative references, we selected the one closest to the candidate.
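A minimal sketch of RA-WER, assuming it reduces to token-level Levenshtein distance over the raw, unnormalized strings (the exact implementation details are not specified above):

```python
def ra_wer(candidate, reference):
    """Readability-aware WER: token-level Levenshtein distance over the
    raw strings with no text normalization, so casing, punctuation, and
    number formatting all count as errors."""
    c, r = candidate.split(), reference.split()
    prev = list(range(len(c) + 1))
    for i, rt in enumerate(r, 1):
        cur = [i]
        for j, ct in enumerate(c, 1):
            sub = prev[j - 1] + (rt != ct)          # substitution cost
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, sub))
        prev = cur
    return prev[-1] / max(len(r), 1)

def ra_wer_multi(candidate, references):
    """With alternative references, score against the closest one."""
    return min(ra_wer(candidate, ref) for ref in references)
```

Note that "hello world" against "Hello world." scores 1.0 here, since both the casing and the trailing period count as token mismatches, whereas normalized WER would score 0.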

Machine Translation Metric

The APR task can be treated as a translation problem from a spoken transcript to more readable written text. In this case, we can take advantage of the BiLingual Evaluation Understudy (BLEU) score (Papineni et al., 2001), which is widely used in machine translation, to measure performance on the APR task. In BLEU, the precision score is computed over variable-length n-grams with a length penalty (Papineni, 2002) and optionally with smoothing (Lin and Och, 2004).
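A minimal sentence-level BLEU sketch with add-one smoothing on the n-gram precisions and the standard brevity penalty; real evaluations typically use a reference implementation rather than hand-rolled code:

```python
import math
from collections import Counter

def _ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of add-one-smoothed n-gram
    precisions, multiplied by the brevity penalty."""
    c, r = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand, ref = _ngrams(c, n), _ngrams(r, n)
        overlap = sum((cand & ref).values())         # clipped matches
        log_prec += math.log((overlap + 1) / (sum(cand.values()) + 1))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(c) > len(r) else math.exp(1 - len(r) / max(len(c), 1))
    return bp * math.exp(log_prec / max_n)
```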

Grammatical Error Correction Metrics

Syntax and grammatical errors can significantly impact the readability of speech transcription. To evaluate the correctness and fluency of the rewritten sentences, we used the most common GEC metrics, MaxMatch (M²) and Generalized Language Evaluation Understanding (GLEU), in our work. M² reports the F-score of edits over the optimal phrasal alignment between the candidate and the reference sentences (Dahlmeier and Ng, 2012). GLEU captures fluency rewrites in addition to grammatical error corrections (Napoles et al., 2015); it extends BLEU (Papineni et al., 2001) by penalizing false negatives. Besides the candidates and references used by the other metrics, GEC metrics also consider the source sentences in order to detect the model's edits. In all experiments, we used the raw ASR transcription as the source sentence when calculating GEC metrics.

3.3 Baseline Setup

We used the production 2-step post-processing pipeline of our speech recognition system as the APR baseline, namely n-best LM rescoring followed by inverse text normalization (ITN). This pipeline works well for sequentially improving speech recognition accuracy and display format. We computed the values of the metrics between the system output of every step and the reference grammatical sentences. As a comparison, we also evaluated the original ungrammatical sentences in the same corpora. Table 3 shows these baseline results on the CoNLL-2014 and JFLEG test sets.

ID  Candidate            CoNLL2014               JFLEG Test
                         RA-WER  BLEU   M²       RA-WER  BLEU   GLEU
a   Ungrammatical        7.51    87.79  83.99    11.84   80.56  51.69
b   ASR transcription    31.15   60.11  0.00     29.50   62.42  20.71
c   (b) + ITN            22.01   70.28  65.76    19.14   72.15  38.48
d   (b) + LM rescoring   30.13   61.82  17.63    28.76   63.65  24.38
e   (d) + ITN            20.70   72.43  68.37    18.22   73.71  42.42
Table 3: Baseline results are shown. Source sentences are the raw ASR transcriptions, which are the output of the TTS and ASR pipeline obtained from the ungrammatical inputs in the CoNLL2014 and JFLEG test sets. References are the corresponding corrected sentences in the two corpora. Candidates are the outputs of each step of the baseline system. (a): the original ungrammatical sentence before data synthesis, for reference; (b): the raw ASR transcription, the same as the source; (c): ASR transcription followed by ITN; (d): ASR transcription followed by second-pass LM rescoring; (e): ASR transcription followed by LM rescoring and then ITN.

There are several interesting findings. First, although the original ungrammatical sentences in JFLEG have more errors, or are less smooth, than the ones in CoNLL2014 according to the higher RA-WER (11.84 vs. 7.51) and lower BLEU (80.56 vs. 87.79), the situation is inverted after transforming the sentences to speech and back (29.50 vs. 31.15 in RA-WER and 62.42 vs. 60.11 in BLEU). This may indicate that (1) JFLEG annotators focused more on fluency and formality of the rewriting, rather than the purely token-level error corrections of CoNLL2014, and (2) the ASR system, thanks to its powerful language model, is able to regularize input errors and make the transcription appear more fluent and formal. Second, GEC metrics are more sensitive to correct edits than the other metrics because they consider the input source sentences. Third, ITN consistently shows a much more significant impact than LM rescoring, which demonstrates the importance of display format for readability and raises the question of how to further emphasize correctness in future work.

3.4 Models

In this work, we compare different Transformer (Vaswani et al., 2017) architectures together with corresponding open-sourced pre-trained models.

3.4.1 MASS

MASS (Song et al., 2019) adopts the encoder-decoder framework to reconstruct a sentence fragment given the remaining part of the sentence. This framework is ideally suited for our task.

Following the MASS setting, we tokenized the data using the Moses toolkit and used the same BPE codes and vocabulary as MASS.

We fine-tuned the model based on weights pre-trained on English monolingual data. The model consists of a 6-layer encoder and a 6-layer decoder. The learning rate followed a linear warm-up over the first 4K updates and inverse square-root decay afterward. To fully utilize the GPU, we used dynamically sized mini-batches with 3,000 tokens per batch.

3.4.2 UniLM

UniLM (Dong et al., 2019) was pre-trained using the BERT-large (Devlin et al., 2018) architecture and three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction. The unified modeling approach allows UniLM to be used for both discriminative and generative tasks.

Following the UniLM setting, we tokenized the training data using WordPiece (Wu et al., 2016) with a vocabulary size of 28,996. The model is a 24-layer Transformer with around 340M parameters.

We fine-tuned the model for 4 epochs, with linear warmup of the learning rate over the first one-tenth of the total steps followed by linear decay. The batch size, maximum sequence length, and masking probability were set to 256, 192, and 0.7, respectively. We also used label smoothing (Szegedy et al., 2016) with a rate of 0.1.

Following standard practice, we removed duplicate trigrams in beam search and tuned the maximum output length and length penalty on the development set (Paulus et al., 2017; Fan et al., 2017).
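Duplicate-trigram blocking during beam search can be sketched as follows; the function names are illustrative, not taken from any specific decoder implementation:

```python
def has_repeated_trigram(tokens):
    """True if any trigram occurs more than once in the hypothesis.
    Used to block beam-search extensions that would repeat a trigram,
    a standard trick against degenerate repetitive output."""
    seen = set()
    for i in range(len(tokens) - 2):
        tri = tuple(tokens[i:i + 3])
        if tri in seen:
            return True
        seen.add(tri)
    return False

def allowed_extension(hyp_tokens, next_token):
    """Check a candidate next token before adding it to the beam."""
    return not has_repeated_trigram(hyp_tokens + [next_token])
```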

3.4.3 RoBERTa

RoBERTa (Liu et al., 2019) is a robustly optimized BERT (Devlin et al., 2018) pre-training approach. Both BERT and RoBERTa have a single Transformer stack and are pre-trained using only bidirectional prediction, which makes them more discriminative than generative. However, Hrinchuk et al. (2019) demonstrated the effectiveness of transfer learning from BERT to a sequence-to-sequence task by initializing both the encoder and decoder with pre-trained BERT in their speech recognition correction work.

Inspired by this work and UniLM, we applied self-attention masks to the RoBERTa model to convert it into a sequence-to-sequence generation model. To achieve whole-sentence prediction rather than only masked-position prediction, we used an autoregressive approach during fine-tuning. Another benefit of this approach is that the model can precisely predict the end of the sentence; hence, there is no need to tune the maximum output length and length penalty as in UniLM fine-tuning.
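The self-attention masking that turns a bidirectional encoder into a sequence-to-sequence model can be sketched as a binary mask (a simplified illustration of the UniLM-style scheme, ignoring special tokens):

```python
def seq2seq_attention_mask(src_len, tgt_len):
    """UniLM-style self-attention mask: source tokens attend to the
    whole source bidirectionally; target tokens attend to the full
    source plus only the preceding target tokens (autoregressive).
    mask[i][j] == 1 means position i may attend to position j."""
    n = src_len + tgt_len
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < src_len:
                mask[i][j] = 1                 # everyone sees the source
            elif i >= src_len and j <= i:
                mask[i][j] = 1                 # causal within the target
    return mask
```

In practice this mask is added (as large negative values on the zeros) to the attention logits before the softmax, so one shared Transformer stack behaves as an encoder over the source and a decoder over the target.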

Following the RoBERTa setting, the sentences were tokenized with a byte-level BPE tokenizer with a vocabulary size of 50,265. We fine-tuned the model based on the RoBERTa-large pre-trained weights.

3.4.4 RoBERTa-UniLM

Besides using UniLM and RoBERTa separately, we also experimented with leveraging the advantages of both works to further enhance the pre-trained model before fine-tuning it on the APR task. We adapted the RoBERTa-large model by training it longer on a combination of English Wikipedia, Books, and News-Crawl data, totaling 66GB of uncompressed text. The training was similar to UniLM but also included autoregressive (both left-to-right and right-to-left) prediction. We kept the next-sentence objective in the bidirectional masked LM (MLM). All predictions used whole-word masking; the first three prediction tasks also applied bigram, trigram, and phrase masking, each on about 10% of the training instances. We used the Huggingface Transformers code for LM fine-tuning. The RoBERTa-UniLM model was trained for 10 days on 64 NVIDIA DGX-2 GPUs for 7,200 steps with a batch size of 12,800, using the same learning-rate warmup and decay strategy as in UniLM fine-tuning. The APR task fine-tuning and decoding were the same as in the RoBERTa experiment.

In all fine-tuning experiments described above, checkpoints were selected on the development set, and the beam size for beam search was set to 5.

4 Results and Discussion

4.1 Dataset Selection

As described in Section 3.1.1, we constructed the APR data using TTS and ASR systems. When an audio file synthesized by TTS is fed to the ASR system, it generates multiple candidate sentences from the beam search for re-ranking. These sentences differ from the final output only in a few words. At first, we used all of these sentences as our APR training data. When training our model on this data, we found that the loss converges very quickly, usually within one-fourth of an epoch. We inspected the data and found that the top-k sentences produced by beam search lack diversity, often differing only in a few characters. To further verify this assumption, we conducted an experiment with the MASS model and training data of different sizes.

Dataset (sent pairs)   RA-WER   BLEU    M²
LARGE (18.6M)          18.96    74.90   71.05
MODERATE (16M)         17.15    76.20   71.76
SMALL (1.1M)           18.56    74.51   71.43
ORIGIN (1.1M)          24.46    67.28   51.80
Table 4: Evaluation of the MASS model fine-tuned on training datasets of different sizes is shown. The values are evaluated on the CoNLL test set. The numbers in parentheses are the approximate numbers of sentence pairs. MASS trained on the MODERATE data achieves the best scores on all metrics. MASS trained on the SMALL data gets results comparable to the highest scores with a significantly smaller dataset. ORIGIN is the original GEC sentence pairs, which are used as the seed corpus for the APR task.

Table 4 shows the results. The SMALL data includes only the best output of the beam search and has 1.1M sentence pairs. The MODERATE data contains the top-k beams obtained with the beam search and has 16M sentence pairs. LARGE is the largest dataset, which also comprises the original GEC pairs and TTS-normalized data in addition to all data in MODERATE; it has 18.6M sentence pairs. To demonstrate the difference between the GEC task and the APR task, we also used the original GEC pairs as training data, denoted ORIGIN.

In Table 4, we can see that the MODERATE data gets the best scores on all metrics, meaning that including the top-k beams indeed helped the APR task. However, using only the SMALL data, we still obtained comparable results: the additional 15M pairs yielded only a 1.69-point increase in BLEU. This result supports our assumption that the top-k beams obtained with beam search are homogeneous, which is not very beneficial for the model to learn new patterns from. Given these results, and for efficient usage of computational resources, we used the SMALL data in the remainder of our experiments. Interestingly, the LARGE data scored lower than the SMALL data; we believe the cause is that the original GEC pairs and TTS-normalized data have different patterns from the ASR output data.

The last row of Table 4 is the ORIGIN data, which has the same target sentences as the SMALL data but differs in the source sentences. The MASS model trained on GEC pairs achieved only 67.28 BLEU, much lower than any dataset with ASR output as the source. This shows that the GEC task is different from the APR task, and that APR is a new task deserving a dedicated research effort.

4.2 Model Comparison

In Table 5, we report the experimental results of the four fine-tuned models on the SMALL dataset (1.1M sentence pairs) and compare them with the baseline method.

Model             CoNLL2014                  JFLEG
                  RA-WER   BLEU    M²        RA-WER   BLEU    GLEU
Rescoring + ITN   20.70    72.43   68.37     18.22    73.71   42.42
MASS              18.56    74.51   71.43     18.37    75.37   47.64
UniLM             18.06    76.32   72.94     17.10    76.58   51.78*
RoBERTa           16.59*   77.38*  73.65     13.88*   80.34   51.21
RoBERTa-UniLM     16.62    77.24   74.83*    14.13    80.77*  51.16
Table 5: Experimental results of the baseline method and four fine-tuned models on the APR task are shown. The best value in each column is marked with *. For RA-WER, lower is better; for the other metrics, higher is better. The results are evaluated on the CoNLL2014 and JFLEG datasets, respectively. Rescoring + ITN is the baseline method.

Compared with the 2-step pipeline baseline, all the fine-tuned models achieved better scores on almost all metrics, which confirms the effectiveness of treating APR as a sequence-to-sequence task and utilizing pre-trained models. The only exception is that MASS got a higher RA-WER (18.37) than the baseline on JFLEG.

Among the four fine-tuned models, MASS had lower performance than the other three. This is reasonable since MASS has only a 6-layer encoder and a 6-layer decoder, roughly equivalent to a 12-layer BERT-base model with about 110M parameters, while the other three are all based on the 24-layer BERT-large architecture with about 340M parameters. This result confirms again that a high-capacity Transformer architecture has a positive impact on the APR task. To compare the experimental results fairly, we focus on the three BERT-large-based models in the following discussion.

The RoBERTa and RoBERTa-UniLM models achieved better scores than UniLM on all metrics except GLEU on the JFLEG test set. Our experiments demonstrate that fine-tuning downstream tasks on RoBERTa gives better performance than on BERT, which is consistent with the RoBERTa paper (Liu et al., 2019).

For the CoNLL2014 test set, the three BERT-large-based models achieved comparable scores: RoBERTa won on the RA-WER and BLEU metrics, while RoBERTa-UniLM won on M². The CoNLL2014 test set includes minimal edits, which correct the grammatical errors of a sentence but do not necessarily make it fluent or native-sounding. For the JFLEG test set, an interesting finding is that although the RoBERTa-based models make fewer errors and produce smoother output than UniLM according to the lower RA-WER (13.88 vs. 17.10) and higher BLEU (80.77 vs. 76.58), the situation is the opposite when considering fluency as evaluated by GLEU (51.78 vs. 51.21). Unlike CoNLL, JFLEG does not restrict corrections to minimal error spans; instead, it contains holistic sentence rewrites representing a diversity of potential corrections, with four corrections per sentence. Therefore, fewer errors and better alignment are not equivalent to fluency or sounding natural to a native English speaker, which means the metrics borrowed from ASR/MT/GEC may not be the best metrics for APR; we need to find more suitable metrics that better capture readability.

RoBERTa-UniLM did not show significant benefits over RoBERTa overall. One possible reason is less LM training data: RoBERTa used 160GB of text while RoBERTa-UniLM used 66GB. Another reason could be the training time: due to limited computational resources, we only updated the model for 7,200 steps, and training had not yet fully converged. However, even with limited training data and model updates, RoBERTa-UniLM demonstrated its potential on some metrics, such as M² on CoNLL2014 and BLEU on JFLEG, while its other scores were very close to RoBERTa's. This encourages us to continue improving the pre-trained model in the future.

In Table 6, we show some examples generated by RoBERTa-UniLM. In these samples, the model correctly added punctuation and capitalization, changed the number formatting, and corrected recognition errors.

Input: when we’re diagram honest out with certain genetic disease are we supposed to disclose this result to our relatives
Output: When we are diagnosed with a certain genetic disease, are we supposed to disclose this result to our relatives?
Input: this has been indicated in report done by boston university in two thousand and twelve eighty percent of information posted online is true other person may like three out of ten times in real life
Output: This has been indicated in a report done by Boston University in 2012. 80% of information posted online is true, although a person may lie three out of ten times in real life.
Input: on the other hand although in turkey the public transportation is getting developed especially in the two big cities that is an anchor in istanbul it is still nothing compared to sweden
Output: On the other hand, although in Turkey the public transportation is getting developed, especially in the two big cities, Ankara and Istanbul, it is still nothing compared to Sweden.
Input: we have a lot of paper have to write
Output: We have a lot of papers to write.
Input: in my SC i want to focus on how important is for students to learn facts
Output: In my essay, I want to focus on how important it is for students to learn facts.
Table 6: Samples generated by the RoBERTa-UniLM model. Italics indicate the modified spans.
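The modification spans marked in Table 6 can also be recovered automatically by aligning input and output at the word level. A sketch using Python's difflib (a hypothetical utility for illustration, not part of our pipeline):

```python
import difflib

def highlight_edits(src: str, tgt: str):
    """Return word-level (operation, source span, target span) edits
    between an ASR transcript and its readable rewrite."""
    src_words, tgt_words = src.split(), tgt.split()
    sm = difflib.SequenceMatcher(a=src_words, b=tgt_words)
    edits = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op != "equal":  # keep only replaced, inserted, or deleted spans
            edits.append((op,
                          " ".join(src_words[i1:i2]),
                          " ".join(tgt_words[j1:j2])))
    return edits
```

For the fourth example in Table 6, this yields a single replacement of "paper have" with "papers", i.e. the span that would be italicized.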

5 Conclusion

In this work, we proposed a new NLP task named ASR Post-processing for Readability (APR), which aims to correct grammatical errors and disfluencies in ASR output and improve its readability. We described our process for synthesizing a dataset for the APR task from GEC datasets, used as a seed corpus, by passing them through TTS and ASR systems. We borrowed metrics from similar tasks and extended WER into a readability-aware WER. We experimented with different dataset sizes and compared several models (MASS, UniLM, RoBERTa, RoBERTa-UniLM) against a traditional post-processing system. The results show that the fine-tuned models improved the readability of ASR output significantly, hinting at the potential benefits of the APR task. We hope that our findings will encourage other researchers to work on improving readability in speech transcription systems. APR is an interesting research topic that can be viewed as style transfer from informal spoken language to more formal written language.


  • C. Anantaram, S. K. Kopparapu, C. Patel, and A. Mittal (2016) Repairing general-purpose asr output to improve accuracy of spoken sentences in specific domains using artificial development approach. In IJCAI, pp. 4234–4235. Cited by: §2.1.
  • Y. Bassil and M. Alwani (2012) Post-editing error correction algorithm for speech recognition using bing spelling suggestion. arXiv preprint arXiv:1203.5255. Cited by: §2.1.
  • F. Batista, D. Caseiro, N. Mamede, and I. Trancoso (2008) Recovering capitalization and punctuation marks for automatic speech recognition: case study for portuguese broadcast news. Speech Communication 50 (10), pp. 847–862. Cited by: §2.1.
  • M. Bohac, K. Blavka, M. Kucharova, and S. Skodova (2012) Post-processing of the recognized speech for web presentation of large audio archive. In 2012 35th International Conference on Telecommunications and Signal Processing (TSP), pp. 441–445. Cited by: §2.1.
  • O. Bojar, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow, M. Huck, A. J. Yepes, P. Koehn, V. Logacheva, C. Monz, et al. (2016) Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pp. 131–198. Cited by: §2.2.
  • C. Bryant, M. Felice, Ø. E. Andersen, and T. Briscoe (2019) The bea-2019 shared task on grammatical error correction. In BEA@ACL, Cited by: §3.1.2.
  • R. Chatterjee, M. A. Farajian, M. Negri, M. Turchi, A. Srivastava, and S. Pal (2017) Multi-source neural automatic post-editing: fbk’s participation in the wmt 2017 ape shared task. In Proceedings of the Second Conference on Machine Translation, pp. 630–638. Cited by: §2.2.
  • E. Cho, J. Niehues, K. Kilgour, and A. Waibel (2015) Punctuation insertion for real-time spoken language translation. In Proceedings of the Eleventh International Workshop on Spoken Language Translation, Cited by: §2.1.
  • E. Cho, J. Niehues, and A. Waibel (2012) Segmentation and punctuation prediction in speech language translation using a monolingual translation system. In International Workshop on Spoken Language Translation (IWSLT) 2012, Cited by: §2.1.
  • Y. J. Choe, J. Ham, K. Park, and Y. Yoon (2019) A neural grammatical error correction system built on better pre-training and sequential transfer learning. arXiv preprint arXiv:1907.01256. Cited by: §2.2.
  • H. Cucu, A. Buzo, L. Besacier, and C. Burileanu (2013) Statistical error correction methods for domain-specific asr systems. In International Conference on Statistical Language and Speech Processing, pp. 83–92. Cited by: §2.1.
  • D. Dahlmeier and H. T. Ng (2012) Better evaluation for grammatical error correction. In HLT-NAACL, Cited by: §3.1.2, §3.2.
  • A. M. Dai and Q. V. Le (2015) Semi-supervised sequence learning. In Advances in neural information processing systems, pp. 3079–3087. Cited by: §2.3.
  • Y. Deng, L. He, and F. Soong (2018) Modeling multi-speaker latent space to improve neural tts: quick enrolling new speaker and enhancing premium voice. arXiv preprint arXiv:1812.05253. Cited by: §3.1.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.3, §3.4.2, §3.4.3.
  • L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H. Hon (2019) Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197. Cited by: §1, §2.3, §3.4.2.
  • A. Fan, D. Grangier, and M. Auli (2017) Controllable abstractive summarization. arXiv preprint arXiv:1711.05217. Cited by: §3.4.2.
  • T. Ge, F. Wei, and M. Zhou (2018) Reaching human-level performance in automatic grammatical error correction: an empirical study. arXiv preprint arXiv:1807.01270. Cited by: §2.2, §3.1.2.
  • S. Granger (2014) The computer learner corpus: a versatile new source of data for sla research: sylviane granger. In Learner English on Computer, pp. 25–40. Cited by: §3.1.2.
  • A. Gravano, M. Jansche, and M. Bacchiani (2009) Restoring punctuation and capitalization in transcribed speech. In 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4741–4744. Cited by: §2.1.
  • R. Grundkiewicz, M. Junczys-Dowmunt, and K. Heafield (2019) Neural grammatical error correction systems with unsupervised pre-training on synthetic data. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 252–263. Cited by: §2.2.
  • J. Guo, T. N. Sainath, and R. J. Weiss (2019) A spelling correction model for end-to-end speech recognition. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5651–5655. Cited by: §2.1.
  • C. Hokamp (2017) Ensembling factored neural machine translation models for automatic post-editing and quality estimation. arXiv preprint arXiv:1706.05083. Cited by: §2.2.
  • J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146. Cited by: §2.3.
  • O. Hrinchuk, M. Popova, and B. Ginsburg (2019) Correction of automatic speech recognition with transformer sequence-to-sequence model. arXiv preprint arXiv:1910.10697. Cited by: §2.1, §3.4.3.
  • S. Kiyono, J. Suzuki, M. Mita, T. Mizumoto, and K. Inui (2019) An empirical study of incorporating pseudo data into grammatical error correction. ArXiv abs/1909.00502. Cited by: §3.1.2.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2019) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461. Cited by: §2.3.
  • C. Lin and F. J. Och (2004) ORANGE: a method for evaluating automatic evaluation metrics for machine translation. In COLING, Cited by: §3.2.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §1, §2.3, §3.4.3, §4.2.
  • J. Liyanapathirana and A. Popescu-Belis (2016) Using the ted talks to evaluate spoken post-editing of machine translation. In Proceedings of the 10th Language Resources and Evaluation Conference (LREC), Cited by: §2.1.
  • B. McCann, J. Bradbury, C. Xiong, and R. Socher (2017) Learned in translation: contextualized word vectors. In Advances in Neural Information Processing Systems, pp. 6294–6305. Cited by: §2.3.
  • T. Mizumoto, M. Komachi, M. Nagata, and Y. Matsumoto (2011) Mining revision log of language learning sns for automated japanese error correction of second language learners. In IJCNLP, Cited by: §3.1.2.
  • C. Napoles, M. Nădejde, and J. Tetreault (2019) Enabling robust grammatical error correction in new domains: data sets, metrics, and analyses. Transactions of the Association for Computational Linguistics 7, pp. 551–566. Cited by: §2.2.
  • C. Napoles, K. Sakaguchi, M. Post, and J. R. Tetreault (2015) Ground truth for grammaticality correction metrics. In ACL, Cited by: §3.1.2, §3.2.
  • C. Napoles, K. Sakaguchi, and J. Tetreault (2017) JFLEG: a fluency corpus and benchmark for grammatical error correction. arXiv preprint arXiv:1702.04066. Cited by: §2.2, §3.1.2.
  • H. T. Ng, S. M. Wu, T. Briscoe, C. Hadiwinoto, R. H. Susanto, and C. Bryant (2014) The conll-2014 shared task on grammatical error correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pp. 1–14. Cited by: §3.1.2.
  • S. Pal, S. K. Naskar, M. Vela, Q. Liu, and J. van Genabith (2017) Neural automatic post-editing using prior alignment and reranking. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 349–355. Cited by: §2.2.
  • S. Pal, S. K. Naskar, M. Vela, and J. van Genabith (2016) A neural network based approach to automatic post-editing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 281–286. Cited by: §2.2.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2001) Bleu: a method for automatic evaluation of machine translation. In ACL, Cited by: §3.2, §3.2.
  • K. Papineni (2002) Machine translation evaluation: n-grams to the rescue. In LREC, Cited by: §3.2.
  • M. Paulik, S. Rao, I. Lane, S. Vogel, and T. Schultz (2008) Sentence segmentation and punctuation recovery for spoken language translation. In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5105–5108. Cited by: §2.1.
  • R. Paulus, C. Xiong, and R. Socher (2017) A deep reinforced model for abstractive summarization. ArXiv abs/1705.04304. Cited by: §3.4.2.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. URL https://s3-us-west-2. amazonaws. com/openai-assets/researchcovers/languageunsupervised/language understanding paper. pdf. Cited by: §2.3.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8). Cited by: §2.3.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683. Cited by: §2.3.
  • P. G. Shivakumar, H. Li, K. Knight, and P. Georgiou (2019) Learning from past mistakes: improving automatic speech recognition output via noisy-clean phrase context modeling. APSIPA Transactions on Signal and Information Processing 8. Cited by: §2.1.
  • M. Shugrina (2010) Formatting time-aligned asr transcripts for readability. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 198–206. Cited by: §2.1.
  • S. Škodová, M. Kuchařová, and L. Šeps (2012) Discretion of speech units for the text post-processing phase of automatic transcription (in the czech language). In International Conference on Text, Speech and Dialogue, pp. 446–455. Cited by: §2.1.
  • K. Song, X. Tan, T. Qin, J. Lu, and T. Liu (2019) Mass: masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450. Cited by: §1, §2.3, §3.4.1.
  • C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §3.4.2.
  • T. Tajiri, M. Komachi, and Y. Matsumoto (2012) Tense and aspect error correction for esl learners using global context. In ACL, Cited by: §3.1.2.
  • Y. Tan, Z. Chen, L. Huang, L. Zhang, M. Li, and M. Wang (2017) Neural post-editing based on quality estimation. In Proceedings of the Second Conference on Machine Translation, pp. 655–660. Cited by: §2.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, Cited by: §2.3, §3.4.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. S. Corrado, M. Hughes, and J. Dean (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. ArXiv abs/1609.08144. Cited by: §3.4.2.
  • W. Xiong, L. Wu, F. Alleva, J. Droppo, X. Huang, and A. Stolcke (2018) The microsoft 2017 conversational speech recognition system. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5934–5938. Cited by: §1.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. Cited by: §2.3.
  • H. Yannakoudakis, T. Briscoe, and B. Medlock (2011) A new dataset and method for automatically grading esol texts. In ACL, Cited by: §3.1.2.
  • Y. Zhang, T. Ge, F. Wei, M. Zhou, and X. Sun (2019) Sequence-to-sequence pre-training with data augmentation for sentence rewriting. arXiv preprint arXiv:1909.06002. Cited by: §2.2, §3.1.2.