In the last five years, automatic speech recognition (ASR) technology has advanced enough to be used in real-life applications. Recognition technology has been used extensively to transcribe speeches for large languages such as English, German or Spanish [1, 2, 3]. These systems are often composed of an ASR module to produce audio-to-text transcription and several natural language processing modules to improve text formatting. The main issue, however, is that neither module performs with perfect accuracy, so manual post-processing is needed to produce final transcriptions.
The system for automatically transcribing university lectures in Spanish [1] compared three post-processing approaches: one involving automatic corrections, another using lecturer corrections, and a third using a mixture of both. The system was tested on twenty lectures and the WER was compared to a real-time factor of the post-editing time versus the total duration of the lecture. The authors conclude that the edit time is directly correlated with the transcription accuracy, but the relationship between the real-time factor and WER was weak, perhaps due to the low WER range produced by the ASR. In the English transcription system [2], the challenge of achieving a low error rate in transcribing university lectures was handled using collaborative editing. The authors find that correcting transcripts with WER lower than 25% increases the editing effort. The transcription errors for lectures in German [3] were corrected using student edits, and this error correction was studied. During the transcription process the authors noted that their ASR made errors caused by uncommon and non-German terms in the lectures. Their analysis showed that the corrections of inexperienced editors tend to bring a high WER down to about 25%, corroborating the findings of [2].
Evaluation of post-editing transcribed speech was studied in [4], where the authors observe a strong variation in editing accuracy and speed among editors. They also note that low WER transcripts require advanced editing strategies to achieve error rate improvements comparable to improvements for high WER transcripts. Different transcription strategies were compared in [5], namely a fully manual post-editing of ASR transcripts and confidence-enhanced post-editing of ASR transcripts. The authors conclude that post-editing automatic transcripts is faster and yields more accurate transcripts than manually transcribing from scratch. This conclusion was further corroborated in [6], which dealt with automatic subtitling of videos.
This paper evaluates an ASR system in the context of transcribing speeches for the Icelandic parliament, Althingi. The system has only recently been developed for Icelandic [7, 8, 9, 10], and is now being incorporated into the transcription process of Althingi. The current manual procedure is done in two stages: an initial manuscript is obtained from a contracted transcription service, which is then post-edited by in-house specialists. The main objective of the current project is to replace the initial manual transcription process with an automatic speech recognizer. This is the first time an ASR system is used as a core component in transcription for the Icelandic language, and the purpose of this paper is threefold: to introduce the evaluation procedure, to present the first measurements of the manual post-editing, and to report on the performance of the system.
2 Transcription System for Althingi
The current transcription procedure for the Icelandic parliament is done in two stages, illustrated in Figure 1. The speeches are first created in the Althingi document management system, Documentum, as XML documents, with only the speech meta-data and a link to the speech in the MP3 format. Then, in the manual transcription stage the transcribers listen to the audio and transcribe the speech into the XML document (Text A). The initial transcript is meant to reflect the spoken record as accurately as possible but the transcribers might also enrich the text with minor changes. For example, they might add in different formatting for poems and remove repetitions. Next, in the manual editing stage the XML speech document is sent to the editors who modify the speech to be fit for publication and record their editing time. It is common that an editor corrects transcription errors, fixes grammar or enriches the text with context to make the parliamentary record clearer. Finally, the speeches (Text C) are published to their website.
The main objective of the current project is to replace the first stage, manual transcription, with an automatic speech recognizer. Before the experiment, the in-house specialists gave suggestions regarding relevant data to gather and discussed the important differences between the ASR and manual transcriptions. For the experiment, the manual transcription and ASR transcriptions were done in parallel. With the intent of using Text A as reference material, only the ASR transcriptions received manual post-editing. The experiment was performed for a week; on the first day, only the first stage was tested, to ensure the integration worked as intended. For that week the Icelandic parliament was in session for four days. It is from the last three days that this data was gathered.
2.1 The ASR system
The details about the preparation of the ASR training data and the development of the ASR can be found in [8]. The acoustic model is a deep neural network, based on a recipe developed for the Switchboard corpus (https://github.com/kaldi-asr/kaldi/blob/master/egs/swbd/s5c/local/chain/tuning/run_tdnn_lstm_1e.sh), using the Kaldi ASR toolkit [11]. It is a sequence-trained neural network based on lattice-free MMI [12]. It consists of seven time-delay neural network layers [13] and three long short-term memory layers [14]. Two language models are used [15]. The first one is a pruned trigram model, used in the decoding. The other one is a 5-gram language model, trained on the total parliamentary text set, 55M tokens, and used for re-scoring decoding results. The lexicon is based on the pronunciation dictionary from the Hjal project [16], available at Málföng (http://www.malfong.is). We added words from the language model training data that appeared three or more times, with some constraints, resulting in a dictionary of roughly 200k words. Inconsistencies in the pronunciation dictionary were also fixed. The WER of the ASR, before any post-processing is done, is on the test set, using 1500 hours of parliamentary speeches and corresponding text, for training. In real life, this number is going to be higher, partly because of imperfect punctuation reconstruction and disparate casing of many words in our texts, and partly because the ASR test set had been manually cleaned to better match the audio.
2.2 Automatic post-processing
The ASR returns a stream of words with no punctuation or formatting. Since the purpose of the system is to publish parliamentary speeches, human readability needs to be factored into the final transcription. The OpenGrm Thrax Grammar Development tool [17, 18] was used to compile grammars into weighted finite-state transducers, in order to denormalize numbers and abbreviations, according to parliamentary conventions.
The Punctuator toolkit [19] is used to restore punctuation in the text, specifically periods, question marks and colons. There are no clear rules for the use of commas in Icelandic, making learning their position difficult. Therefore, no commas are added to the ASR transcripts. Punctuator is a bidirectional recurrent neural network model with an attention mechanism. It can be trained on punctuation-annotated text only or, in a second stage, additionally on pause-annotated text. Both versions were tested, with the text-only training giving better results, an overall-score of versus a score of for the two stage training. These errors are obtained on well-structured text and are likely higher in the automatically transcribed speeches. The training text set contains roughly 50M words. The development and test sets contain 114k and 111k words, respectively. The pause-annotated text set contains 1.3M words. The pause-annotated development and test sets are each around 81k words. The pause information is obtained from existing alignment lattices, from earlier data preparations before the ASR training.
Apart from punctuation, formatting also plays a large role in human readability. Therefore, Thrax grammar rules are used to capitalize the start of sentences and to collapse expanded acronyms. They are also used for other small formatting tasks, such as timestamps, regulations, time intervals, and websites. Another important task for long texts is inserting paragraph breaks. Currently, a new paragraph is only started whenever the speaker of the house is addressed.
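The two formatting steps described above can be illustrated with a minimal sketch. It is not the Thrax grammar used in the system: plain Python regular expressions stand in for the finite-state rules, the address phrases in `SPEAKER_ADDRESSES` are hypothetical examples, and the sketch assumes that a paragraph break is triggered only when a sentence begins with such a phrase.

```python
import re
from typing import Sequence

# Hypothetical address forms; the phrases the real system reacts to are not given in the text.
SPEAKER_ADDRESSES: Sequence[str] = ("herra forseti", "virðulegi forseti")


def capitalize_sentences(text: str) -> str:
    """Uppercase the first letter of the text and of each sentence after '.' or '?'."""
    return re.sub(
        r"(^|[.?]\s+)(\w)",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


def split_paragraphs(text: str, addresses: Sequence[str] = SPEAKER_ADDRESSES) -> str:
    """Start a new paragraph at every sentence that begins by addressing the speaker."""
    sentences = re.split(r"(?<=[.?])\s+", text.strip())
    paragraphs, current = [], []
    for sentence in sentences:
        starts_with_address = sentence.lower().startswith(tuple(addresses))
        # Close the running paragraph before a sentence that addresses the speaker.
        if starts_with_address and current:
            paragraphs.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)


raw = "herra forseti. takk fyrir. herra forseti. er thetta rett?"
formatted = split_paragraphs(capitalize_sentences(raw))
```

Running the capitalization first keeps the two rules independent, since the paragraph rule compares lowercased sentence starts.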
2.3 Integration with the Althingi system
The ASR needs to connect with the Althingi servers in four different instances. This is enabled for the first three occurrences via a representational state transfer application programming interface (RESTful API). The API first takes in the timestamp of when the speech ends through a GET request. Then, using the ending timestamp, the ASR server queries Althingi’s metadata server to obtain the timestamp of when the speech started. With the two timestamps, the ASR server queries Althingi’s audio server for the audio segment, which the ASR server then downloads. The rest of the experiment is semi-automatic. The audio is then transcribed by the ASR. Next, the ASR TXT documents (Text B of Figure 2) are batched and wrapped in the speech metadata as well as XML tags. Finally, they are copied from the ASR server and manually entered into the Documentum editor queue. After the speeches are post-edited (Text D), they are posted onto the Althingi website (http://www.althingi.is).
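The request sequence described above (end timestamp, then start timestamp, then audio segment, then XML wrapping) can be sketched as follows. This is only an illustration under stated assumptions: the endpoint URLs, query parameter names, and XML tag names are hypothetical, since the paper does not specify them, and the HTTP call is passed in as a function so the flow can be followed without a network.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict
from urllib.parse import urlencode

# Hypothetical endpoints: the real Althingi server addresses are not given in the text.
METADATA_ENDPOINT = "https://metadata.example/speech-start"
AUDIO_ENDPOINT = "https://audio.example/segment"


def fetch_speech_audio(end_timestamp: str, http_get: Callable[[str], bytes]) -> Dict[str, object]:
    """Resolve the start timestamp from the end timestamp, then download the audio."""
    metadata_url = METADATA_ENDPOINT + "?" + urlencode({"end": end_timestamp})
    start_timestamp = http_get(metadata_url).decode("utf-8").strip()
    audio_url = AUDIO_ENDPOINT + "?" + urlencode(
        {"start": start_timestamp, "end": end_timestamp}
    )
    return {"start": start_timestamp, "end": end_timestamp, "audio": http_get(audio_url)}


def wrap_transcript(transcript: str, metadata: Dict[str, str]) -> str:
    """Wrap a plain-text transcript and its metadata in (hypothetical) XML tags."""
    root = ET.Element("speech")
    meta = ET.SubElement(root, "metadata")
    for key, value in metadata.items():
        ET.SubElement(meta, key).text = value
    ET.SubElement(root, "text").text = transcript
    return ET.tostring(root, encoding="unicode")
```

In a live setting, `http_get` could be something like `lambda url: urllib.request.urlopen(url).read()`; injecting it keeps the sketch independent of any particular server.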
Currently, the ASR is housed on its own server and interacts with the rest of the Althingi servers through the RESTful API. The ASR runs on an Ubuntu 16.04 server with 4 CPUs and 16 GB of RAM. The number of parallel transcriptions is limited by the number of CPUs in the server. During the test, 3-4 speeches were processed in parallel because members of parliament tended to deliver speeches faster than the ASR could transcribe them.
2.4 ASR integration concerns
Since this integration was only for the experiment, not permanent, the ASR speeches needed to be delivered to the editors while still keeping the existing transcription procedure intact. In order to manage this, automatically transcribed speeches were manually entered into Documentum. To accomplish the goal of keeping the test procedure separate but integrated, Althingi put the ASR speeches in their own separate folder which was then integrated with the normal procedure at the post-editing stage through the Documentum lifecycles. The in-house specialists’ queue only showed the ASR transcriptions.
Despite familiarity with the technical details, a deeper understanding of the Althingi speech publishing procedure was needed. Thus, several of their in-house specialists were asked for insight into important details of the post-editing procedure that could not be gleaned from data. For example, the idea of inserting a new paragraph when parliamentarians address the speaker of the house, which usually signals a change in topic, originated from these specialists.
The primary objective of this experiment is to discover the impact of switching from manual transcriptions to automatic transcriptions on the Althingi publication department’s work. Figure 2 illustrates the experimental setup. First, the speech segment is sent to both the manual transcription stage and the ASR stage, where Text A and Text B are created, respectively. Then, only Text B is sent to the editor queue for the in-house specialists to post-edit and produce the final transcription, Text D.
Over a three-day period, 279 speeches were delivered in the Icelandic parliament. However, at the conclusion of the experiment, 35 speeches still had not been processed by the publications department, and 5 speeches were duplicates. Therefore, only 239 speeches could be analysed. The data collected includes the following: speech length, word count, edit time, editor feedback and calculations of the subsequent measures. The system was evaluated using the following measures: 1) word edit distance (WED) [%], 2) edit time per word (ET/W) [s/w], and 3) real-time factor (RTF) [-]. All three measures reflect an editor’s effort in processing a transcribed speech. The WED was calculated using the following formula:
$$\mathrm{WED} = \frac{S + I + D}{N} \times 100\%$$
where S, I, D and N are the numbers of substitutions, insertions, deletions and total words, respectively, obtained by aligning the texts. This formula is identical to WER, but since it also reports on the edit distance between transcription and final text, the term word edit distance is preferred. The RTF was computed as the edit time in seconds divided by the speech length in seconds.
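As an illustration of how these measures could be computed, the sketch below aligns two texts at the word level with a standard minimum-edit-distance dynamic program and counts substitutions, insertions and deletions. It is not the alignment tool used in the study; it assumes whitespace tokenization and takes N as the number of words in the reference text.

```python
from typing import List, Tuple


def align_counts(reference: List[str], hypothesis: List[str]) -> Tuple[int, int, int]:
    """Return (substitutions, insertions, deletions) of a minimum-edit alignment."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    # cost[i][j] = (total edits, S, I, D) for reference[:i] against hypothesis[:j]
    cost = [[(0, 0, 0, 0)] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = (i, 0, 0, i)  # only deletions
    for j in range(1, cols):
        cost[0][j] = (j, 0, j, 0)  # only insertions
    for i in range(1, rows):
        for j in range(1, cols):
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = cost[i - 1][j - 1]  # words match, no edit
            else:
                t, s, ins, d = cost[i - 1][j - 1]
                diagonal = (t + 1, s + 1, ins, d)  # substitution
            t, s, ins, d = cost[i][j - 1]
            insertion = (t + 1, s, ins + 1, d)
            t, s, ins, d = cost[i - 1][j]
            deletion = (t + 1, s, ins, d + 1)
            cost[i][j] = min(diagonal, insertion, deletion, key=lambda c: c[0])
    _, s, ins, d = cost[rows - 1][cols - 1]
    return s, ins, d


def word_edit_distance(reference: str, hypothesis: str) -> float:
    """WED in percent: (S + I + D) / N * 100, with N the reference word count."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    s, ins, d = align_counts(ref_words, hyp_words)
    return 100.0 * (s + ins + d) / len(ref_words)


def real_time_factor(edit_seconds: float, speech_seconds: float) -> float:
    """RTF: edit time divided by speech length, both in seconds."""
    return edit_seconds / speech_seconds


# One substitution (b -> x) and one deletion (d) over four reference words.
print(word_edit_distance("a b c d", "a x c"))  # 50.0
```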
For Text A, the transcribers were asked to transcribe the audio as faithfully as possible, leaving in speaker errors, in order to get good reference texts. Since this is contrary to the work they normally do, some of the texts were not true-to-audio. The manual transcriptions tended to have small corrections since repetitions are removed, badly structured sentences are corrected, and three words common to parliamentary speeches are abbreviated. In addition, they also contained spelling mistakes and word substitutions due to malformed speech. Despite these flaws, Text A is still a better reference than Text D when estimating errors, as both automatic and manual transcription aim to produce audio-to-text transcription. However, the DB alignment (Text D against Text B) better reflects the work the editors do to make an automatically transcribed text publishable. Hence, one would expect the WED between Text B and Text D to better explain ET/W than the edit distance between Text A and Text B (the AB alignment). The DB results are therefore interpreted with the AB results as a check and guide.
Editors gave feedback in the form of comments and grades for the whole system and on individual speeches. While recording the edit time for a speech the editors were also asked to grade and comment on the speech based on their own perceptions. There were no guidelines other than giving the speeches a grade (Good, Medium, or Bad). Not giving them guidelines better simulates their day-to-day feelings. After the experiment they filled out a short survey with their evaluations of the current procedure versus the inclusion of the automatic transcription system.
In the succeeding week, speeches were edited with the procedure illustrated in Figure 1 to produce the results for the fully manual procedure, referenced as ’Fully Manual’ later in the text. They lend insight into the speed of the fully manual transcription process.
The ultimate goal of the automatic transcription system is to replace human transcribers, so the obvious benchmark to compare against is the set of results for the fully manual transcription process. This would include matching the ET/W and RTF, but not necessarily the word edit distance.
| | ET/W [s/w] | RTF [-] |
| Fully Manual | 1.32 ± 0.51 | 2.66 ± 1.05 |
| Automatic | 1.52 ± 0.53 | 3.26 ± 1.24 |
The DB alignment results are summarized in Table 1. The analysis shows that the automatic process under-performs when compared to the fully manual process: the ET/W is higher by 0.20 s/w and the RTF by 0.60. The initial hypothesis was that the edit distance would be the main factor affecting the edit time, that the higher the distance, the higher the edit time, and hence that both RTF and ET/W would correlate positively with WED. To test this hypothesis, a linear regression was fitted to model the dependence of ET/W and RTF on WED, and Pearson’s correlation coefficient (PCC) between each pair of variables was computed. The results are as follows:
ET/W = 0.019 * WED + 1.08
RTF = 0.03 * WED + 2.56
PCC(WED,ET/W) = 0.33
PCC(WED,RTF) = 0.22
These are the observations from this analysis: 1) reducing the ET/W to the manual level would require lowering the WED to 12.6%, 2) even perfect automatic transcriptions (WED = 0) would outperform the manual transcription RTF by only about 0.10 (2.56 versus 2.66), since the cost of reading through the transcription, independent of any errors, far outweighs the impact of errors, 3) the correlation between the variables is low, indicating that WED might not be the best predictor of editing effort.
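The arithmetic behind these observations can be retraced with a short sketch. The functions below implement an ordinary least-squares line fit and Pearson's correlation coefficient in plain Python; they are an assumption about the kind of analysis performed, not the authors' code. The slopes, intercepts and manual averages are the values reported in the text, while the function names are illustrative.

```python
from math import sqrt
from typing import Sequence, Tuple


def linear_fit(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)


def wed_for_target(slope: float, intercept: float, target: float) -> float:
    """Invert the fitted line: the WED at which the measure equals `target`."""
    return (target - intercept) / slope


# Reported fit ET/W = 0.019 * WED + 1.08 and the manual ET/W of 1.32 s/w.
break_even_wed = wed_for_target(0.019, 1.08, 1.32)  # about 12.6 (%)

# Reported fit RTF = 0.03 * WED + 2.56: at WED = 0 the RTF equals the intercept,
# so the margin over the manual RTF of 2.66 is about 0.10.
rtf_margin_at_zero_wed = 2.66 - 2.56
```

With per-speech measurements, `linear_fit(wed_values, etw_values)` and `pearson(wed_values, etw_values)` would be the calls that yield figures of the kind listed above.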
The following assessment focused on system performance in terms of several error types: ASR, punctuation, capitalization and abbreviation mismatches. The analysis showed that by far the majority of transcription errors are due to the ASR and to wrong punctuation, followed by capitalization and abbreviation, respectively. As a consequence, an improvement to the ASR appears to be of the highest priority. However, in the post-experiment survey the editors frequently commented on inaccurate punctuation, which prompted us to single out punctuation from the other errors and to study its effect on editing time. This approach also helped answer the question of whether a certain type of error takes more time to correct than others.
The following results distinguish between WED with punctuation (WED_wp) and without punctuation (WED_wop). The data was categorized into two groups, highs (H) and lows (L), representing speeches with high or low WED, with and without punctuation. Table 2 summarizes average edit distance values for DB alignment, and Table 3 for AB alignment. High was chosen as and low as . The assumption is that speeches with both high WED_wp and low WED_wop will give insight into the effect of punctuation errors on editing time. Likewise, speeches with low WED_wp and high WED_wop are assumed to give insight into the effect of all other errors on the editing time. The H-H group provides an opportunity to study deficiencies of our system that need to be addressed. The L-L group, on the other hand, gives an impression of the current performance ceiling. The tables show that excluding punctuation from transcripts improved the WED metric by for both AB and DB alignment in absolute terms, likely due to the removal of the start-of-sentence capitalization errors introduced by the punctuation module. However, the editors always get transcripts with punctuation, so the values of WED_wp are more relevant to editing effort. The same assumption is true with regard to DB alignment.
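The grouping described above can be sketched as a small classification function. The thresholds are deliberately left as parameters because their values are not recoverable from the text; the inclusive comparisons and the label order (WED_wp first, WED_wop second) are assumptions made for this illustration.

```python
from typing import Optional


def level(wed: float, low_threshold: float, high_threshold: float) -> Optional[str]:
    """Return 'H' or 'L', or None when the value falls between the two thresholds."""
    if wed >= high_threshold:
        return "H"
    if wed <= low_threshold:
        return "L"
    return None


def assign_group(
    wed_wp: float, wed_wop: float, low_threshold: float, high_threshold: float
) -> Optional[str]:
    """Label a speech as H-H, H-L, L-H or L-L (WED_wp first, WED_wop second)."""
    with_punctuation = level(wed_wp, low_threshold, high_threshold)
    without_punctuation = level(wed_wop, low_threshold, high_threshold)
    if with_punctuation is None or without_punctuation is None:
        return None  # the speech belongs to neither a high nor a low group
    return f"{with_punctuation}-{without_punctuation}"
```

For example, with placeholder thresholds of 15 and 25, `assign_group(30.0, 10.0, 15.0, 25.0)` returns `"H-L"`, a speech whose errors are dominated by punctuation under the assumptions above.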
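Table 2 (DB alignment):
| | WED_wp [%] | WED_wop [%] |
| Average | 24.20 ± 9.56 | 19.58 ± 9.59 |

Table 3 (AB alignment):
| | WED_wp [%] | WED_wop [%] |
| Average | 20.12 ± 5.80 | 14.35 ± 4.78 |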
The average values of ET/W, RTF and speech count for DB alignment are summarized in Table 4. Singling out the L-L group also showed that when the transcription system is doing as well as it can, the corresponding ET/W (1.36) is comparable to the effort for manual transcripts (1.32). This corresponds to a 3% relative difference. The relative difference for RTF reached about 10%. On the other hand, the relative differences between the H-H group and the Fully Manual process reached 22.8% and 31.9% for ET/W and RTF, respectively. The direct comparison between H-H and L-L shows a similarly dramatic increase in both measures, which is consistent with the expectation that a higher edit distance means a higher editing effort. The immediate concern for the in-between groups is a lack of data to draw statistically significant conclusions.
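As a worked check of the relative difference quoted for the L-L group, assuming it is taken relative to the Fully Manual value:

$$\frac{\mathrm{ET/W}_{\text{L-L}} - \mathrm{ET/W}_{\text{manual}}}{\mathrm{ET/W}_{\text{manual}}} = \frac{1.36 - 1.32}{1.32} \approx 0.030 = 3.0\%$$

Read the same way, the 22.8% reported for the H-H group would correspond to an ET/W of roughly $1.32 \times 1.228 \approx 1.62$ s/w. This is a back-calculation under the same assumption, not a value taken from the tables.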
The average values of ET/W, RTF and speech count for AB alignment are summarized in Table 5. The general trends for the H-H and L-L groups are identical to the DB alignment, in line with our initial hypothesis. This time, however, there were some data points in the in-between categories. The data shows that high WED_wop leads to higher edit times. Therefore, from the numbers themselves one might conclude that punctuation errors take less time to correct than other errors, most of which are ASR-based. This observation is further supported by fixing WED_wop as H and changing WED_wp. Moreover, the H-L cluster shows lower ET/W and RTF than even the manual process, which indicates that the punctuation errors are less severe than the other errors. The main issue with these conclusions, however, is that there are too few data points, so further experiments are needed to confirm or refute this finding.
Part of the experiment was also to obtain subjective evaluations of the system from the editors. The editors’ independent evaluation of most speeches showed that, of the 234 graded speeches, 26 were graded as Bad, 105 as Medium, and 103 as Good. Comparing the grades to the H-H/L-L groups shows that 68%/70% of the L-L speeches were graded as Good and only 2%/1% as Bad, for Texts AB and Texts DB, respectively. In the H-H group, 27%/21% of the speeches were graded as Bad and 18%/18% as Good, for Texts AB and Texts DB, respectively. These numbers are highly subjective and vary between editors, but they give an otherwise hard-to-obtain insight into how the in-house editors’ opinions line up with the other data.
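The overall grade shares follow directly from the counts above; the short sketch below only restates that arithmetic, and the variable names are illustrative.

```python
# Grade counts reported for the 234 graded speeches.
grade_counts = {"Bad": 26, "Medium": 105, "Good": 103}
total_graded = sum(grade_counts.values())

# Share of each grade in percent, rounded to one decimal place.
grade_shares = {
    grade: round(100 * count / total_graded, 1)
    for grade, count in grade_counts.items()
}
print(grade_shares)  # {'Bad': 11.1, 'Medium': 44.9, 'Good': 44.0}
```

The share of Bad grades is the 11% figure referred to in the conclusions.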
Reading the comments the editors wrote about the speeches in these two groups shows some differences. More of the speeches in the H-H groups have comments, and the comments are longer. Most prominent are complaints about word substitutions and deletions, as well as incorrect punctuation. For these speeches the editors often mention that the speaker is hard to understand. Comments in the L-L group are fewer. However, complaints about incorrect capitalization are more prominent.
Multiple factors also contributed to a higher edit time and WED: dealing with Roman numerals, differences in repetitions or incorrect capitalization. But the edit time alone also had many factors contributing to it other than WED. Editors often formatted the text, such as splitting the speech into paragraphs. Sometimes, editors researched references to bills mentioned in the speeches. Other times, they had to pay close attention to the audio. Members of parliament would sometimes mention named-entities from different languages, which is outside of the scope of a monolingual ASR.
This paper evaluated the transcription system for the Icelandic Parliament, Althingi. The purpose of the system is to automatically transcribe parliamentary speeches, which are then manually edited by in-house editors and published on the Althingi website. The objective of the analysis was to gain insight on the system performance with respect to editing effort. The analysis focused on determining the relationship of the word edit distance with respect to edit time per word and the real-time factor of edits. The secondary focus was to evaluate the contribution of punctuation-related errors and the quality of automatically produced transcriptions as perceived by the editors.
The study shows that editors currently take more time to edit automatic transcripts than manual transcripts, as both observed measures, ET/W and RTF, were higher. Further analysis shows that a high WED increases edit time. Improving the automatic transcription to the level exhibited by the manual transcription process would require lowering DB WED_wp to 12.6%. This conclusion is further supported by selectively looking at results for low error rate speeches, as the corresponding ET/W and RTF are similar to the values for the fully manual transcription process. Despite the comments from editors, the analyses do not show punctuation making a significant contribution to edit time. Further analysis is warranted before a decisive conclusion on this matter is reached. Given that only 11% of transcriptions received a bad grade, the Althingi in-house editors appear to be satisfied with the experimental transcription system and its integration. Further teasing out and grouping of the factors would provide useful insights into what else the integration of an ASR into an existing transcription process requires.
[1] Juan Daniel Valor Miró, Joan Albert Silvestre-Cerdà, Jorge Civera, Carlos Turró, and Alfons Juan, “Efficiency and usability study of innovative computer-aided transcription strategies for video lecture repositories,” Speech Communication, vol. 74, pp. 65–75, 2015.
[2] Cosmin Munteanu, Ron Baecker, and Gerald Penn, “Collaborative editing for improved usefulness and usability of transcript-enhanced webcasts,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2008, pp. 373–382.
[3] Henrich Kolkhorst, Kevin Kilgour, Sebastian Stüker, and Alex Waibel, “Evaluation of interactive user corrections for lecture transcription,” in International Workshop on Spoken Language Translation (IWSLT) 2012, 2012.
[4] Matthias Sperber, Graham Neubig, Jan Niehues, Satoshi Nakamura, and Alex Waibel, “Transcribing against time,” Speech Communication, vol. 93, pp. 20–30, 2017.
[5] Matthias Sperber, Graham Neubig, Satoshi Nakamura, and Alexander H. Waibel, “Optimizing computer-assisted transcription quality with iterative user interfaces,” in Proc. LREC 2016, 2016.
[6] Juan Daniel Valor Miro, Pau Baquero-Arnal, Jorge Civera, Carlos Turro, and Alfons Juan, “Multilingual videos for MOOCs and OER,” Journal of Educational Technology & Society, vol. 21, no. 2, pp. 1–12, 2018.
[7] Jon Gudnason, Oddur Kjartansson, Jökull Jóhannsson, Elín Carstensdóttir, Hannes Högni Vilhjálmsson, Hrafn Loftsson, Sigrún Helgadóttir, Kristín Jóhannsdóttir, and Eiríkur Rögnvaldsson, “Almannarómur: an open Icelandic speech corpus,” in Proc. SLTU 2012, 2012, pp. 80–83.
[8] Inga Run Helgadóttir, Róbert Kjaran, Anna B. Nikulásdóttir, and Jon Gudnason, “Building an ASR corpus using Althingi’s parliamentary speeches,” in Proc. Interspeech 2017, 2017, pp. 2163–2167.
[9] Jon Gudnason, Matthías Pétursson, Róbert Kjaran, Simon Klüpfel, and Anna Björk Nikulásdóttir, “Building ASR corpora using Eyra,” in Proc. Interspeech 2017, 2017, pp. 2173–2177.
[10] Anna B. Nikulásdóttir, Inga R. Helgadóttir, Matthías Pétursson, and Jon Gudnason, “Open ASR for Icelandic: Resources and a Baseline System,” in Proc. LREC 2018, 2018.
[11] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., “The Kaldi speech recognition toolkit,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2011.
[12] Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahrmani, Vimal Manohar, Xingyu Na, Yiming Wang, and Sanjeev Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” in Proc. Interspeech 2016, 2016, pp. 2751–2755.
[13] Alex Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and Kevin J Lang, “Phoneme recognition using time-delay neural networks,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 3, pp. 328–339, 1989.
[14] Haşim Sak, Andrew Senior, and Françoise Beaufays, “Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition,” arXiv preprint arXiv:1402.1128, 2014.
[15] Kenneth Heafield, “KenLM: Faster and smaller language model queries,” in Proc. of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2011, pp. 187–197.
[16] Eiríkur Rögnvaldsson, “The Icelandic speech recognition project Hjal,” Nordisk Sprogteknologi. Årbog, pp. 239–242, 2003.
[17] Terry Tai, Wojciech Skut, and Richard Sproat, “Thrax: An open source grammar compiler built on OpenFst,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2011.
[18] Brian Roark, Richard Sproat, Cyril Allauzen, Michael Riley, Jeffrey Sorensen, and Terry Tai, “The OpenGrm open-source finite-state grammar software libraries,” in Proc. of the ACL 2012 System Demonstrations. Association for Computational Linguistics, 2012, pp. 61–66.
[19] Ottokar Tilk and Tanel Alumäe, “Bidirectional recurrent neural network with attention mechanism for punctuation restoration,” in Proc. Interspeech 2016, 2016, pp. 3047–3051.