Spoken language transcripts generated by automatic speech recognition (ASR) systems usually have no punctuation marks, and for spontaneous speech they often contain many speech disfluencies. However, many downstream applications, such as machine translation and dialogue systems, are typically trained on well-formed text with proper punctuation marks and without disfluencies. Hence, there is a significant mismatch between the training corpora and the actual speech-transcript input for these applications, causing dramatic performance degradation. In addition, the lack of punctuation marks and the presence of disfluencies reduce the readability of speech transcripts. Consequently, predicting punctuation and detecting (and removing) disfluencies have become crucial post-processing tasks for speech transcripts.
One example of a speech transcript is “I want a flight to Boston um to Denver”. For punctuation prediction, we annotate whether a specific type of punctuation mark (e.g., period or comma) follows each word; in this case, a period follows the word “Denver”. Disfluency annotation covers the reparandum and the interregnum. The reparandum consists of words that are corrected by the following words or are to be discarded, and includes repetitions, repairs, and restarts. The interregnum includes filled pauses, discourse markers, etc. In this case, the phrase “to Boston” is annotated as reparandum and “um” as interregnum.
A critical challenge for punctuation prediction and disfluency detection in real-time spoken language processing systems is latency. For example, simultaneous translation systems require fixed partial post-processed transcripts and decode them incrementally (the prefix-to-prefix framework) to minimize latency. In this work, we tackle the challenge of reducing latency from two aspects, the modeling approach and the decoding strategy, while maintaining high accuracy. Previous state-of-the-art approaches use a Transformer encoder-decoder model for punctuation prediction and disfluency detection. The encoder-decoder model consists of an encoder and an auto-regressive decoder, which prevents massive parallelization during inference. Hence, it is difficult to employ such a model in a real-time punctuation prediction and disfluency detection system due to its low inference efficiency. Past research [4, 5] showed that jointly modeling punctuation prediction and disfluency detection can improve generalization, enhance the overall efficiency of the pipeline, and avoid error propagation. Inspired by the success of self-attention mechanisms for sequence labeling tasks, we propose a Controllable Time-delay Transformer (CT-Transformer) model that jointly models punctuation prediction and disfluency detection. To achieve punctuation prediction and disfluency detection in real time, the proposed CT-Transformer uses only the encoder part of the Transformer encoder-decoder structure [2, 3]. Longer context is preferred for better prediction and detection performance, but it also results in higher latency. Compared to cutting off context during inference, CT-Transformer provides a principled way to freeze partial outputs with a controllable time delay, fulfilling the real-time constraints on partial decoding required by subsequent applications.
Previous work studied different decoding strategies to reduce latency in real-time spoken language processing systems, including overlapping windows, a streaming input scheme, and an overlapped-chunk split and merging strategy. However, under these strategies the input text for inference does not always begin with the first word of a sentence, so they may ignore crucial context for predicting punctuation and detecting disfluencies. Tilk et al. proposed a decoding strategy that always begins with the first word of a sentence, but because it partitions the input sequence into 200-word slices, it cannot be used in real-time streaming systems due to its high latency. We propose a fast decoding strategy that minimizes latency while maintaining competitive performance. This strategy guarantees that the input text for inference always begins with the first word of a sentence. Meanwhile, to reduce computational complexity, the strategy dynamically discards history that grows too long, based on already predicted punctuation marks.
In addition, most previous approaches to punctuation prediction and disfluency detection are supervised approaches and rely heavily on human-annotated speech transcripts, which are expensive to obtain. To tackle the training data bottleneck, we investigate transfer learning to exploit existing large-scale well-formed text corpora.
Our contributions can be summarized as follows: 1) We propose a Controllable Time-delay Transformer (CT-Transformer) model that jointly models punctuation prediction and disfluency detection through multi-task learning. CT-Transformer provides a principled approach to freezing partial outputs with a controllable time delay, fulfilling the real-time constraints on partial decoding required by subsequent applications. To the best of our knowledge, this is the first work to employ self-attention networks for jointly modeling punctuation prediction and disfluency detection, and the first to provide a controllable time-delay capability for these tasks. 2) We propose a fast decoding strategy that minimizes latency while maintaining competitive performance for stream processing. 3) We investigate transfer learning to utilize existing large-scale well-formed text corpora. 4) Experimental results on the IWSLT2011 benchmark test set and an in-house annotated Chinese dataset show that our approach outperforms previous state-of-the-art models on F-score with competitive latency, and fulfills the real-time constraints.
2 Related Work
Punctuation prediction models fall into three major categories: hidden inter-word event detection (n-gram language models, Hidden Markov Models (HMMs)); sequence labeling, which assigns a punctuation mark to each word [14, 15] (conditional random fields (CRFs), convolutional neural networks (CNNs), recurrent neural networks (RNNs) and their variants [10, 19]); and sequence-to-sequence modeling, in which the source is unpunctuated text and the target is punctuated text or sequences of punctuation marks [2, 21].
For disfluency detection, previous methods fall into four categories: sequence labeling, parsing-based approaches, noisy channel models, and encoder-decoder models. Sequence labeling methods label each word as fluent or disfluent using different model structures, including CRFs, HMMs, RNNs, and others [24, 25]. Noisy channel models use the similarity between reparandum and repair as an indicator of disfluency [26, 27]. Parsing-based approaches jointly model syntactic parsing and disfluency detection [28, 29]. Encoder-decoder models treat disfluency detection as a sequence-to-sequence problem [30, 3].
Previous work also investigated masked self-attention mechanisms for natural language processing. Shen et al. proposed the diag-disabled mask, forward mask, and backward mask for language understanding. Song et al. investigated local and directional masks in the Transformer for machine translation. However, these masked self-attention mechanisms differ from our proposed controllable time-delay self-attention, as explained in Section 3.2.
3 Proposed Approach
The proposed model is illustrated in Figure 1. The inputs are transcripts, e.g., “I want a flight to Boston um to Denver”. The outputs are punctuation and disfluency labels in the BIO scheme, e.g., “O O O O O O O O .” and “O O O O B-RM I-RM B-IM O O”, where “B”, “I”, and “O” denote the Beginning, Inside, and Outside of a text segment, and “RM” and “IM” denote reparandum and interregnum, respectively.
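To make the labeling concrete, here is a small Python sketch that applies the disfluency BIO tags from the example above to recover the fluent transcript; the function name and helper are our own illustration, not part of the paper:

```python
# Hypothetical illustration of the BIO labeling scheme on the example
# transcript; tag names (B-RM, I-RM, B-IM, O) follow the text above.

def strip_disfluencies(tokens, disfluency_tags):
    """Remove tokens tagged as reparandum (RM) or interregnum (IM)."""
    return [tok for tok, tag in zip(tokens, disfluency_tags) if tag == "O"]

tokens = ["I", "want", "a", "flight", "to", "Boston", "um", "to", "Denver"]
punct_tags = ["O", "O", "O", "O", "O", "O", "O", "O", "."]           # period after "Denver"
disfl_tags = ["O", "O", "O", "O", "B-RM", "I-RM", "B-IM", "O", "O"]  # "to Boston" = RM, "um" = IM

print(strip_disfluencies(tokens, disfl_tags))
# -> ['I', 'want', 'a', 'flight', 'to', 'Denver']
```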
3.1 Model Architecture
The input embedding consists of word embeddings and position embeddings (sinusoidal position encoding). The encoder consists of a stack of identical layers. Each layer has two sub-layers: a multi-head self-attention sub-layer and a fully connected feed-forward sub-layer. The output layers consist of a punctuation tagging layer and a disfluency tagging layer. Following the multi-task learning paradigm, the encoder is shared between the two tasks while the tagging layers are task-specific. The final hidden states of the encoder are fed into the corresponding softmax layers for classification over punctuation labels and disfluency labels, respectively. The total loss is the sum of the cross-entropy losses of punctuation prediction and disfluency detection.
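The shared-encoder, two-head output structure described above can be sketched in NumPy as follows; the shapes, label-set sizes, and random weights are stand-ins for illustration, not the paper's actual configuration:

```python
import numpy as np

# Sketch of the multi-task output: the same encoder hidden states feed two
# softmax tagging layers, and the total loss sums the two cross-entropies.
rng = np.random.default_rng(0)
seq_len, hidden, n_punct, n_disfl = 9, 16, 5, 5   # assumed sizes, for illustration

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

hidden_states = rng.normal(size=(seq_len, hidden))   # shared encoder output
W_punct = rng.normal(size=(hidden, n_punct))         # punctuation tagging layer
W_disfl = rng.normal(size=(hidden, n_disfl))         # disfluency tagging layer

punct_probs = softmax(hidden_states @ W_punct)
disfl_probs = softmax(hidden_states @ W_disfl)

punct_labels = rng.integers(0, n_punct, size=seq_len)
disfl_labels = rng.integers(0, n_disfl, size=seq_len)
total_loss = cross_entropy(punct_probs, punct_labels) + cross_entropy(disfl_probs, disfl_labels)
```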
3.2 Controllable Time-delay Self-attention
Different from the full-sequence self-attention in the original Transformer encoder, we propose a controllable time-delay self-attention mechanism that encourages the model to depend on future words only within a short time window instead of the full sequence, so that partial outputs can be fixed to fulfill the real-time constraints on partial decoding required by subsequent applications. The original self-attention mechanism builds upon the scaled dot-product attention, operating on a query $Q$, key $K$, and value $V$:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \quad (1)$$
where $d_k$ is the dimension of the keys. To encourage punctuation prediction and disfluency detection to depend on future words only within a short time window, we need to block the flow of information from distant future words into the encoder. To achieve this, we modify the scaled dot-product attention by masking out (setting to $-\infty$) all values in the input to the softmax that correspond to the unwanted distant future words (illegal connections). We call this new mechanism controllable time-delay self-attention (CT self-attention). Equation (1) is modified as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V \quad (2)$$

where the mask matrix $M$ is

$$M_{ij} = \begin{cases} 0, & j - i \le l \\ -\infty, & \text{otherwise,} \end{cases}$$

with $l$ the fixed length of allowed future words in a layer.
With mask $M$, each position $i$ attends only to positions $j$ within the fixed length of future words and to all history words, as illustrated at the bottom of Figure 1. Denote by $l_n$ the fixed length in Layer $n$ of the encoder; a fixed length of zero corresponds to the mask usually used in the encoder-decoder framework to preserve the auto-regressive property. With fixed lengths $l_1$ in “CT-Transformer Layer 1” and $l_2$ in “CT-Transformer Layer 2”, the total number of seen future words after the two layers is $l_1 + l_2$; in general, the maximum number of seen future words for each word in CT-Transformer is $\sum_n l_n$. CT self-attention is an extension of the previous forward mask and local mask: if all $l_n = 0$, CT self-attention degenerates into the forward mask; if attention were additionally restricted to a fixed length of history words, it would become the local mask.
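The CT self-attention mask can be sketched in NumPy as follows; this assumes a single fixed future length per layer and is our own illustration, not the paper's implementation:

```python
import numpy as np

# Controllable time-delay mask: position i may attend to all history and to
# at most `future_len` future positions; other entries get -inf before softmax.
def ct_mask(seq_len, future_len):
    m = np.zeros((seq_len, seq_len))
    for i in range(seq_len):
        for j in range(seq_len):
            if j - i > future_len:      # too far in the future: block
                m[i, j] = -np.inf
    return m

def ct_attention(q, k, v, future_len):
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k) + ct_mask(q.shape[0], future_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # masked softmax
    return weights @ v

# e.g. with future_len=1, row 0 of the mask blocks columns 2 and 3
print(ct_mask(4, 1))
```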
3.3 Fast Decoding Strategy
To simulate the actual streaming scenario for real-time punctuation prediction and disfluency detection systems, we remove segmentation from the transcripts, so there is only a single input utterance in our evaluations, keeping the same setup as previous work. During training, half of the samples (utterances) are appended with randomly truncated segments, to discourage the model from always predicting end-of-utterance punctuation at the end. We propose a fast decoding strategy with a low frame rate, shown in Algorithm 1, to reduce latency while maintaining competitive performance.
4 Experiments

We evaluate punctuation prediction on the English IWSLT2011 benchmark dataset, and both punctuation prediction and disfluency detection on an in-house Chinese dataset. The IWSLT2011 benchmark contains three types of punctuation marks (comma, period, and question mark); we follow the data organization and use the same tokenized data as Che et al. (https://github.com/IsaacChanghau/neural_sequence_labeling). Since no public Chinese corpus with both punctuation and disfluency annotations was available at the time of this work, we annotated transcripts of about 240K spoken utterances with punctuation and disfluency labels, and randomly partitioned the data into train, dev, and test sets. We use Jieba (https://github.com/fxsjy/jieba) for word segmentation. The punctuation annotations consist of four types of punctuation marks (comma, period, question mark, and enumeration comma). We use the BIO scheme to annotate the two types of disfluencies, reparandum and interregnum, for sequence labeling of disfluency detection. Note that in this work, the train, dev, and test sets of both the IWSLT2011 and Chinese datasets are manual transcripts; in future work, we will evaluate the robustness of the proposed approach on ASR transcripts.
To reduce reliance on expensive annotation of speech transcripts, we explore the pre-training and fine-tuning transfer learning method, using existing large-scale well-formed text for pre-training. We crawl public Internet resources (news, Wikipedia, question-answering data, discussion forums, etc.) to create two large-scale corpora, one for English and one for Chinese. We use heuristic rules to map all punctuation marks in the crawled text onto the punctuation marks of the English IWSLT2011 dataset and the Chinese dataset, respectively. For the crawled Chinese text, we randomly insert reparanda and interregna using heuristic rules similar to prior work on disfluency detection, and use Jieba for word segmentation. The processed English and Chinese crawled texts are used for pre-training, denoted the IWSLT2011 Train-pretrain and Chinese Train-pretrain datasets, respectively. The data statistics are summarized in Table 1. We evaluate punctuation prediction and disfluency detection using token-based precision (P), recall (R), and F-score (F), following previous work [18, 27].
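Token-based P/R/F over non-“O” labels can be computed as below; this is one plausible reading of the metric, and the exact matching conventions in the cited works may differ:

```python
# Token-level precision/recall/F-score: a token counts as a true positive when
# its predicted non-"O" tag exactly matches the gold tag.
def prf(gold, pred):
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != "O")
    fp = sum(1 for g, p in zip(gold, pred) if p != "O" and g != p)
    fn = sum(1 for g, p in zip(gold, pred) if g != "O" and g != p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```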
Table 2 (last group): results of models without pre-training on the IWSLT2011 test set (P / R / F per punctuation type).

| Model | Comma (P/R/F) | Period (P/R/F) | Question (P/R/F) | Overall (P/R/F) |
|---|---|---|---|---|
| BLSTM w/o Pretrain | 53.1 / 48.3 / 50.6 | 66.9 / 70.0 / 68.4 | 70.0 / 45.7 / 55.3 | 60.6 / 58.6 / 59.6 |
| Full-Transformer w/o Pretrain | 56.8 / 56.0 / 56.4 | 68.5 / 75.6 / 71.9 | 59.6 / 67.4 / 63.3 | 62.8 / 65.7 / 64.2 |
| CT-Transformer w/o Pretrain | 53.3 / 61.8 / 57.2 | 76.2 / 64.3 / 69.7 | 67.5 / 58.7 / 62.8 | 62.9 / 62.9 / 62.9 |
4.2 Training Details
For IWSLT2011, we only have punctuation annotations, so only the punctuation tagging layer in Figure 1 is used for this dataset. The encoder consists of a stack of 6 layers with parallel attention heads; the dimension of the inner feed-forward layer is 2048. Adam with gradient clipping and warm-up is used for optimization. The fixed lengths in CT-Transformer, the low frame rate, and the number of look-ahead words after an end-of-sentence mark are set per dataset. For IWSLT2011, the batch size is 600 for pre-training and 32 for fine-tuning; for the Chinese dataset, the batch size is 600 for both pre-training and fine-tuning. All these hyper-parameters are optimized on the development sets based on accuracy and latency.
[Table 3: Model | Comma | Period | Question | Enum. Comma | Overall | Inference Time]
4.3 Results and Discussions
We evaluate the proposed CT-Transformer model together with two counterparts for punctuation prediction and disfluency detection. BLSTM denotes the model that replaces the CT-Transformer block in Figure 1 with a bidirectional LSTM of hidden size 512 and 6 layers, keeping the model size comparable to that of CT-Transformer. Full-Transformer denotes the model that replaces the CT-Transformer block in Figure 1 with a full-sequence Transformer. The results of these models on the IWSLT2011 test set are reported in the last group of Table 2, where “Overall” denotes the micro-average over all punctuation types. Both Full-Transformer and CT-Transformer outperform BLSTM on overall F, and CT-Transformer achieves a better overall F than Full-Transformer (74.9% versus 73.8%). Removing pre-training degrades the performance of both CT-Transformer and Full-Transformer significantly, and CT-Transformer without pre-training yields a worse F than Full-Transformer (62.9% versus 64.2%). The first group of models and results in Table 2 are cited from previous works: T-LSTM used a uni-directional LSTM model, and T-BRNN-pre used a bidirectional RNN with attention; BLSTM-CRF and Teacher-Ensemble are the best single and ensemble models of prior work, respectively. The previous state-of-the-art model is Self-attention-word-speech, a full-sequence Transformer encoder-decoder model with pre-trained word2vec and speech2vec embedding features. Our proposed CT-Transformer significantly outperforms this previous state of the art (74.9% versus 72.9%).
We also compare CT-Transformer with the two counterparts on the Chinese dataset for both punctuation prediction (Table 3) and disfluency detection (Table 4). First, we observe that CT-Transformer with multi-task learning outperforms CT-Transformer trained only on the punctuation prediction task (58.8% versus 58.4%); hence, we only compare models with multi-task learning in the following experiments. As shown in Table 3, CT-Transformer achieves a significantly better overall F than BLSTM (58.8% versus 53.9%) and is comparable with Full-Transformer. Using an Intel Xeon Platinum 8163 CPU for inference, Full-Transformer is 1.6x faster than BLSTM and CT-Transformer is 1.9x faster, as shown in Table 3. (For the first group of models in Table 2, T-BRNN-pre partitions the input sequence into 200-word slices, which cannot be used in real-time streaming systems due to its high latency; the other works did not report inference time or release source code with which to measure it.) Compared with the latest overlapped-chunk split and merging strategy (chunk size 30, sliding window 15, and min_words_cut 10, as in prior work), the proposed fast decoding with CT-Transformer has lower latency (10 words versus 20 words) and a better overall punctuation prediction F (58.8% versus 57.8%).
Table 4 shows the results of detecting reparandum, interregnum, and either, on the in-house Chinese test set. We observe that it is much easier to detect interregnum than reparandum. For detecting either disfluency type, CT-Transformer achieves a significantly better F than BLSTM (70.5% versus 67.9%), and is comparable with Full-Transformer. Since the previous state-of-the-art model for disfluency detection is a full-sequence Transformer model , these results show that CT-Transformer achieves comparable accuracy to the previous state of the art with lower latency.
Figure 2 shows the histogram of the max punctuation-position change during decoding on the Chinese test set for the three models. The upper limit of the max position change is 9 for CT-Transformer, but up to 63 for BLSTM and up to 42 for Full-Transformer. BLSTM and Full-Transformer both have about 10% cases of 10+ max position change. These results verify that the proposed CT-Transformer can indeed control time delay which is difficult for Full-Transformer.
5 Conclusions

We propose the Controllable Time-delay Transformer (CT-Transformer) to jointly model punctuation prediction and disfluency detection; it facilitates freezing partial outputs with a controllable time delay to fulfill the real-time constraints on partial decoding required by subsequent applications. We further propose a fast decoding strategy that reduces latency while maintaining competitive performance, and explore transfer learning to utilize existing well-formed text. Experimental results demonstrate that CT-Transformer outperforms previous state-of-the-art models on both F-score and latency on the English IWSLT2011 benchmark and an in-house Chinese dataset. Future work includes improving the robustness of our models on ASR transcripts and multilingual transcripts.
-  Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang, “STACL: simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework,” in ACL, 2019, pp. 3025–3036.
-  Jiangyan Yi and Jianhua Tao, “Self-attention based model for punctuation prediction using word and speech embeddings,” in ICASSP, 2019, pp. 7270–7274.
-  Qianqian Dong, Feng Wang, Zhen Yang, Wei Chen, Shuang Xu, and Bo Xu, “Adapting translation models for transcript disfluency detection,” in AAAI, 2019, pp. 6351–6358.
-  Xuancong Wang, Khe Chai Sim, and Hwee Tou Ng, “Combining punctuation and disfluency prediction: An empirical study,” in EMNLP, 2014, pp. 121–130.
-  Don Baron, Elizabeth Shriberg, and Andreas Stolcke, “Automatic punctuation and disfluency detection in multi-party meetings using prosodic and lexical cues,” in ICSLP, 2002.
-  Zhixing Tan, Mingxuan Wang, Jun Xie, Yidong Chen, and Xiaodong Shi, “Deep semantic role labeling with self-attention,” in AAAI, 2018, pp. 4929–4936.
-  Eunah Cho, Jan Niehues, and Alex Waibel, “Segmentation and punctuation prediction in speech language translation using a monolingual translation system,” in IWSLT, 2012, pp. 252–259.
-  Eunah Cho, Jan Niehues, Kevin Kilgour, and Alex Waibel, “Punctuation insertion for real-time spoken language translation,” in IWSLT, 2015.
-  Binh Nguyen, Vu Bao Hung Nguyen, Hien Nguyen, Pham Ngoc Phuong, The-Loc Nguyen, Quoc Truong Do, and Luong Chi Mai, “Fast and accurate capitalization and punctuation for automatic speech recognition using transformer and chunk merging,” CoRR, vol. abs/1908.02404, 2019.
-  Ottokar Tilk and Tanel Alumäe, “Bidirectional recurrent neural network with attention mechanism for punctuation restoration,” in Interspeech, 2016, pp. 3047–3051.
-  Yang Liu, Elizabeth Shriberg, Andreas Stolcke, Dustin Hillard, Mari Ostendorf, and Mary P. Harper, “Enriching speech recognition with automatic detection of sentence boundaries and disfluencies,” IEEE TASLP, vol. 14, no. 5, pp. 1526–1540, 2006.
-  Doug Beeferman, Adam L. Berger, and John D. Lafferty, “Cyberpunc: a lightweight punctuation annotation system for speech,” in ICASSP, 1998, pp. 689–692.
-  Heidi Christensen, Yoshihiko Gotoh, and Steve Renals, “Punctuation annotation using statistical prosody models,” in ITRW, 2001.
-  Nicola Ueffing, Maximilian Bisani, and Paul Vozila, “Improved models for automatic punctuation prediction for spoken and written text,” in INTERSPEECH, 2013, pp. 3097–3101.
-  Piotr Zelasko, Piotr Szymanski, Jan Mizgajski, Adrian Szymczak, Yishay Carmiel, and Najim Dehak, “Punctuation prediction model for conversational speech,” in Interspeech, 2018, pp. 2633–2637.
-  Wei Lu and Hwee Tou Ng, “Better punctuation prediction with dynamic conditional random fields,” in EMNLP, 2010, pp. 177–186.
-  Xiaoyin Che, Cheng Wang, Haojin Yang, and Christoph Meinel, “Punctuation prediction for unsegmented transcript based on word vector,” in LREC, 2016.
-  Ottokar Tilk and Tanel Alumäe, “LSTM for punctuation restoration in speech transcripts,” in INTERSPEECH, 2015, pp. 683–687.
-  Jiangyan Yi, Jianhua Tao, Zhengqi Wen, and Ya Li, “Distilling knowledge from an ensemble of models for punctuation prediction,” in Interspeech, 2017, pp. 2779–2783.
-  Stephan Peitz, Markus Freitag, Arne Mauser, and Hermann Ney, “Modeling punctuation prediction as machine translation,” in IWSLT, 2011, pp. 238–245.
-  Ondrej Klejch, Peter Bell, and Steve Renals, “Punctuated transcription of multi-genre broadcasts using acoustic and lexical approaches,” in SLT, 2016, pp. 433–440.
-  Mari Ostendorf and Sangyun Hahn, “A sequential repetition model for improved disfluency detection,” in INTERSPEECH, 2013, pp. 2624–2628.
-  Julian Hough and David Schlangen, “Recurrent neural networks for incremental disfluency detection,” in INTERSPEECH, 2015, pp. 849–853.
-  Kallirroi Georgila, “Using integer linear programming for detecting speech disfluencies,” in NAACL, 2009, pp. 109–112.
-  Shaolei Wang, Wanxiang Che, Qi Liu, Pengda Qin, Ting Liu, and William Yang Wang, “Multi-task self-supervised learning for disfluency detection,” CoRR, vol. abs/1908.05378, 2019.
-  Mark Johnson and Eugene Charniak, “A tag-based noisy-channel model of speech repairs,” in ACL, 2004, pp. 33–39.
-  Paria Jamshid Lou and Mark Johnson, “Disfluency detection using a noisy channel model and a deep neural language model,” in ACL, 2017, pp. 547–553.
-  Mohammad Sadegh Rasooli and Joel R. Tetreault, “Joint parsing and disfluency detection in linear time,” in EMNLP, 2013, pp. 124–129.
-  Masashi Yoshikawa, Hiroyuki Shindo, and Yuji Matsumoto, “Joint transition-based dependency parsing and disfluency detection for automatic speech recognition texts,” in EMNLP, 2016, pp. 1036–1041.
-  Graham Neubig, Yuya Akita, Shinsuke Mori, and Tatsuya Kawahara, “A monotonic statistical machine translation approach to speaking style transformation,” Computer Speech & Language, vol. 26, no. 5, pp. 349–370, 2012.
-  Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang, “Disan: Directional self-attention network for rnn/cnn-free language understanding,” in AAAI, 2018, pp. 5446–5455.
-  Kaitao Song, Xu Tan, Furong Peng, and Jianfeng Lu, “Hybrid self-attention network for machine translation,” CoRR, vol. abs/1811.00253, 2018.
-  Lance A. Ramshaw and Mitch Marcus, “Text chunking using transformation-based learning,” in VLC@ACL, 1995.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in NIPS, 2017, pp. 5998–6008.
-  Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.