Replacing Human Audio with Synthetic Audio for On-device Unspoken Punctuation Prediction

10/20/2020 ∙ by Daria Soboleva, et al. ∙ 30

We present a novel multi-modal unspoken punctuation prediction system for the English language which combines acoustic and text features. We demonstrate, for the first time, that by relying exclusively on synthetic data generated using a prosody-aware text-to-speech system, we can outperform a model trained with expensive human audio recordings on the unspoken punctuation prediction problem. Our model architecture is well suited for on-device use. This is achieved by leveraging hash-based embeddings of automatic speech recognition text output in conjunction with acoustic features as input to a quasi-recurrent neural network, keeping the model size small and latency low.






1 Introduction

Recent advances in on-device speech recognition technology [1] enable a range of applications where speech recognition is central to the user experience, such as voice dictation or live captioning [2]. However, a severe limitation of many Automatic Speech Recognition (ASR) systems is the lack of high quality punctuation. ASR systems usually do not predict any punctuation symbols which have not been explicitly spoken out, which makes ASR-transcribed text hard to read and understand for users.

In this paper, we address the issue by introducing an unspoken punctuation prediction system for refining ASR output. For example, given the input “hey Anna how are you” together with roughly corresponding audio segments, our system converts it to “Hey, Anna! How are you?” automatically. This is in contrast to spoken punctuation prediction, where relevant phrases in explicit spoken inputs, such as “hey comma Anna exclamation mark”, are converted into punctuation symbols “Hey, Anna!”.

Several methods for unspoken punctuation prediction have been proposed previously. These can be categorized based on input features: either relying on acoustic (prosodic) features [3], text features [4, 5] or on a multi-modal approach combining both [6].

Approaches relying only on text features suffer from lack of quality, especially for utterances with ambiguous punctuation which heavily depend on prosodic cues. Approaches that rely on audio generally use expensively annotated human audio recordings. Maintaining data quality across numerous speakers when collecting large amounts of human audio is often very costly and presents a roadblock for training larger models — even more so as punctuation marks suffer from semantic drift over time [7, 8].

In this paper, we combine text and acoustic features. To mitigate the scarcity of human recordings, we propose using synthetic audio, which allows us to obtain larger datasets and potentially train models for domains where no human audio is currently available, but a text-only dataset exists. Given the recent success of using text-to-speech (TTS) models to train ASR systems [9, 10, 11, 12], we explore synthetic audio generation approaches for punctuation prediction using prosody-aware Tacotron TTS models [13].

First, we show that quality can be matched using only a fraction of the expensively collected data. Second, we show that models trained on a dataset whose audio is entirely synthesized using multiple TTS speakers outperform models trained on the original dataset.

Our contributions are as follows:

  • Introduce a novel multi-modal acoustic neural network architecture with low memory and latency footprints suitable for on-device use.

  • Achieve superior performance of acoustic and text based models compared to the text-only baseline with only a small increase in model parameter size.

  • Outperform models trained on expensively collected human recordings and annotations while using only TTS-generated audio samples.

To the best of our knowledge, this is the first approach that successfully replaces human audio with synthetic audio for punctuation prediction.

2 Model Architecture

Figure 1: Our multi-modal architecture on an example utterance. We concatenate text and acoustic features (“+”) and pass them through a bidirectional QRNN layer, followed by a fully-connected batch-normalized layer with softmax activation, classifying which punctuation symbol to append.

The model uses both the input audio and the output of an ASR system having processed that audio. The ASR output consists of the utterance transcription, together with approximate times in the utterance when a new token is emitted.

The boundaries only roughly represent transitions from one token in the hypothesis to the next one and cannot be used to directly estimate pause lengths, because common ASR systems (e.g. sequence-to-sequence [1] or connectionist temporal classification [14]) do not provide exact token alignments, or the alignments they do provide are not necessarily accurate at inference time. In our work, we use the on-device ASR system described in [1].

2.1 Text Features using Hash-based Embeddings

Representations for individual tokens are determined using a “hash-based embedding”, computed on the fly through a locality-sensitive string hashing operation that relies on the byte-level representation of the token, transforming it into a fixed dimensional representation of a requested size [15].

One of the main motivations for using hash-based embeddings is reducing the model size. Not having to store a learned matrix of dense embeddings for a vocabulary is advantageous for on-device use cases. For a relatively small vocabulary size of  tokens, even assuming a quantized -bit integer representation, the resulting matrix of commonly used -dimensional embeddings would increase the model size by roughly . Our models use roughly , meaning this would triple their size.
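The idea of computing a token representation on the fly, without a stored embedding table, can be illustrated with a minimal sketch. This is not the projection from [15]: the function name `hash_embedding`, the use of SHA-256, and the ternary output values are assumptions chosen for brevity, and a cryptographic hash is deterministic but not locality-sensitive.

```python
import hashlib

import numpy as np


def hash_embedding(token: str, dim: int = 16) -> np.ndarray:
    """Map a token to a fixed-size vector with entries in {-1, 0, +1}.

    Illustrative stand-in only (an assumption, not the method of [15]):
    SHA-256 over the token's UTF-8 bytes is deterministic but NOT
    locality-sensitive.
    """
    mapping = {0b00: 0.0, 0b01: 1.0, 0b10: -1.0, 0b11: 0.0}
    values = []
    counter = 0
    while len(values) < dim:
        # Re-hash with a counter byte whenever more values are needed.
        digest = hashlib.sha256(token.encode("utf-8") + bytes([counter])).digest()
        for byte in digest:
            # Consume each byte two bits at a time: four values per byte.
            for shift in (0, 2, 4, 6):
                values.append(mapping[(byte >> shift) & 0b11])
        counter += 1
    return np.array(values[:dim], dtype=np.float32)
```

Because the vector is a pure function of the token's bytes, nothing has to be stored per vocabulary entry, and unseen tokens are handled the same way as known ones.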

2.2 Pitch Estimation for Acoustic Features

Representations for audio segments corresponding to tokens are computed using pitch estimation [3], specifically using the YIN algorithm [16] together with a voiced or unvoiced estimation as described in [17]. This method provides pitch estimates (frequencies in ) for every of input audio for voiced segments. Assuming a sample rate of , we estimate  pitch value per  audio samples. The estimation method runs on the entire audio corresponding to the input utterance and the estimated pitch values are aligned using time-boundaries to the corresponding input tokens. Although ASR timings do not capture pauses between tokens explicitly, the estimation method produces values for unvoiced segments.

The token-aligned pitch values are then used to compute acoustic features per token. We used scalar statistics as acoustic features: , , , and (absolute difference between and ). The vector of computed acoustic scalar statistics is then concatenated with text embeddings and used as input features (Fig. 1).
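The alignment of frame-level pitch values to tokens and the per-token statistics can be sketched as follows. The choice of mean, minimum, maximum, and range as the four statistics is an assumption made for illustration, as are the function name `token_pitch_features`, the convention that unvoiced frames carry a pitch of 0.0, and the use of each token's emission time as the start of its segment.

```python
import numpy as np


def token_pitch_features(pitch_hz, frame_times, token_starts, utterance_end):
    """Return one row of [mean, min, max, range] per token.

    pitch_hz:      per-frame pitch estimates (0.0 for unvoiced frames, assumed).
    frame_times:   time in seconds of each pitch frame.
    token_starts:  approximate ASR emission time of each token, ascending.
    utterance_end: end time of the utterance in seconds.
    """
    pitch_hz = np.asarray(pitch_hz, dtype=np.float64)
    frame_times = np.asarray(frame_times, dtype=np.float64)
    bounds = list(token_starts) + [utterance_end]
    features = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        # Frames whose timestamp falls inside this token's time span.
        segment = pitch_hz[(frame_times >= start) & (frame_times < end)]
        if segment.size == 0:
            features.append([0.0, 0.0, 0.0, 0.0])
            continue
        lo, hi = segment.min(), segment.max()
        features.append([segment.mean(), lo, hi, abs(hi - lo)])
    return np.array(features)
```

Each row would then be concatenated with the text embedding of the same token before the projection layer.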


2.3 Quasi-Recurrent Neural Network Layers

Before passing input features into a time-dependent layer, we project them down to a -dimensional vector using a fully-connected batch-normalized layer with ReLU activation.

The sequence of projected feature vectors is then passed through a bidirectional quasi-recurrent neural network (QRNN) layer [18]. We regularize the hidden units of this layer with zoneout [19], a dropout-like technique that randomly preserves hidden activations. The convolutional components of the QRNN run independently for each time-step, and a time-dependent pooling layer operates across channels. In contrast to regular RNN layers, all computations except for the lightweight pooling do not depend on previous time-steps, enabling faster inference.
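The split between parallel convolutions and lightweight sequential pooling can be made concrete with a small NumPy sketch of a QRNN layer with fo-pooling, following the formulation in [18]. The function names, weight shapes, and the omission of zoneout and biases are simplifications assumed here, not details of the model described above.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def qrnn_forward(x, w_z, w_f, w_o):
    """x: (T, d_in); each weight: (k, d_in, d_hidden). Returns h: (T, d_hidden)."""
    k, d_in, d_hidden = w_z.shape
    seq_len = x.shape[0]
    # Left-pad so the convolution at step t only sees steps t-k+1 .. t.
    padded = np.concatenate([np.zeros((k - 1, d_in)), x], axis=0)
    windows = np.stack([padded[t:t + k] for t in range(seq_len)])  # (T, k, d_in)

    # Parallel part: candidate, forget gate and output gate for all steps at once.
    z = np.tanh(np.einsum("tki,kih->th", windows, w_z))
    f = sigmoid(np.einsum("tki,kih->th", windows, w_f))
    o = sigmoid(np.einsum("tki,kih->th", windows, w_o))

    # Sequential part: element-wise fo-pooling, the only step-to-step dependency.
    c = np.zeros(d_hidden)
    h = np.empty((seq_len, d_hidden))
    for t in range(seq_len):
        c = f[t] * c + (1.0 - f[t]) * z[t]
        h[t] = o[t] * c
    return h


def bidirectional_qrnn(x, fwd_weights, bwd_weights):
    """Run one QRNN left-to-right and one right-to-left, then concatenate."""
    h_fwd = qrnn_forward(x, *fwd_weights)
    h_bwd = qrnn_forward(x[::-1], *bwd_weights)[::-1]
    return np.concatenate([h_fwd, h_bwd], axis=-1)
```

Only the short loop depends on earlier time-steps, and it contains no matrix multiplications, which is what makes the layer cheap at inference time.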

Finally, the per-token outputs from the bidirectional QRNN layer are concatenated and passed into a softmax classification layer. At training time, we compute a cross-entropy loss with regularization over model weights using the output probabilities for the five classes: Period, Question Mark, Exclamation Mark, Comma and None. Each class, except for None, is mapped to the corresponding punctuation symbol appended to the input token.
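The final step, turning per-token class scores into punctuated text, might look like the sketch below. The class ordering in `CLASSES`, the helper names, and the fact that capitalization is left untouched are assumptions for illustration only.

```python
import numpy as np

# Class order is an assumption; the source only lists the five class names.
CLASSES = ["None", "Period", "Question Mark", "Exclamation Mark", "Comma"]
SYMBOLS = {
    "None": "",
    "Period": ".",
    "Question Mark": "?",
    "Exclamation Mark": "!",
    "Comma": ",",
}


def softmax(logits):
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def punctuate(tokens, logits):
    """Append the most probable punctuation symbol to each token."""
    probs = softmax(np.asarray(logits, dtype=np.float64))
    labels = probs.argmax(axis=-1)
    return " ".join(tok + SYMBOLS[CLASSES[i]] for tok, i in zip(tokens, labels))
```

With scores that favor Comma, Exclamation Mark, None, None, and Question Mark for the tokens of “hey Anna how are you”, this produces “hey, Anna! how are you?”.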

This approach allows us to infer punctuation for sequences up to tokens long in less than .

3 Methods & Experiments

3.1 Data

Table 1: Number of samples in the training, validation, and test splits for the LibriTTS dataset. For the test split, frequency counts of labels are also provided.
Columns: Train (samples) | Validation (samples) | Test (samples, tokens, punct., EoS, Period, Question Mark, Exclamation Mark, Comma)
Table 2: Models are evaluated on LibriTTS test set with human audio. We report punctuation token accuracy for the dataset, F1-scores for individual punctuation symbols and for end-of-sentence (EoS). Human / TTS indicates the percentage of text samples augmented with each audio type. Metrics are given in percentages and averaged over runs. Best results are in bold.
Columns: Human / TTS | Punctuation Accuracy | F1 (EoS, Period, Question Mark, Exclamation Mark, Comma)

We use the publicly available LibriTTS corpus introduced in [20], which consists of  hours of English human speech, downsampled to . The dataset is derived from LibriSpeech [21], with the differences that it preserves original text including punctuation, speech is split at sentence boundaries, and utterances with significant background noise are removed. In order to replace human audio with synthetic audio, we augmented the training samples with the corresponding synthetically generated audio. The audio is synthesized using a Tacotron model, trained on a multi-speaker dataset as described in [13].

3.2 Data Augmentation with TTS

One of the natural advantages of synthetic TTS audio is that we can generate a large amount of it automatically given a text corpus and a TTS model. Hence, we propose a data augmentation technique: given a text sample, generate multiple synthetic audio samples with voices of different speakers. By doing so, we introduce additional audio samples improving the training data representation. With this approach, we can augment training data up to N-fold, N being the maximum number of speakers available for a given TTS model.
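A minimal sketch of this augmentation loop is shown below. The `synthesize` callable stands in for a TTS model and is hypothetical, as are the function name `augment_with_tts` and the dictionary layout of the returned samples.

```python
import random


def augment_with_tts(texts, speakers, voices_per_text, synthesize, seed=0):
    """Pair every text sample with audio from several distinct TTS speakers.

    synthesize(text, speaker) is a hypothetical stand-in for a TTS model call.
    """
    rng = random.Random(seed)
    samples = []
    for text in texts:
        # Sample speakers without replacement so the voices differ per text.
        for speaker in rng.sample(speakers, voices_per_text):
            samples.append(
                {"text": text, "speaker": speaker, "audio": synthesize(text, speaker)}
            )
    return samples
```

Setting `voices_per_text` to the number of available speakers yields the largest augmentation factor this scheme allows.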

3.3 Preprocessing Details

3.3.1 Audio — Text-to-Speech Generation

Using Tacotron, we generate audio utterances corresponding to the preprocessed text samples. We experimented with  TTS English speakers in total. For model training and validation, we randomly split speakers into  training and  validation samples. When utilizing human LibriTTS utterances, we keep the original training, validation, and test split [20] (Table 1).

3.3.2 Text — Punctuation-aware Preprocessing

The raw text is first tokenized, separating whitespace, punctuation and word tokens based on Unicode word boundaries. Subsequently, the tokens are organized into sentences based on end-of-sentence punctuation (Period, Question Mark, Exclamation Mark) to form samples. A sample can contain multiple sentences as long as the total number of tokens in it is at least and at most , and contains at least  punctuation mark.
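The tokenize, split, and pack steps can be sketched as follows. The regular expression is only a rough approximation of Unicode word boundaries, the greedy packing strategy is an assumption, punctuation tokens are counted toward the sample length (also an assumption), and `min_tokens` and `max_tokens` are hypothetical parameters whose actual values are not given here.

```python
import re

EOS = {".", "?", "!"}
PUNCT = EOS | {","}


def tokenize(text):
    # Word tokens (optionally with an apostrophe part) or single punctuation marks.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


def split_sentences(tokens):
    """Close a sentence after every end-of-sentence punctuation token."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in EOS:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def build_samples(sentences, min_tokens, max_tokens):
    """Greedily pack consecutive sentences into samples within the length bounds."""
    samples, current = [], []
    for sent in sentences:
        if current and len(current) + len(sent) > max_tokens:
            samples.append(current)
            current = []
        current = current + sent
    if current:
        samples.append(current)
    # Keep samples within bounds that contain at least one punctuation mark.
    return [
        s for s in samples
        if min_tokens <= len(s) <= max_tokens and any(t in PUNCT for t in s)
    ]
```

For example, `build_samples(split_sentences(tokenize("Hey, Anna! How are you? Fine.")), 3, 8)` keeps one sample of eight tokens and drops the two-token remainder.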

3.3.3 Audio — Speaker and ASR Misrecognition Filtering

Speech recognition systems are not perfect, and misrecognitions need to be dealt with. For training, we drop samples for which the number of ASR-recognized tokens is not equal to the number of tokens in the ground truth transcription, as for these it would be non-trivial to reconstruct the correct punctuation. During this stage, we filter out roughly samples.
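The filtering rule can be sketched as a positional label transfer that gives up whenever the token counts disagree. The function names and the convention of attaching each punctuation mark to the preceding word are assumptions for illustration.

```python
SYMBOL_TO_CLASS = {
    ".": "Period",
    "?": "Question Mark",
    "!": "Exclamation Mark",
    ",": "Comma",
}


def words_and_labels(reference_tokens):
    """Strip punctuation tokens and turn them into labels on the preceding word."""
    words, labels = [], []
    for tok in reference_tokens:
        if tok in SYMBOL_TO_CLASS:
            if words:
                labels[-1] = SYMBOL_TO_CLASS[tok]
        else:
            words.append(tok)
            labels.append("None")
    return words, labels


def label_asr_tokens(asr_tokens, reference_tokens):
    """Return (asr_token, label) pairs, or None if the sample must be dropped."""
    words, labels = words_and_labels(reference_tokens)
    if len(asr_tokens) != len(words):
        return None  # token counts differ, so labels cannot be transferred by position
    return list(zip(asr_tokens, labels))
```

When the counts match, labels are transferred by position even if an individual ASR token differs from the reference word.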

3.4 Training Details

3.4.1 Weighted Cross-Entropy Loss

The distribution of punctuation symbols in text is highly nonuniform (Table 1). To mitigate this, we introduce a class-weighted cross-entropy loss, with weights equal to the inverse of punctuation mark occurrence frequency for each class, calculated on the training split [22]. This modification of the loss function ensures that classes with extensive data available (such as Period or Comma) do not overly dominate the training, which would otherwise result in poor performance on classes with a small amount of data (such as Question Mark or Exclamation Mark).
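A minimal sketch of inverse-frequency class weighting is given below. Averaging the weighted per-token losses over the batch is an assumption, as are the function names; the regularization term mentioned earlier is omitted.

```python
import numpy as np


def inverse_frequency_weights(label_counts):
    """Weight each class by the inverse of its relative frequency."""
    counts = np.asarray(label_counts, dtype=np.float64)
    freq = counts / counts.sum()
    return 1.0 / freq


def weighted_cross_entropy(probs, labels, class_weights):
    """probs: (N, C) predicted probabilities; labels: (N,) integer class ids."""
    probs = np.asarray(probs, dtype=np.float64)
    labels = np.asarray(labels)
    class_weights = np.asarray(class_weights, dtype=np.float64)
    # Probability assigned to the true class of each token.
    picked = probs[np.arange(len(labels)), labels]
    per_token = -class_weights[labels] * np.log(picked)
    return per_token.mean()
```

With class counts of 80 and 20, the weights come out as 1.25 and 5.0, so an error on the rarer class costs four times as much as the same error on the frequent one.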

3.4.2 Model Hyperparameters

The model configurations, for all the results reported in the following section, use a hash-based embedding dimension of , QRNN convolution kernel width of , hidden state size of and zone-out probability of .

We train all models using the Adam optimizer [23] with an initial learning rate of  which is exponentially decayed by  every  steps. All models are trained for a total of  steps with batch size . The total number of trainable parameters for text-only models is , and it increases for text-acoustic models by only to .
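The learning-rate schedule described above follows the usual exponential-decay form. The sketch below uses hypothetical parameter names, and whether the decay is applied in discrete steps (`staircase=True`) or continuously is not stated in the text, so both variants are shown.

```python
def exponential_decay(initial_lr, decay_rate, decay_steps, step, staircase=True):
    """Learning rate after `step` updates, decayed by `decay_rate` every `decay_steps`."""
    # Staircase: decay only at multiples of decay_steps; otherwise decay smoothly.
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_lr * decay_rate ** exponent
```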

4 Results

4.1 Comparing Acoustic & Text-only Models

To motivate the use of acoustic features, we first compare the text-only model ( Human, TTS) with an acoustic model that utilizes only human audio ( Human). The results in Table 2 show that the acoustic model outperforms the text-only model in punctuation accuracy and across most individual punctuation mark F1-scores. Comma prediction quality is significantly better, which demonstrates the capability of the model to implicitly infer pauses, despite explicit segmentation from ASR not being present. There is some quality degradation for Exclamation Mark, which the model mostly confused with Period. We address this issue in followup experiments.

4.2 Human & TTS Audio Combinations

Given the amount of text data available nowadays, we can train large models for punctuation prediction. However, with the lack of human audio, we cannot utilize acoustic features. To mitigate this issue, we propose partially replacing human audio with synthetic TTS audio to find out how much human audio is necessary to maintain the same model performance. To obtain a mix with the desired percentage of each audio type, during preprocessing, we perform sampling from the appropriate Bernoulli distribution for every text instance to decide which audio type to use for that sample.
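The per-sample choice of audio type amounts to one Bernoulli draw per text instance. In this sketch, `human_fraction` and the function name are hypothetical; the realized mix only approximates the requested percentage because each draw is independent.

```python
import random


def assign_audio_types(num_samples, human_fraction, seed=0):
    """Draw one Bernoulli(human_fraction) sample per text instance."""
    rng = random.Random(seed)
    return [
        "human" if rng.random() < human_fraction else "tts"
        for _ in range(num_samples)
    ]
```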

In Table 2, we can see that replacing of human audio with TTS audio suffers from lower quality Comma predictions. However, by adding just human audio, we are able to not only match the performance, but moreover improve quality for Question Mark. Hence, one could have annotated only of the LibriTTS corpus with human audio and utilized TTS audio for the remainder of the data, without sacrificing punctuation model quality.

Models with human audio show even better results for almost all punctuation marks. Note that mixing of different audio types produces consistently better punctuation models. We believe this happens because of a greater data diversity in the audio space — e.g. classes such as Exclamation mark or Question mark are often mistaken for Period because of pronunciation. Using different types of audio or different speakers increases the chance of obtaining more general training data, which improves model quality. In our experiments, mixes with , , and human audio do not yield further improvements.

4.3 Data Augmentation with Text-to-Speech Audio

To reduce the reliance on human recorded and annotated audio, we synthetically generate all audio in the dataset. For each sample we select at random out of the TTS-speakers to generate two audio variants of the same text sample. This enables us to obtain a larger dataset compared to the original one, with greater speaker variance.

Table 2 shows that models trained on this synthetic dataset outperform models trained on human audio in punctuation accuracy and F1-score for Exclamation Mark. When using human audio, this class is commonly mistaken for Period, and the quality strictly depends on how human speakers pronounce sentences with it.

By leveraging a multi-speaker prosody-aware TTS model, the synthetic dataset proves to better represent acoustic changes relevant for punctuation prediction. The model training incorporates more variability in the feature space when using the synthetic dataset, compared to the original LibriTTS dataset, resulting in better performance. In our experiments, sampling more than speakers per text sample does not yield additional improvements.

5 Discussion and Conclusion

In this paper, we propose a multi-modal punctuation prediction system that utilizes both text and acoustic features, while maintaining a practical setup that can run on-device, with low memory and latency footprints. According to our experiments on LibriTTS, only text data needs to be paired with human recordings and the rest can be synthesized by a TTS model without any quality loss. Additionally, models trained using our TTS data augmentation with multiple speakers outperform models trained on human audio. This allows us to alleviate the inherent data scarcity limitation of acoustic-based systems and enable training punctuation models for any language for which a big-enough text corpus and a TTS model are available. In future work, we seek to explore training TTS models on in-domain data and further improving quality for Comma and other individual classes.


  • [1] Y. He, T. N. Sainath, R. Prabhavalkar, et al., “Streaming End-to-end Speech Recognition For Mobile Devices,” 2018.
  • [2] M. T. Ramanovich and N. Bar, “On-Device Captioning with Live Caption,” Blog post, October 2019.
  • [3] F. Batista, H. Moniz, I. Trancoso, and N. Mamede, “Bilingual experiments on automatic recovery of capitalization and punctuation of automatic speech transcripts,” IEEE transactions on audio, speech, and language processing, vol. 20, no. 2, pp. 474–485, 2012.
  • [4] D. Beeferman, A. Berger, and J. Lafferty, “Cyberpunc: a lightweight punctuation annotation system for speech,” in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP ’98 (Cat. No.98CH36181), 1998, vol. 2, pp. 689–692 vol.2.
  • [5] O. Tilk and T. Alumäe, “Bidirectional Recurrent Neural Network with Attention Mechanism for Punctuation Restoration,” in Interspeech, 2016, pp. 3047–3051.
  • [6] O. Klejch, P. Bell, and S. Renals, “Sequence-to-sequence models for punctuated transcription combining lexical and acoustic features,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 5700–5704.
  • [7] P. Bruthiaux, “The Rise and Fall of the Semicolon: English Punctuation Theory and English Teaching Practice,” Applied Linguistics, vol. 16, no. 1, pp. 1–14, 03 1995.
  • [8] P. Bruthiaux, “Knowing when to stop: Investigating the nature of punctuation,” Language & Communication, vol. 13, no. 1, pp. 27 – 43, 1993.
  • [9] Y. Ren, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “Almost unsupervised text to speech and automatic speech recognition,” arXiv preprint arXiv:1905.06791, 2019.
  • [10] A. Rosenberg, Y. Zhang, B. Ramabhadran, Y. Jia, P. Moreno, Y. Wu, and Z. Wu, “Speech recognition with augmented synthesized speech,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 996–1002.
  • [11] G. Wang, A. Rosenberg, Z. Chen, Y. Zhang, B. Ramabhadran, Y. Wu, and P. Moreno, “Improving speech recognition using consistent predictions on synthesized speech,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7029–7033.
  • [12] A. Tjandra, S. Sakti, and S. Nakamura, “Listening while speaking: Speech chain by deep learning,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2017, pp. 301–308.
  • [13] RJ Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. J. Weiss, R. Clark, and R. A. Saurous, “Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron,” 2018.
  • [14] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in International conference on machine learning, 2014, pp. 1764–1772.
  • [15] P. Kaliamoorthi, S. Ravi, and Z. Kozareva, “PRADO: Projection attention networks for document classification on-device,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Hong Kong, China, Nov. 2019.
  • [16] A. D. Cheveigné and H. Kawahara, “YIN, a fundamental frequency estimator for speech and music,” The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917–1930, 2002.
  • [17] D. Talkin and B. W. Kleijn, “A robust algorithm for pitch tracking (RAPT),” Speech coding and synthesis, vol. 495, pp. 518, 1995.
  • [18] J. Bradbury, S. Merity, C. Xiong, and R. Socher, “Quasi-Recurrent Neural Networks,” 2016.
  • [19] D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. R. Ke, A. Goyal, Y. Bengio, A. Courville, and C. Pal, “Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations,” 2016.
  • [20] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. C., and Y. Wu, “LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech,” 2019.
  • [21] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.
  • [22] Y. Cui, M. Jia, T.-Y. Lin, Y. Song, and S. Belongie, “Class-balanced loss based on effective number of samples,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [23] D. P. Kingma and J. L. Ba, “Adam: A Method for Stochastic Optimization,” International Conference on Learning Representations (ICLR), Dec. 2015.
  • [24] B. Gfeller, C. Frank, D. Roblek, M. Sharifi, M. Tagliasacchi, and M. Velimirović, “SPICE: Self-supervised Pitch Estimation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1118–1128, 2020.

Appendix A Enhanced Acoustic Features

Table 3: Extraction of different acoustic features. Training on TTS generated audio. Evaluation on LibriTTS test set with human audio. We report punctuation token accuracy for the dataset, F1-scores for each individual punctuation symbol and for end-of-sentence (EoS), where Period, Exclamation Mark, and Question Mark are considered interchangeable. Metrics are given in percentages and averaged over runs. Best results are in bold.
Columns: Acoustic Feature | Punctuation Accuracy | F1 (EoS, Period, Question Mark, Exclamation Mark, Comma)
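The reported metrics can be sketched as below. Treating the positive class as a set of labels lets the same function cover both single symbols and EoS, where the three sentence-final marks count as interchangeable. The function names and the definition of token accuracy as the share of tokens with the correct class are assumptions for illustration.

```python
EOS_CLASSES = {"Period", "Question Mark", "Exclamation Mark"}


def f1_score(true_labels, pred_labels, positive):
    """F1 where any label in the `positive` set counts as the positive class."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(t in positive and p in positive for t, p in pairs)
    fp = sum(t not in positive and p in positive for t, p in pairs)
    fn = sum(t in positive and p not in positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def token_accuracy(true_labels, pred_labels):
    """Share of tokens whose predicted class equals the reference class."""
    correct = sum(t == p for t, p in zip(true_labels, pred_labels))
    return correct / len(true_labels)
```

Under this definition, a Question Mark predicted as Period lowers the Period and Question Mark scores but still counts as a correct end-of-sentence prediction.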

To improve the quality of our best models trained on TTS synthetic audio (Table 2), we experimented with different types of acoustic features.

First, we assess a different pitch estimation technique which relies on a pre-trained version of the SPICE model described in [24]. The model outputs relative normalized pitch changes together with an uncertainty value for every of audio (every -samples for audio). We only rely on the normalized pitch change estimates for experiments outlined in this paper. The pre-trained SPICE model size is roughly , which increases the size of the on-device stack compared to the YIN-based estimation [16] used previously.

Another type of acoustic feature we extract from audio is -dimensional log-mel filterbank energies. Log-mel features are commonly used in speech processing systems [1, 13]. Moreover, they capture a richer voice signal representation compared to pitch values. Thus, we propose extracting more high-level features by adding another convolution layer applied on the last dimension of our log-mel features. Based on our experiments, convolutions with smaller kernels tend to work better, thus we present results with kernel size equal to . By adding this convolution layer, we increase the model size only by approximately to .

In Table 3, we can see that SPICE model-based estimation of pitch (Pitch-SPICE) does not yield an improvement over YIN estimation (Pitch-YIN). We believe that the quality of Pitch-SPICE models can be further improved by retraining the SPICE model on an in-domain audio corpus.

Log-mel features, however, allow us to significantly improve punctuation accuracy. We also observed a slight quality loss for individual punctuation symbols. We believe this happens because the model tends to predict more punctuation symbols than the ground truth labels contain.

In many of the failure cases, we observed the inherent ambiguity of punctuation prediction. For example, the sentence “hey Anna” can be punctuated in several ways, such as “hey, Anna!” or “hey Anna!”. In our current implementation, we do not distinguish such ambiguous cases from clear failures. In future work, we would like to enhance methods for result interpretation as well as explore further representations of audio.