
Representation Learning for Discovering Phonemic Tone Contours

by   Bai Li, et al.

Tone is a prosodic feature used to distinguish words in many languages, some of which are endangered and scarcely documented. In this work, we use unsupervised representation learning to identify probable clusters of syllables that share the same phonemic tone. Our method extracts the pitch for each syllable, then trains a convolutional autoencoder to learn a low dimensional representation for each contour. We then apply the mean shift algorithm to cluster tones in high-density regions of the latent space. Furthermore, by feeding the centers of each cluster into the decoder, we produce a prototypical contour that represents each cluster. We apply this method to spoken multi-syllable words in Mandarin Chinese and Cantonese and evaluate how closely our clusters match the ground truth tone categories. Finally, we discuss some difficulties with our approach, including contextual tone variation and allophony effects.



1 Introduction

Tonal languages use pitch to distinguish different words; for example, yi in Mandarin may mean ‘one’, ‘to move’, ‘already’, or ‘art’, depending on the pitch contour. Of over 6000 languages in the world, it is estimated that as many as 60-70% are tonal [11, 21]. A few of these are national languages (e.g., Mandarin Chinese, Vietnamese, and Thai), but many tonal languages have a small number of speakers and are scarcely documented. There is a limited availability of trained linguists to perform language documentation before these languages become extinct, hence the need for better tools to assist linguists in these tasks.

One of the first tasks during the description of an unfamiliar language is determining its phonemic inventory: what are the consonants, vowels, and tones of the language, and which pairs of phonemes are contrastive? Tone presents a unique challenge because unlike consonants and vowels, which can be identified in isolation, tones do not have a fixed pitch, and vary by speaker and situation. Since tone data is subject to interpretation, different linguists may produce different descriptions of the tone system of the same language [21].

In this work, we present a model to automatically infer phonemic tone categories of a tonal language. We use an unsupervised representation learning and clustering approach, which requires only a set of spoken words in the target language, and produces clusters of syllables that probably have the same tone. We apply our method on Mandarin Chinese and Cantonese datasets, for which the ground truth annotation is used for evaluation. Our method does not make any language-specific assumptions, so it may be applied to low-resource languages whose phonemic inventories are not already established.

1.1 Tone in Mandarin and Cantonese

Figure 1: Pitch contours for the four Mandarin tones and six Cantonese tones in isolation, produced by native speakers. Figure adapted from [5].

Mandarin Chinese (1.1 billion speakers) and Cantonese (74 million speakers) are two tonal languages in the Sinitic family [11]. Mandarin has four lexical tones: high (55), rising (25), low-dipping (214), and falling (51). (The numbers are Chao tone numerals, where 1 is the lowest and 5 is the highest pitch.) The third tone sometimes undergoes sandhi, addressed in Section 3. We exclude a fifth, neutral tone, which can only occur in word-final positions and has no fixed pitch.

Cantonese has six lexical tones: high-level (55), mid-rising (25), mid-level (33), low-falling (21), low-rising (23), and low-level (22). Some descriptions of Cantonese include nine tones, of which three are checked tones that are flat, shorter in duration, and only occur on syllables ending in /p/, /t/, or /k/. Since each of the checked tones is in complementary distribution with an unchecked tone, we adopt the simpler six-tone model that treats the checked tones as variants of the high, mid, and low level tones. Contours for the lexical tones in both languages are shown in Figure 1.

2 Related Work

Many low-resource languages lack sufficient transcribed data for supervised speech processing, so unsupervised models for speech processing are an emerging area of research. The Zerospeech 2015 and 2017 challenges featured unsupervised learning of contrasting phonemes in English and Xitsonga, evaluated by an ABX phoneme discrimination task [19]. One successful approach used denoising and correspondence autoencoders to learn a representation that avoided capturing noise and irrelevant inter-speaker variation [16]. Deep LSTMs for segmenting and clustering phonemes in speech have also been explored in [13] and [12].

In Mandarin Chinese, deep neural networks have been successful for tone classification in isolated syllables [2] as well as in continuous speech [18, 17]. Both of these models found that Mel-frequency cepstral coefficients (MFCCs) outperformed pitch contour features, despite the fact that MFCC features do not contain pitch information. In Cantonese, support vector machines (SVMs) have been applied to classify tones in continuous speech, using pitch contours as input [15].

Unsupervised learning of tones remains largely unexplored. Levow [10] performed unsupervised and semi-supervised tone clustering in Mandarin, using average pitch and slope as features, and k-means and asymmetric k-lines for clustering. Graph-based community detection techniques have been applied to group n-grams of contiguous contours into clusters in Mandarin [22]. Our work appears to be the first model to use unsupervised deep neural networks for phonemic tone clustering.

3 Data and Preprocessing

Figure 2: Diagram of our model architecture, consisting of a convolutional autoencoder to learn a latent representation for each pitch contour, and mean shift clustering to identify groups of similar tones.

We use data from Mandarin Chinese and Cantonese. For each language, the data consists of a list of spoken words, recorded by the same speaker. The Mandarin dataset is from a female speaker and is provided by Shtooka (we use the cmn-caen-tan dataset), and the Cantonese dataset is from a male speaker and is downloaded from Forvo, an online crowd-sourced pronunciation dictionary. We require all samples within each language to be from the same speaker to avoid the difficulties associated with channel effects and inter-speaker variation. We randomly sample 400 words from each language, most of which are between 2 and 4 syllables long; to reduce the prosody effects associated with longer utterances, we exclude words longer than 4 syllables.

We extract ground-truth tones for evaluation purposes. In Mandarin, the tones are extracted from the pinyin transcription; in Cantonese, we reference the character entries on Wiktionary to retrieve the romanized pronunciation and tones. For Mandarin, we correct for third-tone sandhi (a phonological rule where a pair of consecutive third tones is always realized as a second tone followed by a third tone). We also exclude the neutral tone, which has no fixed pitch and is sometimes thought of as a lack of tone.
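The sandhi correction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' script: the function names are hypothetical, the neutral tone is assumed to be coded as 5 (as in numbered pinyin), and the handling of three or more consecutive third tones (every third tone followed by another third tone becomes a second tone) is a simplifying assumption, since the text only specifies the rule for pairs.

```python
NEUTRAL_TONE = 5  # assumption: neutral tone coded as 5, as in numbered pinyin


def apply_third_tone_sandhi(tones):
    """Map citation tones of one word to surface tones.

    A third tone that is immediately followed by another third tone
    is rewritten as a second tone (e.g. [3, 3] -> [2, 3]).
    """
    surface = list(tones)
    for i in range(len(tones) - 1):
        if tones[i] == 3 and tones[i + 1] == 3:
            surface[i] = 2
    return surface


def drop_neutral(tones):
    """Remove neutral-tone syllables, which are excluded from evaluation."""
    return [t for t in tones if t != NEUTRAL_TONE]


# A word with two consecutive third tones, such as "ni3 hao3"
print(apply_third_tone_sandhi([3, 3]))
print(drop_neutral([1, 5]))
```

Looking at the original citation tones in `tones` rather than the partially rewritten `surface` list keeps the rule order-independent for longer runs.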

3.1 Pitch extraction and syllable segmentation

We use Praat’s autocorrelation-based pitch estimation algorithm to extract the fundamental frequency (F0) contour for each sample, using a minimum frequency of 75Hz and a maximum frequency of 500Hz [1]. The interface between Python and Praat is handled using Parselmouth [7]. We normalize the contour to be between 0 and 1, based on the speaker’s pitch range.
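As a rough sketch of this step, the snippet below calls Praat's autocorrelation pitch tracker through Parselmouth and then rescales the contour. The helper names are hypothetical, and two details are assumptions not stated in the text: unvoiced frames (which Praat reports as 0 Hz) are mapped to NaN, and the speaker's pitch range is passed in as an explicit minimum and maximum in Hz.

```python
import numpy as np


def extract_f0(wav_path, floor_hz=75.0, ceiling_hz=500.0):
    """Return the F0 track (Hz) of a recording; unvoiced frames are 0."""
    import parselmouth  # third-party Praat interface, imported lazily

    sound = parselmouth.Sound(wav_path)
    pitch = sound.to_pitch_ac(pitch_floor=floor_hz, pitch_ceiling=ceiling_hz)
    return pitch.selected_array["frequency"]


def normalize_f0(f0_hz, speaker_min_hz, speaker_max_hz):
    """Rescale voiced F0 values to [0, 1] using the speaker's pitch range."""
    f0_hz = np.asarray(f0_hz, dtype=float)
    voiced = f0_hz > 0
    scaled = (f0_hz - speaker_min_hz) / (speaker_max_hz - speaker_min_hz)
    scaled = np.clip(scaled, 0.0, 1.0)
    # Assumption: mark unvoiced frames as NaN so later steps can skip them
    return np.where(voiced, scaled, np.nan)


# Toy F0 track: one unvoiced frame followed by three voiced frames
contour = normalize_f0([0.0, 100.0, 200.0, 300.0], 100.0, 300.0)
print(contour)
```

In practice the minimum and maximum would be estimated from all of the speaker's recordings, for example as the extremes (or robust percentiles) of the voiced frames.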

Next, we segment each speech sample into syllables, which is necessary because syllable boundaries are not provided in our datasets. This is done using a simple heuristic that detects continuously voiced segments, and manual annotation where the heuristic fails. To obtain a constant length pitch contour as input to our model, we sample the pitch at 40 equally spaced points. Note that by sampling a variable length contour to a constant length, information about syllable length is lost; this is acceptable because we consider tones which differ on length as variations of the same tone.
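The segmentation heuristic and the fixed-length resampling might look like the following sketch. It assumes unvoiced frames are encoded as NaN, uses a hypothetical minimum run length of 5 frames to discard short voiced blips, and resamples by linear interpolation; none of these specifics are given in the text.

```python
import numpy as np


def voiced_runs(contour, min_frames=5):
    """Return (start, end) index pairs of continuously voiced segments."""
    voiced = ~np.isnan(contour)
    runs, start = [], None
    for i, is_voiced in enumerate(voiced):
        if is_voiced and start is None:
            start = i
        elif not is_voiced and start is not None:
            if i - start >= min_frames:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the contour
    if start is not None and len(contour) - start >= min_frames:
        runs.append((start, len(contour)))
    return runs


def resample(segment, n_points=40):
    """Sample a variable-length contour at n_points equally spaced points."""
    segment = np.asarray(segment, dtype=float)
    old_x = np.linspace(0.0, 1.0, num=len(segment))
    new_x = np.linspace(0.0, 1.0, num=n_points)
    return np.interp(new_x, old_x, segment)


nan = np.nan
contour = np.concatenate([
    [nan, nan], np.linspace(0.2, 0.8, 6), [nan],
    [0.5] * 3, [nan], np.linspace(0.9, 0.3, 7),
])
syllables = [resample(contour[s:e]) for s, e in voiced_runs(contour)]
print(len(syllables), syllables[0].shape)
```

A heuristic like this merges adjacent syllables when voicing never stops between them, which is one case where manual annotation would be needed.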

Figure 3: Latent space generated by autoencoder and the results of mean shift clustering for Mandarin and Cantonese. Each cluster center is fed through the decoder to generate the corresponding pitch contour. The clusters within each language are ordered by size, from largest to smallest.

4 Model

4.1 Convolutional autoencoder

We use a convolutional autoencoder (Figure 2) to learn a two-dimensional latent vector for each syllable. Convolutional layers are widely used in computer vision and speech processing to learn spatially local features that are invariant of position. We use a low dimensional latent space so that the model learns to generate a representation that only captures the most important aspects of the input contour, and also because clustering algorithms tend to perform poorly in high dimensional spaces.

Our encoder consists of three layers. The first layer applies 2 convolutional filters (kernel size 4, stride 1) followed by max pooling (kernel size 2) and a tanh activation. The second layer applies 4 convolutional filters (kernel size 4, stride 1), again with max pooling (kernel size 2) and a tanh activation. The third layer is a fully connected layer with two dimensional output. Our decoder is the encoder in reverse, consisting of one fully connected layer and two deconvolution layers, with the same layer shapes as the encoder.
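To make the layer shapes concrete, the short calculation below traces the length of a 40-point contour through the two convolution and pooling stages. It assumes no padding and a pooling stride equal to the pooling kernel size with floor rounding (the usual defaults), which the text does not state explicitly.

```python
def conv_out(length, kernel, stride=1):
    """Output length of an unpadded 1-D convolution."""
    return (length - kernel) // stride + 1


def pool_out(length, kernel):
    """Output length of max pooling with stride equal to the kernel size."""
    return length // kernel


length = 40              # points per pitch contour
trace = [length]
for _ in range(2):       # two conv + pool stages
    length = conv_out(length, kernel=4)
    trace.append(length)
    length = pool_out(length, kernel=2)
    trace.append(length)

flattened = 4 * length   # 4 channels after the second convolution
print(trace, flattened)  # [40, 37, 18, 15, 7] 28
```

Under these assumptions the fully connected layer maps 28 features to the 2-dimensional latent vector.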

We train the autoencoder using PyTorch [14] for 500 epochs, with a batch size of 60. The model is optimized using Adam [9] with a learning rate of 5e-4 to minimize the mean squared error between the input and output contours.
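One possible PyTorch rendering of this architecture and training setup is sketched below. The class and function names are hypothetical, and several choices are assumptions rather than details from the paper: convolutions are unpadded, the decoder undoes max pooling with nearest-neighbour upsampling, and the output layer has no activation.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader


class ToneAutoencoder(nn.Module):
    def __init__(self, latent_dim=2):
        super().__init__()
        self.encoder_conv = nn.Sequential(
            nn.Conv1d(1, 2, kernel_size=4, stride=1),  # 40 -> 37
            nn.MaxPool1d(2),                           # 37 -> 18
            nn.Tanh(),
            nn.Conv1d(2, 4, kernel_size=4, stride=1),  # 18 -> 15
            nn.MaxPool1d(2),                           # 15 -> 7
            nn.Tanh(),
        )
        self.to_latent = nn.Linear(4 * 7, latent_dim)
        self.from_latent = nn.Linear(latent_dim, 4 * 7)
        # Assumption: upsampling stands in for "unpooling" in the decoder
        self.decoder_conv = nn.Sequential(
            nn.Upsample(size=15),                                # 7 -> 15
            nn.ConvTranspose1d(4, 2, kernel_size=4, stride=1),   # 15 -> 18
            nn.Tanh(),
            nn.Upsample(size=37),                                # 18 -> 37
            nn.ConvTranspose1d(2, 1, kernel_size=4, stride=1),   # 37 -> 40
        )

    def encode(self, x):
        return self.to_latent(self.encoder_conv(x).flatten(start_dim=1))

    def decode(self, z):
        return self.decoder_conv(self.from_latent(z).view(-1, 4, 7))

    def forward(self, x):
        return self.decode(self.encode(x))


def train(model, contours, epochs=500, batch_size=60, lr=5e-4):
    """Minimize reconstruction MSE with Adam; contours has shape (N, 1, 40)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    loader = DataLoader(contours, batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for batch in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(batch), batch)
            loss.backward()
            optimizer.step()
    return model


model = ToneAutoencoder()
fake_contours = torch.rand(120, 1, 40)  # random stand-in for real pitch contours
train(model, fake_contours, epochs=2)   # short run, just to exercise the loop
print(model.encode(fake_contours).shape)
```

After training, `model.encode` would give the 2-dimensional points to be clustered, and `model.decode` can turn any latent point back into a 40-point contour.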

4.2 Mean shift clustering

We run the encoder on each syllable’s pitch contour to obtain its latent representation, and apply principal component analysis (PCA) to remove any correlation between the two dimensions. Then, we run mean shift clustering [4, 6], which estimates a probability density function in the latent space. The procedure performs gradient ascent on all the points until they converge to a set of stationary points, which are local maxima of the density function. These stationary points are taken to be cluster centers, and points that converge to the same stationary point belong to the same cluster.
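The procedure can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes a Gaussian kernel (library implementations often use a flat kernel instead), merges converged points that lie within half a bandwidth of each other, and implements PCA as a rotation without rescaling. The function names are hypothetical.

```python
import numpy as np


def pca_rotate(latents):
    """Center the points and rotate them onto their principal axes."""
    centered = latents - latents.mean(axis=0)
    _, _, components = np.linalg.svd(centered, full_matrices=False)
    return centered @ components.T


def mean_shift(points, bandwidth=0.6, n_iter=200, tol=1e-5):
    """Move every point uphill on a Gaussian kernel density estimate."""
    modes = points.copy()
    for _ in range(n_iter):
        # Squared distance from each current position to each data point
        d2 = ((modes[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
        weights = np.exp(-d2 / (2 * bandwidth ** 2))
        shifted = weights @ points / weights.sum(axis=1, keepdims=True)
        converged = np.abs(shifted - modes).max() < tol
        modes = shifted
        if converged:
            break
    # Points whose stationary points (nearly) coincide share a cluster
    centers, labels = [], np.empty(len(points), dtype=int)
    for i, mode in enumerate(modes):
        for k, center in enumerate(centers):
            if np.linalg.norm(mode - center) < bandwidth / 2:
                labels[i] = k
                break
        else:
            centers.append(mode)
            labels[i] = len(centers) - 1
    return np.array(centers), labels


rng = np.random.default_rng(0)
blob_a = rng.normal([0.0, 0.0], 0.1, size=(50, 2))
blob_b = rng.normal([5.0, 5.0], 0.1, size=(50, 2))
centers, labels = mean_shift(pca_rotate(np.vstack([blob_a, blob_b])))
print(len(centers))
```

Each update replaces a point by the kernel-weighted mean of the data around it, which is a step in the direction of the density gradient.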

Unlike k-means clustering, the mean shift procedure does not require the number of clusters to be specified, only a bandwidth parameter (set to 0.6 for our experiments). The cluster centers are always in regions of high density, so they can be viewed as prototypes that represent their respective clusters. Another advantage is that unlike k-means, mean shift clustering is robust to outliers.

Although the mean shift procedure technically assigns every point to a cluster, not all such clusters are linguistically plausible as phonemic tones, because they contain very few points. Thus, we take only clusters larger than a threshold, determined empirically from the distribution of cluster sizes; the rest are considered spurious clusters and we treat them as unclustered. Finally, we feed the remaining cluster centers into the decoder to generate a prototype pitch contour for each cluster.
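The size filter amounts to a few lines of bookkeeping, sketched below. The function name and the threshold value in the example are hypothetical; the text only says that the threshold was chosen empirically from the distribution of cluster sizes.

```python
from collections import Counter


def keep_large_clusters(labels, threshold):
    """Keep clusters with more than `threshold` points; mark the rest None.

    Returns the filtered labels and the kept cluster ids, largest first.
    """
    sizes = Counter(labels)
    kept = {c for c, n in sizes.items() if n > threshold}
    filtered = [c if c in kept else None for c in labels]
    ordered = sorted(kept, key=lambda c: -sizes[c])
    return filtered, ordered


# Cluster 2 has a single member and is treated as unclustered
filtered, ordered = keep_large_clusters([0, 0, 0, 1, 1, 2], threshold=1)
print(filtered, ordered)  # [0, 0, 0, 1, 1, None] [0, 1]
```

The centers of the clusters listed in `ordered` are the ones that would then be passed through the decoder to obtain prototype contours.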

5 Results

| Cluster | T1 | T2 | T3 | T4 |
|---|---|---|---|---|
| A | 1 | 163 | 12 | 4 |
| B | 108 | 0 | 0 | 1 |
| C | 0 | 5 | 53 | 31 |
| D | 1 | 0 | 0 | 97 |
| N/A | 47 | 30 | 53 | 129 |

Table 1: Cluster and tone frequencies for Mandarin.

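| Cluster | T1 | T2 | T3 | T4 | T5 | T6 |
|---|---|---|---|---|---|---|
| A | 5 | 5 | 59 | 109 | 7 | 105 |
| B | 102 | 3 | 36 | 2 | 2 | 7 |
| C | 93 | 0 | 0 | 2 | 0 | 0 |
| D | 0 | 64 | 4 | 3 | 2 | 11 |
| E | 0 | 28 | 2 | 4 | 30 | 2 |
| N/A | 70 | 39 | 51 | 45 | 15 | 49 |

Table 2: Cluster and tone frequencies for Cantonese.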

Figure 3 shows the latent space learned by the autoencoders and the clustering output. Our model found 4 tone clusters in Mandarin, matching the number of phonemic tones (Table 1), and 5 in Cantonese, which is one fewer than the number of phonemic tones (Table 2). In Mandarin, the 4 clusters correspond very well with the 4 phonemic tone categories, and the generated contours closely match the ground truth in Figure 1. There is some overlap between tones 3 and 4: tone 3 is sometimes realized as a low-falling tone without the final rise, a process known as half T3 sandhi [3], so it may overlap with tone 4 (the falling tone).

In Cantonese, the 5 clusters A-E correspond to low-falling, mid-level, high-level, mid-rising, and low-rising tones. Tone clustering in Cantonese is expected to be more difficult than in Mandarin because Cantonese has 6 contrastive tones rather than 4. The model is more effective at clustering the higher tones (1, 2, 3), and less effective at clustering the lower tones (4, 5, 6), particularly tone 4 (low-falling) and tone 6 (low-level). This is consistent with prior work, which reported worse classification accuracy on the lower-pitched tones because the lower region of the Cantonese tone space is more crowded than the upper region [15].

Two other sources of error are carry-over and declination effects. A carry-over effect is when the pitch contour of a tone undergoes contextual variation depending on the preceding tone; strong carry-over effects have been observed in Mandarin [20]. Prior work [10] avoided carry-over effects by using only the second half of every syllable, but we do not consider language-specific heuristics in our model. Declination is a phenomenon in which the pitch declines over an utterance [21, 15]. This is especially a problem in Cantonese, which has tones that differ only on pitch level and not contour: for example, a mid-level tone near the end of a phrase may have the same absolute pitch as a low-level tone at the start of a phrase.

| Language | First Syllable | All Syllables |
|---|---|---|
| Mandarin | 0.738 | 0.641 |
| Cantonese | 0.515 | 0.464 |

Table 3: Normalized mutual information (NMI) between cluster assignments and ground truth tones, considering only the first syllable of each word, or all syllables.

To test whether these two effects contribute to the errors, we evaluate the model on only the first syllable of every word, which eliminates carry-over and declination effects (Table 3). In both Mandarin and Cantonese, the clustering is more accurate when using only the first syllables than when using all of the syllables.
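For readers unfamiliar with the metric, the sketch below computes NMI from a cluster-by-tone contingency table. It normalizes the mutual information by the arithmetic mean of the two entropies, which is one of several common conventions; the paper does not say which normalization was used or whether unclustered syllables were included, so the value obtained from the Mandarin counts need not match the reported figures exactly.

```python
import math


def entropy(counts):
    """Shannon entropy (nats) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts if c > 0)


def nmi(table):
    """NMI for table[i][j] = number of syllables in cluster i with tone j."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    mutual_info = 0.0
    for i, row in enumerate(table):
        for j, n in enumerate(row):
            if n > 0:
                mutual_info += n / total * math.log(
                    n * total / (row_sums[i] * col_sums[j])
                )
    # Assumption: normalize by the arithmetic mean of the two entropies
    denom = (entropy(row_sums) + entropy(col_sums)) / 2
    return mutual_info / denom if denom > 0 else 1.0


# Mandarin counts for clusters A-D (the N/A row is left out here)
mandarin = [
    [1, 163, 12, 4],
    [108, 0, 0, 1],
    [0, 5, 53, 31],
    [1, 0, 0, 97],
]
print(round(nmi(mandarin), 3))
```

A table with all counts on the diagonal gives an NMI of 1, and a table whose rows are proportional to each other gives 0.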

6 Conclusions and future work

We propose a model for unsupervised clustering and discovery of phonemic tones in tonal languages, using spoken words as input. Our model extracts the F0 pitch contour, trains a convolutional autoencoder to learn a low-dimensional representation for each contour, and applies mean shift clustering to the resulting latent space. We obtain promising results with both Mandarin Chinese and Cantonese, using only 400 spoken words from each language. Cantonese presents more difficulties because of its larger number of tones, especially at the lower half of the pitch range, and also due to multiple contrastive level tones. Finally, we briefly explore the influence of contextual variation on our model.

A limitation of this study is that our model only considers pitch, which is only one aspect of tone. In reality, pitch is determined not only by tone, but by a complex mixture of intonation, stress, and other prosody effects. Tone is not a purely phonetic property – it is impossible to determine on a phonetic basis whether two pitch contours have distinct underlying tones, or are variants of the same underlying tone (perhaps in complementary distribution). Instead, two phonemic tones can be shown to be contrastive only by providing a minimal pair, where two semantically different lexical items are identical in every respect other than their tones. The last problem is not unique to tone: similar difficulties have been noted when attempting to identify consonant and vowel phonemes automatically [8]. In future work, we plan to further explore these issues and develop more nuanced models to learn tone from speech.

7 Acknowledgments

We thank Prof. Gerald Penn for his helpful suggestions during this project. Rudzicz is a CIFAR Chair in AI.


References

  • [1] P. Boersma (1993) Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. In Proceedings of the Institute of Phonetic Sciences, Vol. 17, pp. 97–110.
  • [2] C. Chen, R. C. Bunescu, L. Xu, and C. Liu (2016) Tone classification in Mandarin Chinese using convolutional neural networks. In INTERSPEECH, pp. 2150–2154.
  • [3] M. Y. Chen (2000) Tone sandhi: patterns across Chinese dialects. Vol. 92, Cambridge University Press.
  • [4] D. Comaniciu and P. Meer (2002) Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis & Machine Intelligence (5), pp. 603–619.
  • [5] A. L. Francis, V. Ciocca, L. Ma, and K. Fenn (2008) Perceptual learning of Cantonese lexical tones by tone and non-tone language speakers. Journal of Phonetics 36 (2), pp. 268–294.
  • [6] Y. A. Ghassabeh and F. Rudzicz (2018) Modified mean shift algorithm. IET Image Processing 12 (12), pp. 2172–2177.
  • [7] Y. Jadoul, B. Thompson, and B. De Boer (2018) Introducing Parselmouth: a Python interface to Praat. Journal of Phonetics 71, pp. 1–15.
  • [8] T. Kempton and R. K. Moore (2014) Discovering the phoneme inventory of an unwritten language: a machine-assisted approach. Speech Communication 56, pp. 152–166.
  • [9] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. ICLR.
  • [10] G. Levow (2006) Unsupervised and semi-supervised learning of tone and pitch accent. In Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pp. 224–231.
  • [11] M. P. Lewis (2009) Ethnologue: languages of the world. 16th edition, SIL International, Dallas, Texas.
  • [12] M. Müller, J. Franke, S. Stüker, and A. Waibel (2017) Improving phoneme set discovery for documenting unwritten languages. Elektronische Sprachsignalverarbeitung (ESSV) 2017.
  • [13] M. Müller, J. Franke, A. Waibel, and S. Stüker (2017) Towards phoneme inventory discovery for documentation of unwritten languages. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5200–5204.
  • [14] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.
  • [15] G. Peng and W. S. Wang (2005) Tone recognition of continuous Cantonese speech based on support vector machines. Speech Communication 45 (1), pp. 49–62.
  • [16] D. Renshaw, H. Kamper, A. Jansen, and S. Goldwater (2015) A comparison of neural network methods for unsupervised representation learning on the zero resource speech challenge. In Sixteenth Annual Conference of the International Speech Communication Association.
  • [17] N. Ryant, M. Slaney, M. Liberman, E. Shriberg, and J. Yuan (2014) Highly accurate Mandarin tone classification in the absence of pitch information. In Proceedings of Speech Prosody, Vol. 7.
  • [18] N. Ryant, J. Yuan, and M. Liberman (2014) Mandarin tone classification without pitch tracking. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4868–4872.
  • [19] M. Versteegh, R. Thiolliere, T. Schatz, X. N. Cao, X. Anguera, A. Jansen, and E. Dupoux (2015) The zero resource speech challenge 2015. In Sixteenth Annual Conference of the International Speech Communication Association.
  • [20] Y. Xu (1997) Contextual tonal variations in Mandarin. Journal of Phonetics 25 (1), pp. 61–83.
  • [21] M. Yip (2002) Tone. Cambridge University Press.
  • [22] S. Zhang (2019) Data mining Mandarin tone contour shapes. SIGMORPHON 2019, pp. 144.