End-to-end Phoneme Sequence Recognition using Convolutional Neural Networks

12/07/2013 ∙ by Dimitri Palaz, et al. ∙ Idiap Research Institute

Most state-of-the-art phoneme recognition systems rely on classical neural network classifiers fed with highly tuned features, such as MFCC or PLP features. Recent advances in "deep learning" approaches have called such systems into question, but while some attempts have been made with simpler features such as spectrograms, state-of-the-art systems still rely on MFCCs. This might be viewed as a kind of failure of deep learning approaches, which are often claimed to have the ability to train on raw signals, alleviating the need for hand-crafted features. In this paper, we investigate a convolutional neural network approach for raw speech signals. While convolutional architectures have had tremendous success in computer vision and text processing, they seem to have been neglected in recent years in the speech processing field. We show that it is possible to learn an end-to-end phoneme sequence classifier directly from the raw signal, with performance on the TIMIT and WSJ datasets similar to that of existing MFCC-based systems, questioning the need for complex hand-crafted features on large datasets.




1 Introduction

Most state-of-the-art phoneme recognition systems tend to follow the same approach: the task is divided into several sub-tasks, which are optimized independently. In a first step, the data is transformed into features, usually through a dimensionality reduction phase and an information selection phase, based on task-specific knowledge of the phenomena. These two phases have been carefully hand-crafted, leading to state-of-the-art features such as MFCCs or PLPs. In a second step, the likelihood of sub-sequence units is estimated using generative or discriminative models, making several assumptions, for example that units form a Markov chain. In a final step, dynamic programming techniques are used to recognize the sequence under constraints.

Recent advances in machine learning have made possible systems that can be trained in an end-to-end manner, i.e. systems where every step is learned simultaneously, taking into account all the other steps and the final task of the whole system. This is usually referred to as deep learning, mainly because such architectures are usually composed of many layers (supposed to provide an increasing level of abstraction), compared to classical “shallow” systems. As opposed to the “divide and conquer” approaches presented previously (where each step is independently optimized), deep learning approaches are often claimed to have the potential to lead to better-optimized systems, and to alleviate the need to find the right features for a given task of interest. While such approaches have a good record of success in computer vision [lecun_gradient-based_1998] and text processing [collobert_natural_2011], most speech systems still rely on complex features such as MFCCs, including recent advanced deep learning approaches [mohamed_acoustic_2012]. This is at odds with the claim that these techniques have end-to-end learning potential.

In this paper, we propose an end-to-end phoneme sequence recognition system, which takes the raw speech signal directly as input and outputs a phoneme sequence. The system is composed of two parts: Convolutional Neural Networks (CNNs) [lecun_generalization_1989] perform the feature learning and classification stages, and a simple version of Conditional Random Fields (CRFs) is used for the decoding stage; the whole system is trained in an end-to-end manner. We show that such a system can in fact lead to competitive phoneme recognition, even when trained on raw signals. We compare the proposed approach with the conventional hybrid HMM/ANN approach of extracting spectral-based acoustic features and then modeling them with an ANN. Experimental studies conducted on the TIMIT and WSJ corpora show that the proposed approach can yield a phoneme recognition system whose performance is similar to or better than that of a system based on the conventional approach.

The remainder of the paper is organized as follows. Section 2 presents a brief survey of related literature. Section 3 presents the classical HMM/ANN system. Section 4 presents the architecture of the proposed system. Section 5 presents the experimental setup and Section 6 presents the results and the discussion. Section 7 concludes the paper.

2 Related Work

Deep learning architectures have been successfully applied to a wide range of applications: character recognition [lecun_gradient-based_1998], object recognition [lecun_learning_2004], natural language processing [collobert_unified_2008] and image classification [krizhevsky_imagenet_2012].

In speech, one of the first phoneme recognition systems based on neural networks was the Time Delay Neural Network [waibel_phoneme_1989]. It was extended to isolated word recognition [bottou_experiments_1989]. At the same time, the hybrid HMM/ANN approach [bourlard_links_1990, bengio_connectionist_1993] was developed, leading to more scalable systems.

Recently, the deep belief network [hinton_fast_2006] approach has been found to yield good performance in phone recognition [mohamed_deep_2009]. It was also extended to context-dependent phonemes in [seide_conversational_2011]. However, these systems used complex hand-crafted features, such as MFCCs. Later, there has been growing interest in using the short-term spectrum as features. These “intermediate” representations (standing between the raw signal and “classical” features such as cepstral-based features) have been successfully used in speech recognition applications, for example in unsupervised feature learning [lee_unsupervised_2009], acoustic modeling [mohamed_acoustic_2012] and large vocabulary speech recognition [dahl_context-dependent_2012, hinton_deep_2012]. Convolutional neural networks also yield state-of-the-art results in phoneme recognition [abdel-hamid_applying_2012]. Although these systems are able to learn efficient representations, these features are still used as input for hybrid systems.

The work presented in [jaitly_learning_2011] successfully investigates feature learning from raw speech for phoneme recognition. Here also, these features are used as input for a different system. Finally, convolutional neural networks have shown the capability of learning features from raw speech and estimating phoneme conditional probabilities in a single system.


3 Phoneme sequence recognition using hybrid HMM/ANN system

The hybrid HMM/ANN system, presented in Figure 1, is one of the most common systems for phoneme recognition. It is composed of three parts: feature extraction, classification and decoding. In the first step, features are extracted from the signal by transformation and filtering. The most common ones are Mel Frequency Cepstrum Coefficients (MFCC) [davis_comparison_1980] and Perceptual Linear Prediction (PLP) [hermansky1990perceptual]. Usually, the first and second derivatives of these representations are computed over several surrounding frames and used as additional features. They are then given as input to an Artificial Neural Network (ANN), along with a context of 4 frames to the left and 4 frames to the right. The network is usually a feed-forward MLP composed of a hidden layer and an output layer, which computes the conditional probabilities for each class. These probabilities are then used as emission probabilities in a Hidden Markov Model (HMM) framework, which decodes the sequence.
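The context stacking described above (each frame presented to the ANN together with 4 frames on each side) can be sketched as follows. This is only an illustration: the function name `stack_context` is hypothetical, and padding the edges by repeating the first and last frame is an assumption, since the text does not say how sequence boundaries are handled.

```python
def stack_context(frames, left=4, right=4):
    """Concatenate each frame with its left and right neighbours.

    frames: list of feature vectors (lists of floats), one per time step.
    Edges are padded by repeating the first/last frame (an assumption of
    this sketch, not something stated in the paper).
    """
    T = len(frames)
    stacked = []
    for t in range(T):
        window = []
        for s in range(t - left, t + right + 1):
            s_clamped = min(max(s, 0), T - 1)  # clamp indices at the edges
            window.extend(frames[s_clamped])
        stacked.append(window)
    return stacked


# Toy input: 6 frames of 2-dimensional features.
frames = [[float(t), float(10 * t)] for t in range(6)]
stacked = stack_context(frames)
# One stacked vector per frame, each of size (4 + 1 + 4) * 2.
print(len(stacked), len(stacked[0]))
```

With 39-dimensional features, the same stacking would give the classifier an input of 9 × 39 values per frame.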

Figure 1: Hybrid ANN/HMM phoneme recognition

4 Proposed system

The proposed system is composed of three stages: the feature learning stage, the modeling stage, which performs the classification, and the decoding of the sequence, as presented in Figure 2. CNNs are used for the first two stages. Their convolutional aspect makes them particularly suitable for handling temporal signals such as raw speech. Moreover, the addition of max-pooling layers, which are classical layers in deep architectures, brings robustness against temporal distortion to the system and helps control the network capacity. For the third stage, a decoder based on CRFs is proposed, which learns the transitions between the different classes. The back-propagation algorithm is used for training the whole system in an end-to-end manner.

Figure 2: End-to-end phoneme recognition system

4.1 Convolutional Neural Network

The network is given a sequence of raw input signal, split into frames, and outputs a score for each class, for each frame. This type of network architecture is composed of several filter extraction stages, followed by a classification stage. A filter extraction stage involves a convolutional layer, followed by a temporal pooling layer and a non-linearity (tanh). Our optimal architecture included 3 stages of filter extraction (see Figure 3). The signals coming out of these stages are fed to a classification stage, which in our case consists of two linear layers with a large number of hidden units.

Figure 3: Convolutional Neural Network for one frame classification. Several stages of convolution/pooling/tanh might be considered. Our network included 3 stages.

4.1.1 Convolutional layer

While “classical” linear layers in standard MLPs accept a fixed-size input vector, a convolutional layer is assumed to be fed with a sequence of $T$ frames $X = \{x^1, x^2, \dots, x^T\}$. A convolutional layer applies the same linear transformation over each successive (or interspaced by $dW$ frames) window of $kW$ frames. E.g., the transformation at frame $t$ is formally written as:

$$ M \begin{pmatrix} x^{t-(kW-1)/2} \\ \vdots \\ x^{t+(kW-1)/2} \end{pmatrix} \qquad (1) $$

where $M$ is a $d_{out} \times (kW \cdot d_{in})$ matrix of parameters. In other words, $d_{out}$ filters (rows of the matrix $M$) are applied to the input sequence. An illustration is provided in Figure 4.
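As a minimal sketch of the convolutional layer described above, the following pure-Python function applies one shared matrix to successive windows of frames. The function name and the toy filters are hypothetical. For simplicity the windows are indexed by their first frame rather than their centre frame, and only windows that fit entirely inside the sequence are used (no padding); both choices are assumptions of this sketch.

```python
def temporal_convolution(x, M, kW, dW=1):
    """Apply the same linear map M to successive windows of kW frames.

    x  : list of T input frames, each a list of d_in floats
    M  : list of d_out rows, each of length kW * d_in (one filter per row)
    kW : kernel width in frames
    dW : shift (in frames) between two applications of M
    """
    T = len(x)
    out = []
    for start in range(0, T - kW + 1, dW):
        # Concatenate the kW frames of the window into one vector.
        window = [v for frame in x[start:start + kW] for v in frame]
        # Each output component is the dot product of one filter with the window.
        out.append([sum(m * v for m, v in zip(row, window)) for row in M])
    return out


# Toy input: 5 frames with d_in = 1.
x = [[1.0], [2.0], [4.0], [7.0], [11.0]]
# Two hypothetical filters (d_out = 2) spanning kW = 3 frames.
M = [[-1.0, 0.0, 1.0],   # difference between last and first frame of the window
     [0.5, 0.0, 0.5]]    # mean of the first and last frame of the window
y = temporal_convolution(x, M, kW=3, dW=2)
print(y)  # [[3.0, 2.5], [7.0, 7.5]]
```

With a shift of 2 frames, the 5 input frames yield 2 output frames of dimension 2, one per window.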

Figure 4: Illustration of a convolutional layer. $d_{in}$ and $d_{out}$ are the dimensions of the input and output frames. $kW$ is the kernel width and $dW$ is the shift between two linear applications.

Figure 5: Illustration of a max-pooling layer. $kW$ is the number of frames taken for each operation and $d$ represents the dimension of the input/output frames (which are equal).

4.1.2 Max-pooling layer

This kind of layer performs local temporal max operations over an input sequence, as shown in Figure 5. More formally, the transformation at frame $t$ is written as:

$$ \max_{t-(kW-1)/2 \,\le\, s \,\le\, t+(kW-1)/2} x_d^s \qquad \forall d \qquad (2) $$

These layers increase the robustness of the network to slight temporal distortions in the input.
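The max-pooling operation can be sketched in a few lines. The function name is hypothetical, and using a shift equal to the pooling width (non-overlapping windows) is an assumption, chosen because it divides the sequence length by the pooling width.

```python
def temporal_max_pooling(x, kW, dW=None):
    """Take the per-dimension maximum over windows of kW frames.

    x  : list of T frames, each a list of d floats
    kW : number of frames pooled together
    dW : shift between windows; defaults to kW (non-overlapping windows,
         an assumption of this sketch)
    """
    if dW is None:
        dW = kW
    d = len(x[0])
    out = []
    for start in range(0, len(x) - kW + 1, dW):
        window = x[start:start + kW]
        # The input and output frames have the same dimension d.
        out.append([max(frame[j] for frame in window) for j in range(d)])
    return out


x = [[1.0, 5.0], [3.0, 2.0], [2.0, 2.0], [0.0, 9.0]]
pooled = temporal_max_pooling(x, kW=2)
print(pooled)  # [[3.0, 5.0], [2.0, 9.0]]
```

Swapping the two frames inside a pooling window leaves the output unchanged, which is the sense in which pooling tolerates small temporal shifts.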

4.2 Decoding

We consider a simple version of CRFs, where we define a graph with nodes for each frame in the input sequence and each label. This CRF allows us to discriminatively train a simple duration model over our network output scores. Transition scores are assigned to edges between phonemes, and network output scores are assigned to nodes. Given an input data sequence $[x]_1^T$ and a label path $[i]_1^T$ on the graph, a score for the path can be defined:

$$ s([x]_1^T, [i]_1^T) = \sum_{t=1}^{T} \left( f_{i_t}(x_t) + A_{i_{t-1}, i_t} \right) \qquad (3) $$

where $A$ is a matrix describing transitions between labels and $f_{i_t}(x_t)$ is the network score of input $x_t$ for class $i_t$ at time $t$, as illustrated in Figure 6. Note that this approach is a particular case of the forward training of Graph Transformer Networks.

At inference time, given an input sequence $[x]_1^T$, the best label path can be found by maximizing (3). The Viterbi algorithm is used to find

$$ \operatorname*{argmax}_{[j]_1^T} \; s([x]_1^T, [j]_1^T). $$

Figure 6: Illustration of the CRF graph.
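The decoding step can be illustrated with a short Viterbi sketch over per-frame network scores and a transition matrix. The names `viterbi`, `scores` and `A` are hypothetical, as are the toy values, and the sketch assumes no transition term is applied to the first frame.

```python
def viterbi(scores, A):
    """Return the highest-scoring label path and its score.

    scores[t][k] : network score of class k at frame t
    A[j][k]      : transition score from label j to label k
    Assumption of this sketch: the first frame has no transition term.
    """
    T, K = len(scores), len(scores[0])
    delta = list(scores[0])   # best score of any path ending in each label
    backptr = []              # backptr[t-1][k]: best predecessor of k at frame t
    for t in range(1, T):
        new_delta, ptr = [], []
        for k in range(K):
            best_j = max(range(K), key=lambda j: delta[j] + A[j][k])
            new_delta.append(delta[best_j] + A[best_j][k] + scores[t][k])
            ptr.append(best_j)
        delta = new_delta
        backptr.append(ptr)
    # Backtrack from the best final label.
    last = max(range(K), key=lambda k: delta[k])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]


# Toy example with 2 labels and 3 frames; switching labels is penalised.
scores = [[2.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
A = [[0.0, -3.0], [-3.0, 0.0]]
best_path, best_score = viterbi(scores, A)
print(best_path, best_score)  # [0, 0, 0] 4.0
```

A frame-by-frame argmax would output the labels 0, 1, 0 here. The transition penalties make the decoder prefer staying in label 0, which is how learned transition scores can act as a simple duration model.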

4.3 Network training

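The network parameters $\theta$ are learned by maximizing the log-likelihood $L$, given by:

$$ L(\theta) = \sum_{([x]_1^T,\, [i]_1^T)} \log p([i]_1^T \mid [x]_1^T, \theta) \qquad (4) $$

for each input speech sequence $[x]_1^T$ and label sequence $[i]_1^T$, over the whole training set, with respect to the parameters of each layer. Defining the logadd operation as:

$$ \operatorname*{logadd}_{i} z_i = \log \Big( \sum_i e^{z_i} \Big) \qquad (5) $$

the likelihood of one sequence can be expressed as:

$$ \log p([i]_1^T \mid [x]_1^T) = s([x]_1^T, [i]_1^T) - \operatorname*{logadd}_{[j]_1^T} s([x]_1^T, [j]_1^T) \qquad (6) $$

where $s$ is the path score as defined in (3). The number of terms in the logadd operation grows exponentially with the length of the input sequence. Using a recursive algorithm, the logadd can be computed in linear time as presented in [collobert_natural_2011]. The likelihood is then maximized using the stochastic gradient ascent algorithm [bottou_stochastic_1991].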
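The recursive computation of the logadd over all paths can be sketched as a forward recursion. This is an illustration under stated assumptions, not the authors' implementation: all function names and toy values are hypothetical, and the first frame carries no transition term.

```python
import math


def logadd(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def path_score(scores, A, labels):
    """Score of one label path: node scores plus transition scores."""
    total = scores[0][labels[0]]
    for t in range(1, len(labels)):
        total += A[labels[t - 1]][labels[t]] + scores[t][labels[t]]
    return total


def log_partition(scores, A):
    """logadd of the scores of all label paths, via a forward recursion.

    The cost is linear in the number of frames (and quadratic in the
    number of labels), instead of exponential in the number of frames.
    """
    K = len(scores[0])
    alpha = list(scores[0])
    for t in range(1, len(scores)):
        alpha = [scores[t][k] + logadd([alpha[j] + A[j][k] for j in range(K)])
                 for k in range(K)]
    return logadd(alpha)


def sequence_log_likelihood(scores, A, labels):
    """Log-likelihood of a label path: its score minus the log-partition."""
    return path_score(scores, A, labels) - log_partition(scores, A)


# Toy example: 3 frames, 2 labels (hypothetical values).
scores = [[2.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
A = [[0.0, -3.0], [-3.0, 0.0]]
print(sequence_log_likelihood(scores, A, [0, 0, 0]))
```

Because the log-partition term normalises over every possible path, the exponentiated log-likelihoods of all label paths sum to one.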

5 Experimental Setup

In this section we present the setup used for the experiments, the databases and the hyper-parameters of the networks.

5.1 Databases

5.1.1 TIMIT Corpus

The TIMIT acoustic-phonetic corpus consists of 3,696 training utterances (sampled at 16 kHz) from 462 speakers, excluding the SA sentences. The cross-validation set consists of 400 utterances from 50 speakers. The core test set was used to report the results. It contains 192 utterances from 24 speakers, excluding the validation set. Three different phoneme sets were used: the first one is composed of 39 phonemes, as presented in [lee_speaker-independent_1989]. The second one has 117 classes (3 states for each of the 39 phonemes). The last one uses 183 class labels (3 states for each of the original 61 phonemes). In the last two cases, after decoding, the classes were mapped to the 39-phoneme set for evaluation.

5.1.2 Wall Street Journal Corpus

The SI-284 set of the corpus [woodland_large_1994] is selected for the experiments. The set contains 36,416 sequences, representing around 80 hours of speech. Ten percent of the set was taken as the validation set. The “Hub 2 2.5k” set was selected as the test set. It contains 215 sequences from 10 speakers. The phoneme sequences were extracted from the transcripts and segmented using a trained GMM. The CMU phoneme set, containing 40 classes, was used.

5.2 Baseline system

The baseline system is a standard HMM/ANN system [morgan_continuous_1995]. MFCCs were used as features. They were computed (with HTK [htk]) using a 25 ms Hamming window on the speech signal, with a shift of 10 ms. The signal is represented using 13th-order coefficients along with their first and second derivatives, computed on a 9-frame context. The classifier is a two-layer MLP. The decoding of the sequence was performed by a standard HMM decoder, with a constrained duration of 3 states, and considering all phonemes equally probable.

5.3 Network hyper-parameters

The hyper-parameters of the network are: the input window size, corresponding to the context taken along with each example, the number of samples in each example, the kernel width and shift of the convolutions, the number of filters $d_{out}$, the width of the hidden layer and the pooling width. They were tuned by early stopping on the cross-validation set. The ranges considered for the grid search are reported in Table 1. It is worth mentioning that, for a given input window size over the raw signal, the size of the output of the filter extraction stage strongly depends on the number of max-pooling layers, each of them dividing the output size of the filter stage by the chosen pooling kernel width. As a result, adding pooling layers reduces the input size of the classification stage, which in turn reduces the number of parameters of the network (as most parameters lie in the classification stage).
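The effect of pooling on the size of the filter-stage output can be checked with simple arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the window length and per-stage settings are made-up values, not the tuned hyper-parameters of the paper. It assumes convolutions without padding and non-overlapping pooling.

```python
def conv_output_length(n_in, kW, dW):
    """Number of output frames of a convolution without padding."""
    return (n_in - kW) // dW + 1


def filter_stage_output_length(n_samples, stages):
    """Sequence length after a stack of convolution + max-pooling stages.

    stages: list of (kernel width, kernel shift, pooling width) tuples.
    Pooling is assumed non-overlapping, so it divides the length by its width.
    """
    n = n_samples
    for kW, dW, pool in stages:
        n = conv_output_length(n, kW, dW)
        n = n // pool
    return n


# Illustrative values only: a 250 ms window at 16 kHz is 4000 samples.
n_samples = 4000
with_pooling = [(10, 10, 2), (5, 1, 2), (5, 1, 2)]
without_pooling = [(10, 10, 1), (5, 1, 1), (5, 1, 1)]

print(filter_stage_output_length(n_samples, with_pooling))     # 47
print(filter_stage_output_length(n_samples, without_pooling))  # 392
```

The input of the classification stage is this length multiplied by the number of filters of the last stage, so in this toy setting the pooling layers shrink the first linear layer by roughly a factor of eight.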

The best performance for TIMIT corpus on the cross-validation set for the first phoneme set was found with: ms duration for each example, ms of context, , and frames kernel width, , and frames shift, filters, hidden units and pooling width. For the second set: ms duration for each example, ms of context, , and frames kernel width, , and frames shift, filters, hidden units and pooling width. For the third set: ms duration for each example, ms of context, , and frames kernel width, , and frames shift, filters, hidden units and pooling width. For the WSJ corpus: ms duration for each example, ms of context, , and frames kernel width, , and frames shift, filters, hidden units and pooling width. For the baseline, early stopping on the cross-validation set was also used to determine the optimal number of nodes ( and nodes were found). The experiments were implemented using the torch7 toolbox [collobert_torch7_2011].

Parameters                                    Range
Input window size (ms)                        100-700
Example duration (ms)                         5-15
Kernel width ($kW$)                           1-9
Number of filters per kernel ($d_{out}$)      10-90
Number of hidden units in the class. stage    100-1500

Table 1: Network hyper-parameters

6 Results and discussion

The results are given in terms of phoneme recognition accuracy, which is derived from the Levenshtein distance between the reference and the inferred phoneme sequences. They are presented in Table 2 for the TIMIT corpus for the three phoneme sets, along with the number of parameters of the network. For comparison purposes, the accuracy for the larger phoneme sets is computed by mapping the classes to the first phoneme set. On the TIMIT database, the proposed system performs close to the baseline with the 39-class set (65.81 % vs. 66.65 %) and outperforms it when a larger number of classes with three states per phoneme is used (67.84 % and 70.08 %). The results for the larger database, the WSJ corpus, are presented in Table 3. There, the proposed system also yields performance similar to the baseline, showing that it scales to larger datasets.
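As an illustration of how an accuracy figure can be obtained from the Levenshtein distance, the sketch below uses the common convention accuracy = 1 - (edit distance / reference length). This convention is an assumption, since the exact scoring formula is not spelled out in the text; the function names and the toy sequences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def phoneme_accuracy(ref, hyp):
    """Accuracy as 1 - (edit distance / reference length); assumed convention."""
    return 1.0 - edit_distance(ref, hyp) / len(ref)


ref = ["sil", "k", "ae", "t", "sil"]
hyp = ["sil", "k", "ah", "t", "t", "sil"]
# One substitution (ae -> ah) and one insertion (extra t): distance 2.
print(edit_distance(ref, hyp), phoneme_accuracy(ref, hyp))
```

Under this convention, insertions count against the score, so the accuracy can be lower than the fraction of correctly recognised phonemes.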

The key difference between the systems is that the proposed system uses almost no prior knowledge about the data (except that the data is a temporal signal) and still achieves similar performance. This suggests that the deep network can learn relevant features, thus questioning the use of complex hand-crafted features. Moreover, by learning the transitions between phonemes, the CRF is able to learn a duration model directly from the training data, without any external constraints.

The end-to-end aspect of the proposed system makes it interesting for a stand-alone implementation of a phoneme recognizer, as the system takes sequences of raw speech as input and outputs phoneme sequences. Moreover, the speed of the system at inference makes it suitable for real-time phoneme recognition. A demo will be shown at the time of the conference.

System      #Classes    #Param.      Test acc.
Baseline    39          196’040      66.65 %
CNN+CRF     39          873’340      65.81 %
CNN+CRF     117         986’680      67.84 %
CNN+CRF     183         803’363      70.08 %

Table 2: Phoneme recognition accuracy on the core test set of the TIMIT corpus.

System      #Classes    #Param.      Test acc.
Baseline    39          1’786’440    72.39 %
CNN+CRF     39          6’573’440    72.88 %

Table 3: Phoneme recognition accuracy on the “Hub 2 2.5k” test set of the WSJ corpus.

7 Conclusion

In this paper, we proposed an end-to-end phoneme recognition system, which is able to learn features by taking raw speech data as input and yields performance similar to that of baseline systems. As future work, we plan to improve the current system by investigating deeper architectures or constrained CRFs. We will also extend it to context-dependent phonemes, and therefore more classes, which might lead to better performance, as Table 2 suggests. From there, we will focus on developing more specific applications, such as Spoken Term Detection.


This work was partly supported by the HASLER foundation through the grant “Universal Spoken Term Detection with Deep Learning” (DeepSTD) and by the Swiss NSF through the Swiss National Center of Competence in Research (NCCR) on Interactive Multimodal Information Management (www.im2.ch).