Most state-of-the-art phoneme recognition systems tend to follow the same approach: the task is divided into several sub-tasks, which are optimized independently. In a first step, the data is transformed into features, usually through a dimensionality reduction phase and an information selection phase, based on task-specific knowledge of the phenomena. These two phases have been carefully hand-crafted, leading to state-of-the-art features such as MFCCs or PLPs. In a second step, the likelihood of sub-sequence units is estimated using generative or discriminative models, under several assumptions, for example that units form a Markov chain. In a final step, dynamic programming techniques are used to recognize the sequence under constraints.
Recent advances in machine learning have made it possible to train systems in an end-to-end manner, i.e. systems where every step is learned simultaneously, taking into account all the other steps and the final task of the whole system. This is usually referred to as deep learning, mainly because such architectures are usually composed of many layers (supposed to provide an increasing level of abstraction), compared to classical "shallow" systems. As opposed to the "divide and conquer" approach presented previously (where each step is optimized independently), deep learning approaches are often claimed to have the potential to lead to more optimal systems, and to alleviate the need to find the right features for a given task of interest. While such approaches have a good success record in the computer vision [lecun_gradient-based_1998] and text processing fields [collobert_natural_2011], most speech systems still rely on complex features such as MFCCs, including recent advanced deep learning approaches [mohamed_acoustic_2012]. This contradicts the claim that these techniques have end-to-end learning potential.
In this paper, we propose an end-to-end phoneme sequence recognition system which takes the raw speech signal directly as input and outputs a phoneme sequence. The system is composed of two parts: Convolutional Neural Networks (CNNs) [lecun_generalization_1989] perform the feature learning and classification stages, and a simple version of Conditional Random Fields (CRFs) is used for the decoding stage; the whole system is trained in an end-to-end manner. We show that such a system can in fact lead to competitive phoneme recognition, even when trained on raw signal. In the framework of hybrid HMM/ANN systems, we compare the proposed approach with the conventional approach of extracting spectral-based acoustic features and then modeling them with an ANN. Experimental studies conducted on the TIMIT and WSJ corpora show that the proposed approach can yield a phoneme recognition system that is similar to or better than the system based on the conventional approach.
The remainder of the paper is organized as follows. Section 2 presents a brief survey of related literature. Section 3 presents the classical HMM/ANN system. Section 4 presents the architecture of the proposed system. Section 5 presents the experimental setup and Section 6 presents the results and the discussion. Section 7 concludes the paper.
2 Related Work
Deep learning architectures have been successfully applied to a wide range of applications: character recognition [lecun_gradient-based_1998], object recognition [lecun_learning_2004, collobert_unified_2008] and image classification [krizhevsky_imagenet_2012].
In speech, one of the first phoneme recognition systems based on neural networks was the Time Delay Neural Network [waibel_phoneme_1989]. It was extended to isolated word recognition [bottou_experiments_1989]. At the same time, the hybrid HMM/ANN architecture [bourlard_links_1990, bengio_connectionist_1993] was developed, leading to more scalable systems.
Recently, the deep belief network [hinton_fast_2006] approach has been found to yield good performance in phone recognition [mohamed_deep_2009]. It was also extended to context-dependent phonemes in [seide_conversational_2011]. However, these systems used complex hand-crafted features, such as MFCCs. Later, there has been growing interest in using the short-term spectrum as features. These "intermediate" representations (standing between the raw signal and "classical" features such as cepstral-based ones) have been successfully used in speech recognition applications, for example in unsupervised feature learning [lee_unsupervised_2009], acoustic modeling [mohamed_acoustic_2012] and large vocabulary speech recognition [dahl_context-dependent_2012, hinton_deep_2012]. Convolutional neural networks also yield state-of-the-art results in phoneme recognition [abdel-hamid_applying_2012]. Although these systems are able to learn efficient representations, these features are still used as input to hybrid systems.
The work presented in [jaitly_learning_2011] successfully investigates feature learning from raw speech for phoneme recognition. Here also, the learned features are used as input to a different system. Finally, convolutional neural networks have shown the capability of learning features from raw speech and estimating phoneme conditional probabilities in a single system [Palaz_INTERSPEECH_2013].
3 Phoneme sequence recognition using hybrid HMM/ANN system
The hybrid HMM/ANN system, presented in Figure 1, is one of the most common systems for phoneme recognition. It is composed of three parts: feature extraction, classification and decoding. In the first step, features are extracted from the signal by transformation and filtering. The most common ones are Mel Frequency Cepstrum Coefficients (MFCC) [davis_comparison_1980] and Perceptual Linear Prediction (PLP) [hermansky1990perceptual]. Usually, the first and second derivatives of these representations are computed over several surrounding frames and used as additional features. They are then given as input to an Artificial Neural Network (ANN), along with a context of 4 frames on each side. The network is usually a feed-forward MLP composed of a hidden layer and an output layer, which computes the conditional probabilities for each class. These probabilities are then used as emission probabilities in a Hidden Markov Model (HMM) framework, which decodes the sequence.
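As a sketch of the input preparation described above, the following Python snippet shows how per-frame features can be stacked with a 4-frame context on each side before being fed to the ANN. This is a minimal illustration: the variable names are ours, and the edge-padding convention (repeating the first/last frame) is an assumption, not taken from the paper.

```python
def stack_context(frames, context=4):
    """Stack each frame with `context` frames on each side.

    Edge frames are padded by repeating the first/last frame
    (a common convention; an assumption here, not from the paper).
    """
    T = len(frames)
    stacked = []
    for t in range(T):
        window = []
        for k in range(t - context, t + context + 1):
            k = min(max(k, 0), T - 1)  # clamp indices at the sequence edges
            window.extend(frames[k])
        stacked.append(window)
    return stacked

# Toy example: 6 frames of 2-dimensional features.
feats = [[float(t), float(t) + 0.5] for t in range(6)]
out = stack_context(feats, context=4)
# Each stacked frame has (2 * 4 + 1) * 2 = 18 values.
```

With 39-dimensional MFCC vectors (13 coefficients plus derivatives), the same stacking would produce 9 × 39 = 351 inputs per frame.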
4 Proposed system
The proposed system is composed of three stages, as presented in Figure 2: the feature learning stage, the modeling stage, which performs the classification, and the decoding of the sequence. CNNs are used for the first two stages. Their convolutional aspect makes them particularly suitable for handling temporal signals such as raw speech. Moreover, the addition of max-pooling layers, classical layers in deep architectures, brings robustness against temporal distortion to the system and helps control the network capacity. For the third stage, a decoder based on CRFs is proposed, which learns the transitions between the different classes. The back-propagation algorithm is used to train the whole system in an end-to-end manner.
4.1 Convolutional Neural Network
The network is given a sequence of raw input signal, split into frames, and outputs a score for each class, for each frame. This type of network architecture is composed of several filter extraction stages, followed by a classification stage. A filter extraction stage involves a convolutional layer, followed by a temporal pooling layer and a non-linearity. Our optimal architecture included 3 stages of filter extraction (see Figure 3). The signal coming out of these stages is fed to a classification stage, which in our case consists of two linear layers with a large number of hidden units.
4.1.1 Convolutional layer
While "classical" linear layers in standard MLPs accept a fixed-size input vector, a convolutional layer is assumed to be fed with a sequence of $T$ vectors/frames $x_1, x_2, \dots, x_T$, with $x_t \in \mathbb{R}^d$. A convolutional layer applies the same linear transformation over each successive window of $kW$ frames (or windows interspaced by $dW$ frames). E.g., the transformation at frame $t$ is formally written as:
$$ M \begin{pmatrix} x_{t-(kW-1)/2} \\ \vdots \\ x_{t+(kW-1)/2} \end{pmatrix}, $$
where $M \in \mathbb{R}^{d_{out} \times kW\,d}$ is a matrix of parameters. In other words, $d_{out}$ filters (the rows of the matrix $M$) are applied to the input sequence. An illustration is provided in Figure 5.
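Such a temporal convolution can be sketched in plain Python as follows. This is a minimal illustration of the operation, not the actual Torch implementation used for the experiments; `x` is a list of frames and `M` a list of filter rows.

```python
def conv1d(x, M, kW, dW=1):
    """Apply the same linear map M to every window of kW frames,
    advancing by dW frames (the shift/stride). Each row of M has
    length kW * len(x[0]), i.e. one weight per value in the window."""
    out = []
    for t in range(0, len(x) - kW + 1, dW):
        # Concatenate the kW frames of the current window into one vector.
        window = [v for frame in x[t:t + kW] for v in frame]
        out.append([sum(w * v for w, v in zip(row, window)) for row in M])
    return out

# One filter that sums a window of 3 one-dimensional frames.
x = [[1.0], [2.0], [3.0], [4.0]]
M = [[1.0, 1.0, 1.0]]
y = conv1d(x, M, kW=3)       # -> [[6.0], [9.0]]
y2 = conv1d(x, M, kW=3, dW=2)  # with a shift of 2 frames -> [[6.0]]
```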
4.1.2 Max-pooling layer
These kinds of layers perform local temporal max operations over an input sequence, as shown in Figure 5. More formally, the transformation at frame $t$ is written as:
$$ \max_{t-(kW-1)/2 \,\le\, s \,\le\, t+(kW-1)/2} x_s^{(i)} \qquad \forall i \in \{1, \dots, d\}, $$
i.e. the maximum over a window of $kW$ frames is taken independently for each dimension $i$.
These layers increase the robustness of the network to slight temporal distortions in the input.
4.2 Conditional Random Field
We consider a simple version of CRFs, where we define a graph with a node for each frame in the input sequence and each label. This CRF allows us to discriminatively train a simple duration model over our network output scores. Transition scores are assigned to edges between phonemes, and network output scores are assigned to nodes. Given an input data sequence $x_1^T$ and a label path $l_1^T$ on the graph, a score for the path can be defined:
$$ s(x_1^T, l_1^T) = \sum_{t=1}^{T} \left( A_{l_{t-1}, l_t} + f(x_1^T, l_t, t) \right), \qquad (3) $$
where $A$ is a matrix describing transitions between labels and $f(x_1^T, i, t)$ is the network score of input $x_1^T$ for class $i$ at time $t$, as illustrated in Figure 6. Note that this approach is a particular case of the forward training of Graph Transformer Networks [bottou_global_1997].
At inference time, given an input sequence $x_1^T$, the best label path can be found by maximizing (3). The Viterbi algorithm is used to find it.
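The Viterbi decoding over the CRF scores can be sketched as follows. This is a minimal illustration: the variable names are ours, and the initial transition (there is no label before the first frame) is simply skipped, which is a simplification of this sketch.

```python
def viterbi(scores, A):
    """Best label path maximizing sum_t (A[prev][cur] + scores[t][cur]),
    where scores[t][j] is the network score for class j at frame t and
    A[i][j] is the transition score from class i to class j."""
    K = len(scores[0])
    delta = list(scores[0])  # best partial-path score ending in each class
    backptrs = []
    for frame in scores[1:]:
        ptr = [max(range(K), key=lambda i: delta[i] + A[i][j])
               for j in range(K)]
        delta = [delta[ptr[j]] + A[ptr[j]][j] + frame[j] for j in range(K)]
        backptrs.append(ptr)
    # Backtrack from the best final class.
    path = [max(range(K), key=lambda j: delta[j])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Two classes with a strong penalty on switching: the duration model
# keeps the decoder in class 0 despite one frame favoring class 1.
scores = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
A = [[0.0, -10.0], [-10.0, 0.0]]
path = viterbi(scores, A)  # -> [0, 0, 0]
```

The toy example shows how the learned transition scores act as a duration model: a single deviating frame is overruled by the transition penalty.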
4.3 Network training
The network parameters $\theta$ are learned by maximizing the log-likelihood $L$, given by:
$$ L(\theta) = \log p(l_1^T \mid x_1^T, \theta) $$
for each input speech sequence $x_1^T$ and label sequence $l_1^T$, over the whole training set, with respect to the parameters of each layer of the network. Defining the $\operatorname{logadd}$ operation as:
$$ \operatorname{logadd}_{i} z_i = \log \Big( \sum_{i} e^{z_i} \Big), $$
the likelihood can be expressed as:
$$ L = s(x_1^T, l_1^T) - \operatorname{logadd}_{\forall j_1^T} s(x_1^T, j_1^T), $$
where $s(\cdot)$ is the path score as defined in (3) and the $\operatorname{logadd}$ runs over all possible label paths $j_1^T$. The number of terms in the $\operatorname{logadd}$ operation grows exponentially with the length of the input sequence. Using a recursive algorithm, the $\operatorname{logadd}$ can be computed in linear time, as presented in [collobert_natural_2011]. The likelihood is then maximized using the stochastic gradient ascent algorithm [bottou_stochastic_1991].
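The linear-time recursion for the logadd term can be sketched as below. This is a numerically stable illustration under the same score definition as above (transition plus per-frame network score, with the initial transition skipped); it is not the exact implementation of the paper.

```python
import math

def logadd(values):
    """log(sum(exp(v))) computed stably by factoring out the maximum."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_partition(scores, A):
    """logadd over the scores of all K^T label paths, computed in
    O(T * K^2) by a forward recursion instead of enumerating paths."""
    K = len(scores[0])
    alpha = list(scores[0])  # logadd of partial paths ending in each class
    for frame in scores[1:]:
        alpha = [logadd([alpha[i] + A[i][j] for i in range(K)]) + frame[j]
                 for j in range(K)]
    return logadd(alpha)

scores = [[0.5, -0.2], [0.1, 0.3]]  # two frames, two classes
A = [[0.0, 1.0], [-1.0, 0.2]]
Z = log_partition(scores, A)  # logadd over the 4 possible label paths
```

For this tiny example the result can be checked against brute-force enumeration of the four paths, while the recursion stays linear in the sequence length.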
5 Experimental Setup
In this section we present the setup used for the experiments, the databases and the hyper-parameters of the networks.
5.1 Databases
5.1.1 TIMIT Corpus
The TIMIT acoustic-phonetic corpus consists of 3,696 training utterances (sampled at 16 kHz) from 462 speakers, excluding the SA sentences. The cross-validation set consists of 400 utterances from 50 speakers. The core test set, containing 192 utterances from 24 speakers and excluding the validation set, was used to report the results. Three different phoneme sets were used: the first one is composed of 39 phonemes, as presented in [lee_speaker-independent_1989]. The second one has 117 classes (3 states for each of the 39 phonemes). The last one uses 183 class labels (3 states for each of the original 61 phonemes). In the last two cases, after decoding, the classes were mapped to the 39-phoneme set for evaluation.
5.1.2 Wall Street Journal Corpus
The SI-284 set of the corpus [woodland_large_1994] was selected for the experiments. The set contains 36,416 sequences, representing around 80 hours of speech. Ten percent of the set was taken as validation set. The "Hub 2 2.5k" set was selected as test set. It contains 215 sequences from 10 speakers. The phoneme sequences were extracted from the transcripts and segmented using a trained GMM. The CMU phoneme set, containing 40 classes, was used.
5.2 Baseline system
The baseline system is a standard HMM/ANN system [morgan_continuous_1995]. MFCC were used as features. They were computed (with HTK [htk]) using a 25 ms Hamming window on the speech signal, with a shift of 10 ms. The signal is represented using 13th-order coefficients along with their first and second derivatives, computed over a 9-frame context. The classifier is a two-layer MLP. The decoding of the sequence was performed by a standard HMM decoder, with a constrained duration of 3 states, and considering all phonemes equally probable.
5.3 Network hyper-parameters
The hyper-parameters of the network are: the input window size, corresponding to the context taken along with each example, the number of samples per example, the kernel width and shift of the convolutions, the number of filters, the width of the hidden layer and the pooling width. They were tuned by early-stopping on the cross-validation set. The ranges considered for the grid search are reported in Table 1. It is worth mentioning that, for a given input window size over the raw signal, the size of the output of the filter extraction stage strongly depends on the number of max-pooling layers, each of them dividing the output size of the filter stage by the chosen pooling kernel width. As a result, adding pooling layers reduces the input size of the classification stage, which in turn reduces the number of parameters of the network (as most parameters lie in the classification stage).
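To make this size bookkeeping concrete, the following sketch computes the number of frames reaching the classification stage for a hypothetical 3-stage network. All numbers here (window size, kernel widths, shifts, pooling widths) are illustrative assumptions, not the tuned values from the experiments.

```python
def conv_out_len(T, kW, dW):
    """Number of output frames of a convolution with kernel kW and shift dW."""
    return (T - kW) // dW + 1

def pool_out_len(T, pW):
    """Non-overlapping max-pooling divides the length by the pooling width."""
    return T // pW

# Hypothetical 250 ms input window at 16 kHz -> 4000 samples,
# followed by three (convolution, pooling) stages.
T = 4000
for kW, dW, pW in [(39, 10, 2), (7, 1, 2), (7, 1, 2)]:
    T = pool_out_len(conv_out_len(T, kW, dW), pW)
# T is now the number of frames fed to the classification stage.
```

Each pooling layer halves the sequence here, so the classifier input (and with it the bulk of the parameter count) shrinks geometrically with the number of pooling layers.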
The best performance for TIMIT corpus on the cross-validation set for the first phoneme set was found with: ms duration for each example, ms of context, , and frames kernel width, , and frames shift, filters, hidden units and pooling width. For the second set: ms duration for each example, ms of context, , and frames kernel width, , and frames shift, filters, hidden units and pooling width. For the third set: ms duration for each example, ms of context, , and frames kernel width, , and frames shift, filters, hidden units and pooling width. For the WSJ corpus: ms duration for each example, ms of context, , and frames kernel width, , and frames shift, filters, hidden units and pooling width. For the baseline, early stopping on the cross-validation set was also used to determine the optimal number of nodes ( and nodes were found). The experiments were implemented using the torch7 toolbox [collobert_torch7_2011].
Table 1: Ranges considered for the grid search.
| Hyper-parameter | Range |
| --- | --- |
| Input window size (ms) | 100-700 |
| Example duration (ms) | 5-15 |
| Kernel width | 1-9 |
| Number of filters per kernel | 10-90 |
| Number of hidden units in the class. stage | 100-1500 |
6 Results and discussion
The results are given in terms of phoneme recognition accuracy, which is based on the Levenshtein distance between the reference and the inferred phoneme sequences. They are presented in Table 2 for the TIMIT corpus for the three phoneme sets, along with the number of parameters of the network. For comparison purposes, the accuracy for the larger phoneme sets is computed by mapping the classes to the first phoneme set. The proposed system yields better performance than the baseline on the TIMIT database. Using a larger number of classes and three states per phoneme also improves the performance. The results for the larger database, the WSJ corpus, are presented in Table 3. The proposed system also yields performance similar to the baseline, showing its scalability.
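For clarity, the accuracy measure can be sketched as follows. The exact scoring convention (here: one minus the edit distance normalized by the reference length, as a percentage) is our assumption of the standard practice, not a detail stated in the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (two-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution (or match)
        prev = cur
    return prev[-1]

def phoneme_accuracy(ref, hyp):
    """Accuracy = 100 * (1 - edit_distance / reference length)."""
    return 100.0 * (1.0 - edit_distance(ref, hyp) / len(ref))

# One deleted phoneme out of three reference phonemes -> ~66.7% accuracy.
acc = phoneme_accuracy(["ah", "b", "k"], ["ah", "k"])
```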
The key difference between the systems is that almost no prior knowledge about the data was used in the proposed system (except that the data is a temporal signal), and it still achieves similar performance. This suggests that the deep network can learn relevant features, thus questioning the use of complex hand-crafted features. Moreover, by learning the transitions between phonemes, the CRF is able to learn a duration model directly on the training data, without any external constraints.
The end-to-end aspect of the proposed system makes it interesting for a stand-alone implementation of a phoneme recognizer, as the system takes sequences of raw speech as input and outputs phoneme sequences. Moreover, the speed of the system at inference makes it suitable for real-time phoneme recognition. A demo will be shown at the time of the conference.
7 Conclusion
In this paper, we proposed an end-to-end phoneme recognition system, which is able to learn features by taking raw speech data as input and yields performance similar to baseline systems. As future work, we plan to improve the current system by investigating deeper architectures or constrained CRFs. We will also extend it to context-dependent phonemes, therefore having more classes, which might lead to better performance, as Table 2 suggests. From there, we will focus on developing more specific applications, such as Spoken Term Detection.
Acknowledgements
This work was partly supported by the HASLER foundation through the grant "Universal Spoken Term Detection with Deep Learning" (DeepSTD) and by the Swiss NSF through the Swiss National Center of Competence in Research (NCCR) on Interactive Multimodal Information Management (www.im2.ch).