Deep Triphone Embedding Improves Phoneme Recognition
In this paper, we present a novel Deep Triphone Embedding (DTE) representation derived from Deep Neural Network (DNN) to encapsulate the discriminative information present in the adjoining speech frames. DTEs are generated using a four hidden layer DNN with 3000 nodes in each hidden layer at the first-stage. This DNN is trained with the tied-triphone classification accuracy as an optimization criterion. Thereafter, we retain the activation vectors (3000) of the last hidden layer, for each speech MFCC frame, and perform dimension reduction to further obtain a 300 dimensional representation, which we termed as DTE. DTEs along with MFCC features are fed into a second-stage four hidden layer DNN, which is subsequently trained for the task of tied-triphone classification. Both DNNs are trained using tri-phone labels generated from a tied-state triphone HMM-GMM system, by performing a forced-alignment between the transcriptions and MFCC feature frames. We conduct the experiments on publicly available TED-LIUM speech corpus. The results show that the proposed DTE method provides an improvement of absolute 2.11 phoneme recognition, when compared with a competitive hybrid tied-state triphone HMM-DNN system.
READ FULL TEXT