1 Introduction
During the last decade, deep neural networks (DNNs) have met with wide success in numerous application domains. In particular, the performance of automatic speech recognition (ASR) systems has improved remarkably with the emergence of DNNs. Among them, recurrent neural networks (RNNs) [1] have been shown to effectively encode input sequences, increasing the accuracy of neural-network-based ASR systems [2]. Nonetheless, vanilla RNNs suffer from vanishing/exploding gradient issues [3], and lack a memory mechanism to remember patterns in very long or short sequences. These problems have been alleviated by the introduction of long short-term memory (LSTM) RNNs [4], whose gate mechanism allows the model to update or forget information in memory cells, and to select which part of the cell state to expose in the network hidden state. LSTMs have reached state-of-the-art performance on many benchmarks [4, 5], and are widely employed in recent ASR models, with acoustic input features almost unchanged from previous systems.
Traditional ASR systems rely on multidimensional acoustic features, such as the Mel filter bank energies along with their first- and second-order time derivatives, to characterize the time frames that compose the signal sequence. Considering that these components describe three different views of the same element, neural networks have to learn both the internal relations that exist within these views, and the external, or global, dependencies that exist between the time frames. Such concerns are partially addressed by increasing the learning capacity of neural network architectures. Nonetheless, even with a huge set of free parameters, it is not certain that both local and global dependencies are properly represented. To address this problem, new quaternion-valued neural networks, based on a higher-dimensional algebra, are proposed in this paper.
Quaternions are hypercomplex numbers that contain a real and three separate imaginary components, fitting perfectly three- and four-dimensional feature vectors, such as for image processing and robot kinematics [6, 7]. The idea of bundling groups of numbers into separate entities is also exploited by the recent capsule networks [8]. With quaternion numbers, LSTMs can be conceived to encode latent interdependencies between groups of input features during the learning process with fewer parameters than real-valued LSTMs, by taking advantage of the quaternion Hamilton product as the counterpart of the dot product. Early applications of quaternion-valued backpropagation algorithms [9, 10] have shown that quaternion neural networks can efficiently approximate quaternion-valued functions. More recently, neural networks over hypercomplex numbers have received increasing attention, and some efforts have shown promising results in different applications. In particular, deep quaternion networks [11, 12], deep quaternion convolutional networks [13, 14], and a quaternion recurrent neural network [15] have been successfully employed for challenging tasks such as image, speech, and language processing. For speech recognition, [14] used quaternions with only three internal features to encode input speech. An additional internal feature is proposed in this paper to obtain a richer representation with the same number of model parameters.

Based on all the above considerations, the contributions of this paper can be summarized as follows: 1) the introduction of a novel model, called bidirectional quaternion long short-term memory neural network (QLSTM)¹, that avoids known RNN problems also present in quaternion RNNs, and the demonstration that QLSTMs achieve top-of-the-line results on speech recognition; 2) the introduction of a novel input quaternion that integrates four views of speech time frames. The model is first evaluated on a synthetic memory copy task to ensure that the introduction of quaternions into the LSTM model does not alter the basic properties of RNNs. Then, QLSTMs are compared to real-valued LSTMs on a realistic speech recognition task with the Wall Street Journal (WSJ) dataset. The reported results show that the QLSTM outperforms the LSTM in both tasks, with a higher long-memory capability on the memory task, better generalization with lower word error rates (WER), and a significant reduction of the number of neural parameters compared to real-valued LSTMs.

¹ Code is available at https://github.com/Orkis-Research/Pytorch-Quaternion-Neural-Networks
2 Quaternion Algebra
The quaternion algebra defines operations between quaternion numbers. A quaternion Q is an extension of a complex number defined in a four-dimensional space as:

Q = r1 + xi + yj + zk,    (1)
where r, x, y, and z are real numbers, and 1, i, j, and k are the quaternion unit basis. In a quaternion, r1 is the real part, while xi + yj + zk, with i² = j² = k² = ijk = −1, is the imaginary part, or the vector part. Such a definition can be used to describe spatial rotations. The Hamilton product ⊗ between two quaternions Q₁ = r₁1 + x₁i + y₁j + z₁k and Q₂ = r₂1 + x₂i + y₂j + z₂k is computed as follows:

Q₁ ⊗ Q₂ = (r₁r₂ − x₁x₂ − y₁y₂ − z₁z₂)
        + (r₁x₂ + x₁r₂ + y₁z₂ − z₁y₂)i
        + (r₁y₂ − x₁z₂ + y₁r₂ + z₁x₂)j
        + (r₁z₂ + x₁y₂ − y₁x₂ + z₁r₂)k.    (2)
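As a concrete check of the Hamilton product formula, it can be sketched in a few lines of NumPy (a minimal illustration, not the paper's implementation; the function name is ours):

```python
import numpy as np

def hamilton_product(q1, q2):
    """Hamilton product of two quaternions given as (r, x, y, z) arrays."""
    r1, x1, y1, z1 = q1
    r2, x2, y2, z2 = q2
    return np.array([
        r1*r2 - x1*x2 - y1*y2 - z1*z2,   # real part
        r1*x2 + x1*r2 + y1*z2 - z1*y2,   # i component
        r1*y2 - x1*z2 + y1*r2 + z1*x2,   # j component
        r1*z2 + x1*y2 - y1*x2 + z1*r2,   # k component
    ])

# i * j = k, a defining identity of the quaternion basis
i = np.array([0.0, 1.0, 0.0, 0.0])
j = np.array([0.0, 0.0, 1.0, 0.0])
print(hamilton_product(i, j))  # -> [0. 0. 0. 1.]
```

Unlike the dot product, this product is non-commutative (j ⊗ i = −k), which is what lets it mix the four components of each weight and input in a structured way.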
The Hamilton product is used in QLSTMs to perform transformations of vectors representing quaternions, as well as scaling and interpolation between two rotations following a geodesic over a sphere in the ℝ³ space, as shown in [16].

3 Quaternion long short-term memory neural networks
Based on the quaternion algebra and on the previously described motivations, we introduce the quaternion long short-term memory (QLSTM) recurrent neural network. In a quaternion dense layer, all parameters are quaternions, including inputs, outputs, weights, and biases. The quaternion algebra is ensured by manipulating matrices of real numbers [14] that reconstruct the Hamilton product. Consequently, the dimensions of each input and output vector are split into four parts: the first contains the real components r, the second the i imaginary components, the third the j components, and the last the k components. The inference process of a fully-connected layer is defined in the real-valued space by the dot product between an input vector and a real-valued weight matrix. In a QLSTM, this operation is replaced with the Hamilton product ‘⊗’ (Eq. 2) with quaternion-valued matrices (i.e., each entry in the weight matrix is a quaternion).
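The reconstruction of the Hamilton product with real-valued matrices can be illustrated for a single quaternion weight: it contributes a structured 4×4 real block whose 16 entries share only 4 free parameters (a hedged sketch; the actual layers in [14, 15] assemble such blocks over whole weight matrices):

```python
import numpy as np

def quaternion_weight_matrix(r, x, y, z):
    """Real 4x4 matrix reproducing the left Hamilton product w ⊗ q
    for a quaternion weight w = r + xi + yj + zk acting on q = (r, x, y, z)."""
    return np.array([
        [r, -x, -y, -z],
        [x,  r, -z,  y],
        [y,  z,  r, -x],
        [z, -y,  x,  r],
    ])

# One quaternion weight (4 free parameters) spans a full 4x4 real block:
W = quaternion_weight_matrix(1.0, 2.0, 3.0, 4.0)
q = np.array([0.5, -1.0, 0.25, 2.0])  # input quaternion (r, x, y, z)
out = W @ q                            # equivalent to the Hamilton product w ⊗ q
```

This parameter sharing is the source of the QLSTM's parameter savings: 16 real matrix entries cost only 4 trainable values.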
Gates are core components of the memory of LSTMs. Based on [17], we propose to extend this mechanism to quaternion numbers. Therefore, the gate action is characterized by an independent modification of each component of the quaternion-valued signal, following a component-wise product (i.e., in a split fashion [18]) with the quaternion-valued gate potential. Let f_t, i_t, o_t, c_t, and h_t be the forget, input, and output gates, the cell state, and the hidden state of an LSTM cell at timestep t. The QLSTM equations can be derived as:
f_t = σ(W_f ⊗ x_t + R_f ⊗ h_{t−1} + b_f),    (3)
i_t = σ(W_i ⊗ x_t + R_i ⊗ h_{t−1} + b_i),    (4)
c_t = f_t × c_{t−1} + i_t × α(W_c ⊗ x_t + R_c ⊗ h_{t−1} + b_c),    (5)
o_t = σ(W_o ⊗ x_t + R_o ⊗ h_{t−1} + b_o),    (6)
h_t = o_t × α(c_t),    (7)
with × the component-wise product, and σ and α the quaternion split sigmoid and tanh activations [18, 11, 19, 10]. The quaternion weight and bias matrices are initialized following the proposal of [15]. Quaternion bidirectional connections are equivalent to real-valued ones [20]. Consequently, past and future contexts are added together component-wise at each timestep. The full backpropagation algorithm for quaternion-valued recurrent neural networks can be found in [15].
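The cell recurrence above can be sketched for a single quaternion unit, with split sigmoid/tanh activations applied component-wise (an illustrative NumPy sketch under our own naming; real implementations vectorize over whole layers and use the quaternion-aware initialization of [15]):

```python
import numpy as np

def hprod(q1, q2):
    """Hamilton product of two quaternions stored as (r, x, y, z) arrays."""
    r1, x1, y1, z1 = q1
    r2, x2, y2, z2 = q2
    return np.array([r1*r2 - x1*x2 - y1*y2 - z1*z2,
                     r1*x2 + x1*r2 + y1*z2 - z1*y2,
                     r1*y2 - x1*z2 + y1*r2 + z1*x2,
                     r1*z2 + x1*y2 - y1*x2 + z1*r2])

def sigmoid(v):
    # Split activation: applied to each quaternion component independently.
    return 1.0 / (1.0 + np.exp(-v))

def qlstm_step(x_t, h_prev, c_prev, W, R, b):
    """One QLSTM timestep for a single quaternion unit.
    W, R, b: dicts of quaternion parameters keyed by 'f', 'i', 'o', 'c'."""
    gate = lambda g, act: act(hprod(W[g], x_t) + hprod(R[g], h_prev) + b[g])
    f_t = gate('f', sigmoid)                       # forget gate
    i_t = gate('i', sigmoid)                       # input gate
    o_t = gate('o', sigmoid)                       # output gate
    c_t = f_t * c_prev + i_t * gate('c', np.tanh)  # component-wise products
    h_t = o_t * np.tanh(c_t)                       # split tanh on the cell state
    return h_t, c_t
```

Note that the Hamilton product only appears in the affine transforms; the gating itself stays component-wise, exactly as in a real-valued LSTM.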
4 Experiments
This section provides the results of the QLSTM and the LSTM on the synthetic memory copy task (Section 4.1), a description of the quaternion acoustic features (Section 4.2) used as inputs, and a realistic speech recognition experiment with the Wall Street Journal (WSJ) corpus (Section 4.3).
4.1 Synthetic memory copy task as a sanity check
The copy task, originally introduced by [21], is a synthetic test that highlights how RNN-based models manage long-term memory. This characteristic makes the copy task a powerful benchmark to demonstrate that a recurrent model can learn long-term dependencies. It consists of an input sequence of length S, composed of random symbols, followed by a sequence of time lags, or blanks, of size T, and ended by a delimiter that announces the beginning of the copy operation (after which the initial input sequence should be progressively reconstructed at the output). In this paper, the copy task is used as a sanity check to ensure that the introduction of quaternions into the LSTM model does not harm the basic memorization abilities of the LSTM. Both the QLSTM and the LSTM are composed of a single hidden layer, the QLSTM having fewer parameters than the LSTM. It is worth underlining that, due to the nature of the task, the output layer of the QLSTM is real-valued. Indeed, symbols are one-hot encoded (one encoding for the sequence symbols and one for the blank) and cannot be split into four components. Different values of the blank size T are investigated along with a fixed sequence size S. Models are trained with the Adam optimizer, without employing any regularization method, and with the cross-entropy used as the loss function. At each epoch, models are fed with a batch of randomly generated sequences.

The results reported in Fig. 1 highlight a slightly faster convergence of the QLSTM over the LSTM for all investigated sizes of T. It is also worth noticing that the real-valued LSTM failed the copy task for the largest T, while the QLSTM succeeded. This is easily explained by the impact of quaternion numbers on the learning of interdependencies between input features. Indeed, the QLSTM is a smaller (fewer parameters) but more efficient (dealing with higher dimensions) model than the real-valued LSTM, resulting in a higher generalization capability: one quaternion neuron is equivalent to four real-valued ones. Overall, the introduction of quaternions into LSTMs does not alter their basic properties, but provides a higher capability to learn long-term dependencies. We hypothesize that such efficiency improvements, along with a dedicated input representation, will help QLSTMs outperform LSTMs in more realistic tasks, such as speech recognition.
Table 1: Word error rates (WER %) obtained with the 14-hour (WSJ14) and 81-hour (WSJ81) WSJ training sets, on the development (Dev.) and testing (Test) sets. ‘Params’ gives the number of trainable parameters.

Models          WSJ14 Dev.  WSJ14 Test  WSJ81 Dev.  WSJ81 Test  Params
LSTM-3L-256     12.7        8.6         9.5         6.5         4.0M
QLSTM-3L-256    12.8        8.5         9.4         6.5         2.3M
LSTM-4L-256     12.1        8.3         9.3         6.4         4.8M
QLSTM-4L-256    11.9        8.0         9.1         6.2         2.5M
LSTM-3L-512     11.1        7.1         8.2         5.2         12.2M
QLSTM-3L-512    10.9        6.9         8.1         5.1         5.6M
LSTM-4L-512     11.3        7.0         8.1         5.0         15.5M
QLSTM-4L-512    11.1        6.8         8.0         4.9         6.5M
LSTM-3L-1024    11.4        7.3         7.6         4.8         41.2M
QLSTM-3L-1024   11.0        6.9         7.4         4.6         15.5M
LSTM-4L-1024    11.2        7.2         7.4         4.5         53.7M
QLSTM-4L-1024   10.9        6.9         7.2         4.3         18.7M
4.2 Quaternion acoustic features
Unlike [14], this paper proposes to use four internal features in an input quaternion. The raw audio is first split into overlapping frames with a sliding analysis window. Then, log Mel-filter-bank coefficients, along with their first-, second-, and third-order derivatives, are extracted using the pytorch-kaldi² toolkit and the Kaldi s5 recipes [2]. An acoustic quaternion Q(f, t), associated with a frequency band f and a time frame t, is formed as follows:

Q(f, t) = e(f, t) + (∂e(f, t)/∂t) i + (∂²e(f, t)/∂t²) j + (∂³e(f, t)/∂t³) k.    (8)

² pytorch-kaldi is available at https://github.com/mravanelli/pytorch-kaldi
Q(f, t) represents multiple views of a frequency band f at time frame t, consisting of the energy e(f, t) in the filter band at frequency f, its first time derivative describing a slope view, its second time derivative describing a concavity view, and its third time derivative describing the rate of change of the concavity. Quaternions are used to learn latent representations of the external relations between the views characterizing the contents of frequency bands at given time intervals. Thus, the quaternion input vector is four times the number of filter bands long. Decoding is based on Kaldi [2] and weighted finite-state transducers (WFST) that integrate acoustic, lexicon, and language model probabilities into a single HMM-based search graph.
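Packing the filter-bank energies and their derivatives into the four-way split layout of Section 3 (all r components first, then all i, j, and k components) can be sketched as follows (the function name is ours, and we assume the derivative arrays are precomputed by the feature pipeline):

```python
import numpy as np

def to_quaternion_features(fbank, d1, d2, d3):
    """Pack filter-bank energies and their first three time derivatives
    into quaternion inputs.
    fbank, d1, d2, d3: arrays of shape (T, F) -- frames x frequency bands.
    Returns shape (T, 4*F), laid out as [all r | all i | all j | all k],
    matching the four-part split of a quaternion dense layer."""
    return np.concatenate([fbank, d1, d2, d3], axis=-1)
```

With this layout, band f of every view lands in the same quaternion, so the Hamilton product mixes exactly the four views of one frequency band.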
4.3 Speech recognition with the Wall Street Journal
QLSTMs and LSTMs are trained on both the 14-hour subset ‘train-si84’ and the full 81-hour dataset ‘train-si284’ of the Wall Street Journal (WSJ) corpus. The ‘test-dev93’ development set is employed for validation, while ‘test-eval92’ composes the testing set. It is important to notice that all evaluated LSTMs and QLSTMs are bidirectional. The architectures vary in both the number of layers and the number of neurons: the number of recurrent layers varies from three to four, while the layer dimension ranges from 256 to 1,024 (Table 1). Then, one dense layer is stacked, along with an output dense layer. It is also worth noticing that the number of quaternion units of a QLSTM layer is a quarter of the layer dimension. Indeed, QLSTM neurons are four-dimensional (e.g., a QLSTM layer operating on a given dimension has a quarter as many effective quaternion neurons). Models are optimized with Adam with vanilla hyperparameters. The learning rate is progressively annealed, using a halving factor that is applied when no performance improvement on the validation set is observed. All models converged to a minimum loss, due to the annealed learning rate. Results are averaged over three folds.
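The plateau-based annealing described above can be sketched as follows (the 0.5 factor and the one-epoch patience are assumptions; the text only states that a halving factor is applied when the validation score stops improving):

```python
def anneal_lr(lr, dev_error_history, halving_factor=0.5):
    """Halve the learning rate when the validation error stops improving.
    dev_error_history: per-epoch validation error rates, most recent last."""
    if len(dev_error_history) >= 2 and dev_error_history[-1] >= dev_error_history[-2]:
        return lr * halving_factor
    return lr

# Example: the dev error stalled at 9.0, so the rate is halved.
lr = anneal_lr(1e-3, [9.5, 9.0, 9.0])  # -> 5e-4
```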
At first, it is important to notice that the results reported in Table 1 compare favorably with equivalent architectures [5] on ‘test-dev93’, and are competitive with state-of-the-art, much more complex models based on better-engineered features [22] trained on the 81 hours of data and evaluated on ‘test-eval92’. Table 1 shows that the proposed QLSTM always outperforms the real-valued LSTM on the test datasets with fewer neural parameters. On the smaller 14-hour subset, a best WER of 6.9% is reported in realistic conditions (i.e., w.r.t. the best validation set results) with a three-layered QLSTM of size 512, compared to 7.1% for an LSTM of the same size. It is worth mentioning that a best WER of 6.8% is obtained with a four-layered QLSTM of size 512, but without consideration of the validation results. Such performances are obtained with more than a twofold reduction of the number of parameters: 5.6M for the QLSTM compared to 12.2M for its real-valued equivalent. This is easily explained by the quaternion algebra: for a fully-connected layer with equal real input and output dimensions, the quaternion equivalent has a quarter as many quaternion inputs and quaternion hidden units, each weight carrying four real parameters, so the quaternion-valued layer has four times fewer parameters. Such a complexity reduction turns out to produce better results and has other advantages, such as a smaller memory footprint when saving models on memory-constrained systems. This reduction makes the QLSTM memory more “compact”, and therefore the relations learned between quaternion components are more robust to unseen data from both the validation and testing sets. This characteristic makes the QLSTM model particularly suitable for speech recognition conducted on devices with low computational power, such as smartphones. Both QLSTMs and LSTMs produce better results with the 81 hours of training data.
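The fourfold parameter reduction argued above can be verified with a quick count (biases omitted for simplicity; the function names are ours):

```python
def dense_params(n_in, n_out):
    """Weight count of a real-valued fully connected layer."""
    return n_in * n_out

def quaternion_dense_params(n_in, n_out):
    """Same real input/output dimensionality, but weights are quaternions:
    (n_in / 4) x (n_out / 4) quaternion weights, 4 real parameters each."""
    return (n_in // 4) * (n_out // 4) * 4

print(dense_params(1024, 1024))             # -> 1048576
print(quaternion_dense_params(1024, 1024))  # -> 262144, i.e. 4x fewer
```

The overall model reduction observed in Table 1 is smaller than 4x because biases, the real-valued output layer, and other shared components do not shrink.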
As with the smaller subset, QLSTMs always outperform LSTMs during both the validation and testing phases. Indeed, a best WER of 4.3% is reported for a four-layered QLSTM of dimension 1,024, while the best LSTM performed at 4.5% with almost three times more parameters and an equivalently sized architecture.
5 Conclusion
This paper proposes to process sequences of traditional multidimensional acoustic features with a novel quaternion long short-term memory neural network (QLSTM). The paper first introduces a novel quaternion-valued representation of the speech signal to better handle dependencies within signal sequences, and an LSTM composed of quaternions to represent interdependencies between quaternion features in the hidden latent space. The proposed model has been evaluated on a synthetic memory copy task and on a more realistic speech recognition task with the large Wall Street Journal (WSJ) dataset. The reported results support the initial intuitions by showing that QLSTMs are more effective at learning both longer dependencies and a compact representation of multidimensional acoustic speech features, outperforming standard real-valued LSTMs in both experiments with up to almost three times fewer neural parameters. Therefore, as for other quaternion-valued architectures, the intuition that the quaternion algebra of the QLSTM offers a better and more compact representation of multidimensional features, along with a better ability to learn internal dependencies between features through the Hamilton product, has been validated.
References
 [1] Larry R. Medsker and Lakhmi J. Jain, “Recurrent neural networks,” Design and Applications, vol. 5, 2001.
 [2] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely, “The kaldi speech recognition toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. Dec. 2011, IEEE Signal Processing Society, IEEE Catalog No.: CFP11SRWUSB.

 [3] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio, “On the difficulty of training recurrent neural networks,” in International Conference on Machine Learning, 2013, pp. 1310–1318.
 [4] Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber, “LSTM: A search space odyssey,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2222–2232, 2017.
 [5] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed, “Hybrid speech recognition with deep bidirectional LSTM,” in Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013, pp. 273–278.

 [6] Stephen John Sangwine, “Fourier transforms of colour images using quaternion or hypercomplex, numbers,” Electronics Letters, vol. 32, no. 21, pp. 1979–1980, 1996.
 [7] Nicholas A Aspragathos and John K Dimitros, “A comparative study of three methods for robot kinematics,” IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 28, no. 2, pp. 135–145, 1998.
 [8] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton, “Dynamic routing between capsules,” arXiv preprint arXiv:1710.09829v2, 2017.
 [9] Paolo Arena, Luigi Fortuna, Luigi Occhipinti, and Maria Gabriella Xibilia, “Neural networks for quaternionvalued function approximation,” in Circuits and Systems, ISCAS’94., IEEE International Symposium on. IEEE, 1994, vol. 6, pp. 307–310.

 [10] Paolo Arena, Luigi Fortuna, Giovanni Muscato, and Maria Gabriella Xibilia, “Multilayer perceptrons to approximate quaternion valued functions,” Neural Networks, vol. 10, no. 2, pp. 335–342, 1997.
 [11] Titouan Parcollet, Mohamed Morchid, Pierre-Michel Bousquet, Richard Dufour, Georges Linarès, and Renato De Mori, “Quaternion neural networks for spoken language understanding,” in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 362–368.
 [12] Titouan Parcollet, Mohamed Morchid, and Georges Linarès, “Quaternion denoising encoder-decoder for theme identification of telephone conversations,” Proc. Interspeech 2017, pp. 3325–3328, 2017.
 [13] Chase J Gaudet and Anthony S Maida, “Deep quaternion networks,” in 2018 International Joint Conference on Neural Networks (IJCNN). IEEE, 2018, pp. 1–8.

 [14] Titouan Parcollet, Ying Zhang, Mohamed Morchid, Chiheb Trabelsi, Georges Linarès, Renato de Mori, and Yoshua Bengio, “Quaternion convolutional neural networks for end-to-end automatic speech recognition,” in Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2018, pp. 22–26.
 [15] Titouan Parcollet, Mirco Ravanelli, Mohamed Morchid, Georges Linarès, Renato De Mori, and Yoshua Bengio, “Quaternion recurrent neural networks,” 2018.

 [16] Toshifumi Minemoto, Teijiro Isokawa, Haruhiko Nishimura, and Nobuyuki Matsui, “Feed forward neural network with random quaternionic neurons,” Signal Processing, vol. 136, pp. 59–68, 2017.
 [17] Ivo Danihelka, Greg Wayne, Benigno Uria, Nal Kalchbrenner, and Alex Graves, “Associative long short-term memory,” arXiv preprint arXiv:1602.03032, 2016.
 [18] D Xu, L Zhang, and H Zhang, “Learning algorithms in quaternion neural networks using GHR calculus,” Neural Network World, vol. 27, no. 3, p. 271, 2017.
 [19] Titouan Parcollet, Mohamed Morchid, and Georges Linares, “Deep quaternion neural networks for spoken language understanding,” in Automatic Speech Recognition and Understanding Workshop (ASRU), 2017 IEEE. IEEE, 2017, pp. 504–511.
 [20] Alex Graves and Navdeep Jaitly, “Towards endtoend speech recognition with recurrent neural networks,” in International Conference on Machine Learning, 2014, pp. 1764–1772.
 [21] Sepp Hochreiter and Jürgen Schmidhuber, “Long shortterm memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
 [22] William Chan and Ian Lane, “Deep recurrent neural networks for acoustic modelling,” arXiv preprint arXiv:1504.01482, 2015.