1 Introduction
In the last few years, deep neural networks (DNNs) have achieved wide success in different domains due to their capability to learn highly complex input-to-output mappings. Among the different DNN-based models, the recurrent neural network (RNN) is well adapted to process sequential data. Indeed, RNNs build a vector of activations at each timestep to code latent relations between input vectors. Deep RNNs have been recently used to obtain hidden representations of speech unit sequences (Ravanelli et al., 2018a) or text word sequences (Conneau et al., 2018), and to achieve state-of-the-art performances in many speech recognition tasks (Graves et al., 2013a, b; Amodei et al., 2016; Povey et al., 2016; Chiu et al., 2018). However, many recent tasks based on multi-dimensional input features, such as pixels of an image, acoustic features, or orientations of 3D models, require representing both external dependencies between different entities and internal relations between the features that compose each entity. Moreover, RNN-based algorithms commonly require a huge number of parameters to represent sequential data in the hidden space.

Quaternions are hypercomplex numbers that contain a real and three separate imaginary components, perfectly fitting 3- and 4-dimensional feature vectors, such as for image processing and robot kinematics (Sangwine, 1996; Pei & Cheng, 1999; Aspragathos & Dimitros, 1998). The idea of bundling groups of numbers into separate entities is also exploited by the recent manifold and capsule networks (Chakraborty et al., 2018; Sabour et al., 2017). Contrary to traditional homogeneous representations, capsule and quaternion networks bundle sets of features together. Thereby, quaternion numbers allow neural network based models to code latent inter-dependencies between groups of input features during the learning process with fewer parameters than RNNs, by taking advantage of the Hamilton product
as the equivalent of the ordinary product, but between quaternions. Early applications of quaternion-valued backpropagation algorithms (Arena et al., 1994, 1997) have efficiently solved quaternion function approximation tasks. More recently, neural networks of complex and hypercomplex numbers have received increasing attention (Hirose & Yoshida, 2012; Tygert et al., 2016; Danihelka et al., 2016; Wisdom et al., 2016), and some efforts have shown promising results in different applications. In particular, a deep quaternion network (Parcollet et al., 2016, 2017a, 2017b), a deep quaternion convolutional network (Gaudet & Maida, 2018; Parcollet et al., 2018), or a deep complex convolutional network (Trabelsi et al., 2017) have been employed for challenging tasks such as image and language processing. However, these applications do not include recurrent neural networks with operations defined by the quaternion algebra.

This paper proposes to integrate local spectral features in a novel model called quaternion recurrent neural network (QRNN; https://github.com/OrkisResearch/PytorchQuaternionNeuralNetworks), and its gated extension called quaternion long short-term memory neural network (QLSTM). The model is proposed along with a well-adapted parameter initialization and turns out to learn both inter- and intra-dependencies between multidimensional input features and the basic elements of a sequence with drastically fewer parameters (Section 3), making the approach more suitable for low-resource applications. The effectiveness of the proposed QRNN and QLSTM is evaluated on the realistic TIMIT phoneme recognition task (Section 4.2), which shows that both QRNN and QLSTM obtain better performances than RNNs and LSTMs, with a best observed phoneme error rate (PER) of 18.5% and 15.1% for QRNN and QLSTM, compared to 19.0% and 15.3% for RNN and LSTM (test-set values of Tables 1 and 2). Moreover, these results are obtained alongside a marked reduction of the number of free parameters. Similar results are observed with the larger Wall Street Journal (WSJ) dataset, whose detailed performances are reported in Appendix 6.1.1.
2 Motivations
A major challenge of current machine learning models is to represent well, in the latent space, the astonishing amount of data available for recent tasks. For this purpose, a good model has to efficiently encode local relations within the input features, such as between the Red, Green, and Blue (R, G, B) channels of a single image pixel, as well as structural relations, such as those describing edges or shapes composed by groups of pixels. Moreover, in order to learn an adequate representation with the available set of training data and to avoid overfitting, it is convenient to conceive a neural architecture with the smallest number of parameters to be estimated. In the following, we detail the motivations to employ a quaternion-valued RNN instead of a real-valued one to code inter- and intra-feature dependencies with fewer parameters.
As a first step, a better representation of multidimensional data has to be explored to naturally capture internal relations within the input features. For example, an efficient way to represent the information composing an image is to consider each pixel as being a whole entity of three strongly related elements, instead of a group of uni-dimensional elements that could be related to each other, as in traditional real-valued neural networks. Indeed, with a real-valued RNN, the latent relations between the RGB components of a given pixel are hardly coded in the latent space since the weight has to find out these relations among all the pixels composing the image. This problem is effectively solved by replacing real numbers with quaternion numbers. Indeed, quaternions are four-dimensional and allow one to build and process entities made of up to four related features. The quaternion algebra, and more precisely the Hamilton product, allows quaternion neural networks (QNNs) to capture these internal latent relations within the features encoded in a quaternion. It has been shown that QNNs are able to restore the spatial relations within 3D coordinates (Matsui et al., 2004), and within color pixels (Isokawa et al., 2003), while real-valued NNs failed. This is easily explained by the fact that the quaternion-weight components are shared through multiple quaternion-input parts during the Hamilton product, creating relations within the elements. Indeed, Figure 1 shows that the multiple weights required to code latent relations within a feature are considered at the same level as for learning global relations between different features, while the quaternion weight codes these internal relations within a unique quaternion during the Hamilton product (right).
Then, while bigger neural networks allow better performances, quaternion neural networks make it possible to deal with the same signal dimension but with four times fewer neural parameters. Indeed, a 4-number quaternion weight linking two 4-number quaternion units only has 4 degrees of freedom, whereas a standard neural net parametrization has $4 \times 4 = 16$, i.e., a 4-fold saving in memory. Therefore, the natural multidimensional representation of quaternions, together with their ability to drastically reduce the number of parameters, indicates that hypercomplex numbers are a better fit than real numbers to create more efficient models in multidimensional spaces. Based on the success of previous deep quaternion convolutional neural networks and smaller quaternion feed-forward architectures (Kusamichi et al., 2004; Isokawa et al., 2009; Parcollet et al., 2017a), this work proposes to adapt the representation of hypercomplex numbers to the capability of recurrent neural networks in a natural and efficient framework for multidimensional sequential tasks such as speech recognition.

Modern automatic speech recognition systems usually employ input sequences composed of multidimensional acoustic features, such as log Mel features, that are often enriched with their first, second and third time derivatives (Davis & Mermelstein, 1990; Furui, 1986), to integrate contextual information. In standard RNNs, static features are simply concatenated with their derivatives to form a large input vector, without effectively considering that signal derivatives represent different views of the same input. Nonetheless, it is crucial to consider that time derivatives of the spectral energy in a given frequency band at a specific time frame represent a special state of a time-frame, and are linearly correlated (Tokuda et al., 2003). Based on the above motivations and the results observed in previous works on quaternion neural networks, we hypothesize that quaternion RNNs naturally provide a more suitable representation of the input sequence, since these multiple views can be directly embedded in the multiple dimensions of the quaternion space, leading to better generalization.
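The 4-fold saving discussed in this section can be checked with a few lines of arithmetic. The sketch below is only illustrative: the helper names `real_dense_params` and `quaternion_dense_params` are hypothetical, biases are ignored, and a quaternion weight is counted as four real parameters.

```python
def real_dense_params(n_in, n_out):
    """Number of real weights in a standard dense layer (biases ignored)."""
    return n_in * n_out


def quaternion_dense_params(n_in, n_out):
    """Number of real weights in a quaternion dense layer handling the same
    n_in and n_out real values, i.e. n_in/4 and n_out/4 quaternion units."""
    if n_in % 4 or n_out % 4:
        raise ValueError("dimensions must be multiples of 4")
    # One quaternion weight (4 real numbers) per pair of quaternion units.
    return 4 * (n_in // 4) * (n_out // 4)


if __name__ == "__main__":
    # Two 4-number quaternion units: 16 real weights versus 4.
    print(real_dense_params(4, 4), quaternion_dense_params(4, 4))
```

For any dimensions that are multiples of four, the ratio between the two counts is exactly four.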
3 Quaternion recurrent neural networks
This Section describes the quaternion algebra (Section 3.1), the internal quaternion representation (Section 3.2), the backpropagation through time (BPTT) for quaternions (Section 3.3.2), and proposes an adapted weight initialization to quaternion-valued neurons (Section 3.4).

3.1 Quaternion algebra
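The quaternion algebra $\mathbb{H}$ defines operations between quaternion numbers. A quaternion $Q$ is an extension of a complex number defined in a four-dimensional space as:

$$Q = r1 + x\mathbf{i} + y\mathbf{j} + z\mathbf{k} \tag{1}$$

where $r$, $x$, $y$, and $z$ are real numbers, and $1$, $\mathbf{i}$, $\mathbf{j}$, and $\mathbf{k}$ are the quaternion unit basis. In a quaternion, $r$ is the real part, while $x\mathbf{i} + y\mathbf{j} + z\mathbf{k}$ with $\mathbf{i}^2 = \mathbf{j}^2 = \mathbf{k}^2 = \mathbf{i}\mathbf{j}\mathbf{k} = -1$ is the imaginary part, or the vector part. Such a definition can be used to describe spatial rotations. The information embedded in the quaternion $Q$ can be summarized into the following matrix of real numbers:

$$Q_{mat} = \begin{bmatrix} r & -x & -y & -z \\ x & r & -z & y \\ y & z & r & -x \\ z & -y & x & r \end{bmatrix} \tag{2}$$

The conjugate $Q^{*}$ of $Q$ is defined as:

$$Q^{*} = r1 - x\mathbf{i} - y\mathbf{j} - z\mathbf{k} \tag{3}$$

Then, a normalized or unit quaternion $Q^{\triangleleft}$ is expressed as:

$$Q^{\triangleleft} = \frac{Q}{\sqrt{r^2 + x^2 + y^2 + z^2}} \tag{4}$$

Finally, the Hamilton product $\otimes$ between two quaternions $Q_1$ and $Q_2$ is computed as follows:

$$\begin{aligned} Q_1 \otimes Q_2 = {} & (r_1 r_2 - x_1 x_2 - y_1 y_2 - z_1 z_2) \\ & + (r_1 x_2 + x_1 r_2 + y_1 z_2 - z_1 y_2)\,\mathbf{i} \\ & + (r_1 y_2 - x_1 z_2 + y_1 r_2 + z_1 x_2)\,\mathbf{j} \\ & + (r_1 z_2 + x_1 y_2 - y_1 x_2 + z_1 r_2)\,\mathbf{k} \end{aligned} \tag{5}$$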
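As a minimal sketch of the operations of this section, the conjugate, the norm and the Hamilton product (Eq. 5) can be written in plain Python. Quaternions are represented here as `(r, x, y, z)` tuples, which is a choice made for this illustration and not necessarily the representation used in the released code.

```python
import math


def hamilton_product(q1, q2):
    """Hamilton product of two quaternions given as (r, x, y, z) tuples."""
    r1, x1, y1, z1 = q1
    r2, x2, y2, z2 = q2
    return (
        r1 * r2 - x1 * x2 - y1 * y2 - z1 * z2,
        r1 * x2 + x1 * r2 + y1 * z2 - z1 * y2,
        r1 * y2 - x1 * z2 + y1 * r2 + z1 * x2,
        r1 * z2 + x1 * y2 - y1 * x2 + z1 * r2,
    )


def conjugate(q):
    """Conjugate: the sign of the three imaginary components is flipped."""
    r, x, y, z = q
    return (r, -x, -y, -z)


def norm(q):
    """Euclidean norm of the four components."""
    return math.sqrt(sum(c * c for c in q))


if __name__ == "__main__":
    i, j = (0, 1, 0, 0), (0, 0, 1, 0)
    print(hamilton_product(i, j))  # i times j gives k
    print(hamilton_product(j, i))  # j times i gives -k (non-commutative)
```

The product is not commutative, and multiplying a quaternion by its conjugate yields a purely real quaternion whose real part is the squared norm.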
The Hamilton product (a graphical view is depicted in Figure 1) is used in QRNNs to perform transformations of vectors representing quaternions, as well as scaling and interpolation between two rotations following a geodesic over a sphere in the $\mathbb{R}^3$ space as shown in (Minemoto et al., 2017).

3.2 Quaternion representation
The QRNN is an extension of the real-valued (Medsker & Jain, 2001) and complex-valued (Hu & Wang, 2012; Song & Yam, 1998) recurrent neural networks to hypercomplex numbers. In a quaternion dense layer, all parameters are quaternions, including inputs, outputs, weights, and biases. The quaternion algebra is ensured by manipulating matrices of real numbers (Gaudet & Maida, 2018). Consequently, for each input vector of size $N$ and output vector of size $M$, dimensions are split into four parts: the first one equals $r$, the second is $x\mathbf{i}$, the third one equals $y\mathbf{j}$, and the last one $z\mathbf{k}$, to compose a quaternion $Q = r1 + x\mathbf{i} + y\mathbf{j} + z\mathbf{k}$. The inference process of a fully-connected layer is defined in the real-valued space by the dot product between an input vector and a real-valued weight matrix. In a QRNN, this operation is replaced with the Hamilton product (Eq. 5) with quaternion-valued matrices (i.e. each entry in the weight matrix is a quaternion). The computational complexity of quaternion-valued models is discussed in Appendix 6.1.2.
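The real-matrix view of the Hamilton product mentioned above can be sketched as follows. This is an illustrative example under stated assumptions, not the released implementation: quaternions are `(r, x, y, z)` tuples, and the helper names `quaternion_to_matrix`, `matvec` and `quaternion_dense` are hypothetical. Each quaternion weight is expanded into the 4x4 real matrix of Eq. 2, so that the Hamilton product becomes an ordinary matrix-vector product.

```python
def quaternion_to_matrix(w):
    """4x4 real matrix (Eq. 2) whose product with [r, x, y, z] equals w (x) q."""
    r, x, y, z = w
    return [
        [r, -x, -y, -z],
        [x, r, -z, y],
        [y, z, r, -x],
        [z, -y, x, r],
    ]


def matvec(m, v):
    """Plain real-valued matrix-vector product."""
    return tuple(sum(a * b for a, b in zip(row, v)) for row in m)


def quaternion_dense(weights, inputs):
    """Quaternion dense layer without bias or activation.

    weights: M rows of N quaternions, inputs: N quaternions.
    Returns M quaternions, each the sum over j of weights[i][j] (x) inputs[j].
    """
    outputs = []
    for row in weights:
        acc = [0.0, 0.0, 0.0, 0.0]
        for w, q in zip(row, inputs):
            prod = matvec(quaternion_to_matrix(w), q)
            acc = [a + p for a, p in zip(acc, prod)]
        outputs.append(tuple(acc))
    return outputs


if __name__ == "__main__":
    W = [[(1, 0, 0, 0), (0, 1, 0, 0)]]   # one output unit, two input units
    x = [(1, 2, 3, 4), (0, 0, 1, 0)]
    print(quaternion_dense(W, x))
```

In the usage example, the first weight is the identity quaternion and the second is the unit i, so the output is the first input plus i times j, i.e. plus k.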
3.3 Learning algorithm
The QRNN differs from the real-valued RNN in each of the learning sub-processes. Therefore, let $x_t$ be the input vector at timestep $t$, $h_t$ the hidden state, and $W_{hx}$, $W_{hy}$ and $W_{hh}$ the input, output and hidden state weight matrices respectively. The vector $b_h$ is the bias of the hidden state, and $p_t$, $y_t$ are the output and the expected target vectors. More details of the learning process and the parametrization are available in Appendix 6.2.
3.3.1 Forward phase
Based on the forward propagation of the real-valued RNN (Medsker & Jain, 2001), the QRNN forward equations are extended as follows:

$$h_t = \alpha(W_{hh} \otimes h_{t-1} + W_{hx} \otimes x_t + b_h) \tag{6}$$

where $\alpha$ is a quaternion split activation function (Xu et al., 2017; Tripathi, 2016) defined as:

$$\alpha(Q) = f(r) + f(x)\mathbf{i} + f(y)\mathbf{j} + f(z)\mathbf{k} \tag{7}$$

with $f$ corresponding to any standard activation function. The split approach is preferred in this work due to better prior investigations, better stability (i.e. pure quaternion activation functions contain singularities), and simpler computations. The output vector $p_t$ is computed as:

$$p_t = \beta(W_{hy} \otimes h_t) \tag{8}$$

where $\beta$ is any split activation function. Finally, the objective function is a classical loss applied component-wise (e.g., mean squared error, negative log-likelihood).
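The forward phase described above can be illustrated with a small, self-contained sketch. It assumes the usual recurrence in which the hidden state is the split activation of the recurrent term, the input term and the bias; the function names (`q_matvec`, `qrnn_step`), the tuple representation and the choice of `tanh` are assumptions made for this example only.

```python
import math


def hamilton_product(q1, q2):
    """Hamilton product of two (r, x, y, z) tuples."""
    r1, x1, y1, z1 = q1
    r2, x2, y2, z2 = q2
    return (
        r1 * r2 - x1 * x2 - y1 * y2 - z1 * z2,
        r1 * x2 + x1 * r2 + y1 * z2 - z1 * y2,
        r1 * y2 - x1 * z2 + y1 * r2 + z1 * x2,
        r1 * z2 + x1 * y2 - y1 * x2 + z1 * r2,
    )


def q_add(a, b):
    """Component-wise sum of two quaternions."""
    return tuple(u + v for u, v in zip(a, b))


def split_activation(q, f=math.tanh):
    """Split activation: the real function f is applied to each component."""
    return tuple(f(c) for c in q)


def q_matvec(W, v):
    """Quaternion matrix-vector product based on the Hamilton product."""
    out = []
    for row in W:
        acc = (0.0, 0.0, 0.0, 0.0)
        for w, q in zip(row, v):
            acc = q_add(acc, hamilton_product(w, q))
        out.append(acc)
    return out


def qrnn_step(x_t, h_prev, W_hx, W_hh, b_h, f=math.tanh):
    """One forward step: h_t = f_split(W_hh (x) h_prev + W_hx (x) x_t + b_h)."""
    rec = q_matvec(W_hh, h_prev)
    inp = q_matvec(W_hx, x_t)
    return [
        split_activation(q_add(q_add(r, i), b), f)
        for r, i, b in zip(rec, inp, b_h)
    ]


if __name__ == "__main__":
    zero = (0.0, 0.0, 0.0, 0.0)
    h = qrnn_step([(0.5, -0.5, 0.0, 1.0)], [zero],
                  W_hx=[[(1.0, 0.0, 0.0, 0.0)]], W_hh=[[zero]], b_h=[zero])
    print(h)
```

Iterating `qrnn_step` over the frames of a sequence, feeding each returned state back as `h_prev`, gives the full forward pass of a single layer.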
3.3.2 Quaternion Backpropagation Through Time
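The backpropagation through time (BPTT) for quaternion numbers (QBPTT) is an extension of the standard quaternion backpropagation (Nitta, 1995), and its full derivation is available in Appendix 6.3. The gradient with respect to the loss $E_t$ is expressed for each weight matrix as $\frac{\partial E_t}{\partial W_{hy}}$, $\frac{\partial E_t}{\partial W_{hh}}$, $\frac{\partial E_t}{\partial W_{hx}}$, for the bias vector as $\frac{\partial E_t}{\partial b_h}$, and is generalized to $\frac{\partial E_t}{\partial W}$ with:

$$\frac{\partial E_t}{\partial W} = \frac{\partial E_t}{\partial W^{r}} + \mathbf{i}\frac{\partial E_t}{\partial W^{i}} + \mathbf{j}\frac{\partial E_t}{\partial W^{j}} + \mathbf{k}\frac{\partial E_t}{\partial W^{k}} \tag{9}$$

Each term of the above relation is then computed by applying the chain rule. Indeed, and conversely to real-valued backpropagation, QBPTT must define the dynamic of the loss w.r.t. each component of the quaternion neural parameters. As a use-case for the equations, the mean squared error at a timestep $t$, named $E_t$, is used as the loss function. Moreover, let $\lambda$ be a fixed learning rate. First, the weight matrix $W_{hy}$ is only seen in the equations of $p_t$. It is therefore straightforward to update each weight of $W_{hy}$ at timestep $t$ following:

(10)

where $h_t^{*}$ is the conjugate of $h_t$. Then, the weight matrices $W_{hh}$, $W_{hx}$ and biases $b_h$ are arguments of $h_t$ with $h_{t-1}$ involved, and the update equations are derived as:

(11)

with,

(12)

and,

(13)

with $h^{preact}$ and $p^{preact}$ the pre-activation values of the hidden state $h$ and the output $p$ respectively.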
3.4 Parameter initialization
A well-designed parameter initialization scheme strongly impacts the efficiency of a DNN. An appropriate initialization, in fact, improves DNN convergence, reduces the risk of exploding or vanishing gradient, and often leads to a substantial performance improvement (Glorot & Bengio, 2010). It has been shown that the backpropagation through time algorithm of RNNs is degraded by an inappropriate parameter initialization (Sutskever et al., 2013). Moreover, a hypercomplex parameter cannot be simply initialized randomly and component-wise, due to the interactions between components. Therefore, this Section proposes a procedure reported in Algorithm 1 to initialize a matrix $W$ of quaternion-valued weights. The proposed initialization equations are derived from the polar form of a weight $w$ of $W$:

$$w = |w| e^{q^{\triangleleft}_{imag}\theta} = |w|\left(\cos(\theta) + q^{\triangleleft}_{imag}\sin(\theta)\right) \tag{14}$$

and,

$$w_{r} = \varphi\cos(\theta), \quad w_{i} = \varphi\, q^{\triangleleft}_{imag,i}\sin(\theta), \quad w_{j} = \varphi\, q^{\triangleleft}_{imag,j}\sin(\theta), \quad w_{k} = \varphi\, q^{\triangleleft}_{imag,k}\sin(\theta) \tag{15}$$

The angle $\theta$ is randomly generated in the interval $[-\pi, \pi]$. The quaternion $q^{\triangleleft}_{imag}$ is defined as purely normalized imaginary, and is expressed as $q^{\triangleleft}_{imag} = 0 + x\mathbf{i} + y\mathbf{j} + z\mathbf{k}$. The imaginary components $x\mathbf{i}$, $y\mathbf{j}$, and $z\mathbf{k}$ are sampled from a uniform distribution in $[0, 1]$ to obtain $q_{imag}$, which is then normalized (following Eq. 4) to obtain $q^{\triangleleft}_{imag}$. The parameter $\varphi$ is a random number generated with respect to well-known initialization criteria (such as Glorot or He algorithms) (Glorot & Bengio, 2010; He et al., 2015). However, the equations derived in (Glorot & Bengio, 2010; He et al., 2015) are defined for real-valued weight matrices. Therefore, the variance of $W$ has to be investigated in the quaternion space to obtain $\varphi$ (the full demonstration is provided in Appendix 6.2). The variance of $W$ is:

$$Var(W) = \mathbb{E}(|W|^2) - [\mathbb{E}(|W|)]^2, \quad \text{with } [\mathbb{E}(|W|)]^2 = 0 \tag{16}$$

Indeed, the weight distribution is normalized. The value of $Var(W) = \mathbb{E}(|W|^2)$, instead, is not trivial in the case of quaternion-valued matrices. Indeed, $W$ follows a Chi-distribution with four degrees of freedom (DOFs). Consequently, $Var(W)$ is expressed and computed as follows:

$$Var(W) = \mathbb{E}(|W|^2) = \int_{0}^{\infty} x^2 f(x)\, dx = 4\sigma^2 \tag{17}$$

The Glorot (Glorot & Bengio, 2010) and He (He et al., 2015) criteria are extended to quaternions as:

$$\sigma = \frac{1}{\sqrt{2(n_{in} + n_{out})}}, \quad \text{and} \quad \sigma = \frac{1}{\sqrt{2 n_{in}}} \tag{18}$$

with $n_{in}$ and $n_{out}$ the number of neurons of the input and output layers respectively. Finally, $\varphi$ can be sampled from $[-\sigma, \sigma]$ to complete the weight initialization of Eq. 15.
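The initialization procedure described above can be sketched in a few lines. Several choices below are assumptions made for this illustration and should be checked against Algorithm 1 and the released code: the angle is drawn uniformly in [-pi, pi], the three imaginary components uniformly in [0, 1] before normalization, the magnitude uniformly in [-sigma, sigma], and sigma follows the Glorot-like or He-like quaternion criteria. The function name `quaternion_init` is hypothetical.

```python
import math
import random


def quaternion_init(n_in, n_out, criterion="glorot", rng=random):
    """Return an n_out x n_in matrix of (r, x, y, z) quaternion weights."""
    if criterion == "glorot":
        sigma = 1.0 / math.sqrt(2.0 * (n_in + n_out))
    elif criterion == "he":
        sigma = 1.0 / math.sqrt(2.0 * n_in)
    else:
        raise ValueError("criterion must be 'glorot' or 'he'")

    weights = []
    for _ in range(n_out):
        row = []
        for _ in range(n_in):
            theta = rng.uniform(-math.pi, math.pi)   # assumed interval
            phi = rng.uniform(-sigma, sigma)         # assumed interval
            # Purely imaginary quaternion, then normalized to unit length.
            x, y, z = (rng.uniform(0.0, 1.0) for _ in range(3))
            n = math.sqrt(x * x + y * y + z * z) or 1.0
            x, y, z = x / n, y / n, z / n
            # Polar form: magnitude phi, angle theta, unit imaginary axis.
            row.append((
                phi * math.cos(theta),
                phi * x * math.sin(theta),
                phi * y * math.sin(theta),
                phi * z * math.sin(theta),
            ))
        weights.append(row)
    return weights


if __name__ == "__main__":
    W = quaternion_init(8, 4, rng=random.Random(0))
    print(len(W), len(W[0]))
```

Because the imaginary axis has unit length, the norm of each generated weight equals the absolute value of its sampled magnitude, so it never exceeds sigma.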
4 Experiments
This Section details the acoustic feature extraction (Section 4.1), the experimental setups and the results obtained with QRNNs, QLSTMs, RNNs and LSTMs on the TIMIT speech recognition task (Section 4.2). The results reported in bold in the tables are obtained with the best configurations of the neural networks observed with the validation set.

4.1 Quaternion acoustic features
The raw audio is first split every ms with a window of ms. Then dimensional log Mel-filter-bank coefficients with first, second, and third order derivatives are extracted using the pytorch-kaldi toolkit (Ravanelli et al., 2018b; available at https://github.com/mravanelli/pytorchkaldi) and the Kaldi s5 recipes (Povey et al., 2011). An acoustic quaternion $Q(f,t)$ associated with a frequency $f$ and a time-frame $t$ is formed as follows:

$$Q(f,t) = e(f,t) + \frac{\partial e(f,t)}{\partial t}\mathbf{i} + \frac{\partial^2 e(f,t)}{\partial t^2}\mathbf{j} + \frac{\partial^3 e(f,t)}{\partial t^3}\mathbf{k} \tag{19}$$

$Q(f,t)$ represents multiple views of a frequency $f$ at time frame $t$, consisting of the energy $e(f,t)$ in the filter band at frequency $f$, its first time derivative describing a slope view, its second time derivative describing a concavity view, and the third derivative describing the rate of change of the second derivative. Quaternions are used to learn the spatial relations that exist between the different views that characterize the same frequency (Tokuda et al., 2003). Thus, the quaternion input vector length is . Decoding is based on Kaldi (Povey et al., 2011) and weighted finite state transducers (WFST) (Mohri et al., 2002) that integrate acoustic, lexicon and language model probabilities into a single HMM-based search graph.
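To make the construction of the acoustic quaternions concrete, the sketch below builds one quaternion per frame from the energies of a single filter band. It is only an illustration: the derivatives are approximated with simple finite differences, which is an assumption of this example and not the delta computation of the Kaldi recipes, and the function names are hypothetical.

```python
def time_derivative(seq):
    """Finite-difference derivative along time (central inside, one-sided at the ends)."""
    n = len(seq)
    if n < 2:
        return [0.0] * n
    out = []
    for t in range(n):
        if t == 0:
            out.append(float(seq[1] - seq[0]))
        elif t == n - 1:
            out.append(float(seq[-1] - seq[-2]))
        else:
            out.append((seq[t + 1] - seq[t - 1]) / 2.0)
    return out


def acoustic_quaternions(energies):
    """One (energy, 1st, 2nd, 3rd derivative) tuple per time frame of a band."""
    d1 = time_derivative(energies)
    d2 = time_derivative(d1)
    d3 = time_derivative(d2)
    return list(zip(energies, d1, d2, d3))


if __name__ == "__main__":
    band = [0, 1, 4, 9, 16]  # toy energies of one filter band over five frames
    for q in acoustic_quaternions(band):
        print(q)
```

Applying `acoustic_quaternions` to every filter band yields, for each frame, as many quaternions as there are bands, which is the input fed to the quaternion layers.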
4.2 The TIMIT corpus
The training process is based on the standard sentences uttered by speakers, while testing is conducted on sentences uttered by speakers of the TIMIT (Garofolo et al., 1993) dataset. A validation set composed of sentences uttered by speakers is used for hyper-parameter tuning. The models are compared on a fixed number of layers and by varying the number of neurons from 256 to 2,048, and 64 to 512 for the RNN and QRNN respectively. Indeed, it is worth underlining that the number of hidden neurons in the quaternion and real spaces do not handle the same amount of real-number values: 256 quaternion neurons output $256 \times 4 = 1{,}024$ real values. Tanh activations are used across all the layers except for the output layer that is based on a softmax function. Models are optimized with RMSprop with vanilla hyper-parameters and an initial learning rate of . The learning rate is progressively annealed using a halving factor of that is applied when no performance improvement on the validation set is observed. The models are trained during epochs. All the models converged to a minimum loss, due to the annealed learning rate. A dropout rate of is applied over all the hidden layers (Srivastava et al., 2014) except the output one. The negative log-likelihood loss function is used as an objective function. All the experiments are repeated 5 times (5-folds) with different seeds and are averaged to limit any variation due to the random initialization.

Table 1: Phoneme error rate (PER, %) of the RNN and QRNN models on the TIMIT development (Dev.) and test sets, with the number of parameters (Params).

| Models | Neurons | Dev. | Test | Params |
|--------|---------|------|------|--------|
| RNN    | 256     | 22.4 | 23.4 | 1M     |
| RNN    | 512     | 19.6 | 20.4 | 2.8M   |
| RNN    | 1,024   | 17.9 | 19.0 | 9.4M   |
| RNN    | 2,048   | 20.0 | 20.7 | 33.4M  |
| QRNN   | 64      | 23.6 | 23.9 | 0.6M   |
| QRNN   | 128     | 19.2 | 20.1 | 1.4M   |
| QRNN   | 256     | 17.4 | 18.5 | 3.8M   |
| QRNN   | 512     | 17.5 | 18.7 | 11.2M  |
The results on the TIMIT task are reported in Table 1. The best PER in realistic conditions (w.r.t. the best validation PER) is 18.5% and 19.0% on the test set for QRNN and RNN models respectively, highlighting an absolute improvement of 0.5% obtained with QRNN. These results compare favorably with the best results obtained so far with architectures that do not integrate access control in multiple memory layers (Ravanelli et al., 2018a). In the latter, a PER of % is reported on the TIMIT test set with batch-normalized RNNs. Moreover, a remarkable advantage of QRNNs is a drastic reduction (by a factor of about 2.5) of the parameters needed to achieve these results. Indeed, such PERs are obtained with models that employ the same internal dimensionality, corresponding to 1,024 real-valued neurons and 256 quaternion-valued ones, resulting in 3.8M parameters for the QRNN against the 9.4M used in the real-valued RNN. It is also worth noting that QRNNs consistently need fewer parameters than equivalently sized RNNs, with an average reduction factor of about 2.3 over the configurations of Table 1. This is easily explained by considering the content of the quaternion algebra. Indeed, for a fully-connected layer with $N$ input values and $N$ hidden units, a real-valued RNN has $N^2$ parameters, while to maintain equal input and output dimensions the quaternion equivalent has $N/4$ quaternion inputs and $N/4$ quaternion hidden units. Therefore, the number of parameters for the quaternion-valued model is $(N/4)^2 \times 4 = N^2/4$. Such a complexity reduction turns out to produce better results and has other advantages, such as a smaller memory footprint when saving models on memory-constrained systems. This characteristic makes our QRNN model particularly suitable for speech recognition conducted on low computational power devices like smartphones (Chen et al., 2014). QRNN and RNN accuracies vary according to the architecture, with better PER on bigger and wider topologies. Therefore, while good PERs are observed with a higher number of parameters, the smallest architectures performed at 23.4% and 23.9% on the test set, with 1M and 0.6M parameters for the RNN and the QRNN respectively. Such PERs are due to a too small number of parameters to solve the task.

4.3 Quaternion long short-term memory neural networks
We propose to extend the QRNN to state-of-the-art models such as long short-term memory neural networks (LSTM), to support and improve the results already observed with the QRNN compared to the RNN in more realistic conditions. LSTM (Hochreiter & Schmidhuber, 1997) neural networks were introduced to solve the problems of long-term dependency learning and vanishing or exploding gradients observed with long sequences. Based on the equations of the forward propagation and backpropagation through time of the QRNN described in Section 3.3.1 and Section 3.3.2, one can easily derive the equations of a quaternion-valued LSTM. Gates are defined with quaternion numbers following the proposal of Danihelka et al. (2016). Therefore, the gate action is characterized by an independent modification of each component of the quaternion-valued signal following a component-wise product with the quaternion-valued gate potential. Let $f_t$, $i_t$, $o_t$, $c_t$, and $h_t$ be the forget, input, output gates, cell states and the hidden state of a LSTM cell at timestep $t$:
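$$f_t = \sigma(W_f \otimes x_t + R_f \otimes h_{t-1} + b_f) \tag{20}$$

$$i_t = \sigma(W_i \otimes x_t + R_i \otimes h_{t-1} + b_i) \tag{21}$$

$$c_t = f_t \times c_{t-1} + i_t \times \alpha(W_c \otimes x_t + R_c \otimes h_{t-1} + b_c) \tag{22}$$

$$o_t = \sigma(W_o \otimes x_t + R_o \otimes h_{t-1} + b_o) \tag{23}$$

$$h_t = o_t \times \alpha(c_t) \tag{24}$$

where $W$ are rectangular input weight matrices, $R$ are square recurrent weight matrices, and $b$ are bias vectors. $\alpha$ is the split activation function, $\sigma$ is the split sigmoid applied to the gates, and $\times$ denotes a component-wise product between two quaternions. Both QLSTM and LSTM are bidirectional and trained under the same conditions as for the QRNN and RNN experiments.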
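A single QLSTM unit following the gate description above can be sketched as below. This example assumes the standard LSTM structure (sigmoid gates, tanh candidate and output non-linearity) transposed to quaternions, with one quaternion per weight for brevity; the dictionary keys such as `W_f` or `R_c` and the function names are hypothetical.

```python
import math


def hamilton_product(q1, q2):
    """Hamilton product of two (r, x, y, z) tuples."""
    r1, x1, y1, z1 = q1
    r2, x2, y2, z2 = q2
    return (
        r1 * r2 - x1 * x2 - y1 * y2 - z1 * z2,
        r1 * x2 + x1 * r2 + y1 * z2 - z1 * y2,
        r1 * y2 - x1 * z2 + y1 * r2 + z1 * x2,
        r1 * z2 + x1 * y2 - y1 * x2 + z1 * r2,
    )


def q_add(*qs):
    """Component-wise sum of any number of quaternions."""
    return tuple(sum(parts) for parts in zip(*qs))


def q_mul(a, b):
    """Component-wise product, used to apply a gate to a signal."""
    return tuple(u * v for u, v in zip(a, b))


def split(q, f):
    """Split activation: f is applied to each component independently."""
    return tuple(f(c) for c in q)


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def qlstm_step(x_t, h_prev, c_prev, params):
    """One step of a single-unit QLSTM; params maps 'W_g', 'R_g', 'b_g' to quaternions."""
    def pre(gate):
        return q_add(
            hamilton_product(params["W_" + gate], x_t),
            hamilton_product(params["R_" + gate], h_prev),
            params["b_" + gate],
        )

    f_t = split(pre("f"), sigmoid)                      # forget gate
    i_t = split(pre("i"), sigmoid)                      # input gate
    c_t = q_add(q_mul(f_t, c_prev),
                q_mul(i_t, split(pre("c"), math.tanh)))  # cell state
    o_t = split(pre("o"), sigmoid)                      # output gate
    h_t = q_mul(o_t, split(c_t, math.tanh))             # hidden state
    return h_t, c_t


if __name__ == "__main__":
    zero = (0.0, 0.0, 0.0, 0.0)
    params = {p + g: zero for p in ("W_", "R_", "b_") for g in "fico"}
    print(qlstm_step((1.0, 2.0, 3.0, 4.0), zero, (1.0, 2.0, -1.0, 0.0), params))
```

With all parameters set to zero, every gate equals 0.5 on each component, so the new cell state is half of the previous one; this gives a quick sanity check of the gating logic.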
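Table 2: Phoneme error rate (PER, %) of the LSTM and QLSTM models on the TIMIT development (Dev.) and test sets, with the number of parameters (Params).

| Models | Neurons | Dev. | Test | Params |
|--------|---------|------|------|--------|
| LSTM   | 256     | 14.9 | 16.5 | 3.6M   |
| LSTM   | 512     | 14.2 | 16.1 | 12.6M  |
| LSTM   | 1,024   | 14.4 | 15.3 | 46.2M  |
| LSTM   | 2,048   | 14.0 | 15.9 | 176.3M |
| QLSTM  | 64      | 15.5 | 17.0 | 1.6M   |
| QLSTM  | 128     | 14.1 | 16.0 | 4.6M   |
| QLSTM  | 256     | 14.0 | 15.1 | 14.4M  |
| QLSTM  | 512     | 14.2 | 15.1 | 49.9M  |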
The results on the TIMIT corpus reported in Table 2 support the initial intuitions and the previously established trends. We first point out that the best PER observed is 15.1% and 15.3% on the test set for QLSTM and LSTM models respectively, with an absolute improvement of 0.2% obtained with QLSTM using about 3.2 times fewer parameters than LSTM (14.4M against 46.2M). These results are among the top-of-the-line results (Graves et al., 2013b; Ravanelli et al., 2018a) and show that the proposed quaternion approach can be used in state-of-the-art models. A deeper investigation of QLSTM performances with the larger Wall Street Journal (WSJ) dataset can be found in Appendix 6.1.1.
5 Conclusion
Summary. This paper proposes to process sequences of multidimensional features (such as acoustic data) with a novel quaternion recurrent neural network (QRNN) and quaternion long short-term memory neural network (QLSTM). The experiments conducted on the TIMIT phoneme recognition task show that QRNNs and QLSTMs are more effective at learning a compact representation of multidimensional information, outperforming RNNs and LSTMs with roughly two to three times fewer free parameters. Therefore, our initial intuition that the quaternion algebra offers a better and more compact representation for multidimensional features, alongside a better learning capability of feature internal dependencies through the Hamilton product, has been demonstrated.
Future Work. Future investigations will develop other multi-view features that contribute to decreasing ambiguities in representing phonemes in the quaternion space. To this extent, a recent approach based on a quaternion Fourier transform to create quaternion-valued signals has to be investigated. Finally, other high-dimensional neural networks such as manifold and Clifford networks remain mostly unexplored and can benefit from further research.
References
 Amodei et al. (2016) Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In International Conference on Machine Learning, pp. 173–182, 2016.
 Arena et al. (1994) Paolo Arena, Luigi Fortuna, Luigi Occhipinti, and Maria Gabriella Xibilia. Neural networks for quaternion-valued function approximation. In Circuits and Systems, ISCAS’94., IEEE International Symposium on, volume 6, pp. 307–310. IEEE, 1994.
 Arena et al. (1997) Paolo Arena, Luigi Fortuna, Giovanni Muscato, and Maria Gabriella Xibilia. Multilayer perceptrons to approximate quaternion valued functions. Neural Networks, 10(2):335–342, 1997.
 Aspragathos & Dimitros (1998) Nicholas A Aspragathos and John K Dimitros. A comparative study of three methods for robot kinematics. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 28(2):135–145, 1998.
 Chakraborty et al. (2018) Rudrasis Chakraborty, Jose Bouza, Jonathan Manton, and Baba C. Vemuri. Manifoldnet: A deep network framework for manifold-valued data. arXiv preprint arXiv:1809.06211, 2018.
 Chan & Lane (2015) William Chan and Ian Lane. Deep recurrent neural networks for acoustic modelling. arXiv preprint arXiv:1504.01482, 2015.
 Chen et al. (2014) G. Chen, C. Parada, and G. Heigold. Small-footprint keyword spotting using deep neural networks. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4087–4091, May 2014. doi: 10.1109/ICASSP.2014.6854370.
 Chiu et al. (2018) Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J Weiss, Kanishka Rao, Ekaterina Gonina, et al. State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4774–4778. IEEE, 2018.
 Conneau et al. (2018) Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. What you can cram into a single vector: Probing sentence embeddings for linguistic properties, 2018.
 Danihelka et al. (2016) Ivo Danihelka, Greg Wayne, Benigno Uria, Nal Kalchbrenner, and Alex Graves. Associative long short-term memory. arXiv preprint arXiv:1602.03032, 2016.
 Davis & Mermelstein (1990) Steven B Davis and Paul Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. In Readings in speech recognition, pp. 65–74. Elsevier, 1990.
 Furui (1986) Sadaoki Furui. Speaker-independent isolated word recognition based on emphasized spectral dynamics. In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP’86., volume 11, pp. 1991–1994. IEEE, 1986.
 Garofolo et al. (1993) John S Garofolo, Lori F Lamel, William M Fisher, Jonathan G Fiscus, and David S Pallett. Darpa timit acoustic-phonetic continuous speech corpus cd-rom. nist speech disc 1-1.1. NASA STI/Recon technical report n, 93, 1993.
 Gaudet & Maida (2018) Chase J Gaudet and Anthony S Maida. Deep quaternion networks. In 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE, 2018.

 Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In International conference on artificial intelligence and statistics, pp. 249–256, 2010.
 Graves et al. (2013a) Alex Graves, Navdeep Jaitly, and Abdelrahman Mohamed. Hybrid speech recognition with deep bidirectional lstm. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pp. 273–278. IEEE, 2013a.
 Graves et al. (2013b) Alex Graves, Abdelrahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In Acoustics, speech and signal processing (icassp), 2013 ieee international conference on, pp. 6645–6649. IEEE, 2013b.

 He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034, 2015.
 Hirose & Yoshida (2012) Akira Hirose and Shotaro Yoshida. Generalization characteristics of complex-valued feedforward neural networks in relation to signal coherence. IEEE Transactions on Neural Networks and learning systems, 23(4):541–551, 2012.
 Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
 Hu & Wang (2012) Jin Hu and Jun Wang. Global stability of complex-valued recurrent neural networks with time-delays. IEEE Transactions on Neural Networks and Learning Systems, 23(6):853–865, 2012.
 Isokawa et al. (2003) Teijiro Isokawa, Tomoaki Kusakabe, Nobuyuki Matsui, and Ferdinand Peper. Quaternion neural network and its application. In International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pp. 318–324. Springer, 2003.
 Isokawa et al. (2009) Teijiro Isokawa, Nobuyuki Matsui, and Haruhiko Nishimura. Quaternionic neural networks: Fundamental properties and applications. Complex-Valued Neural Networks: Utilizing High-Dimensional Parameters, pp. 411–439, 2009.
 Kingma & Ba (2014) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kusamichi et al. (2004) Hiromi Kusamichi, Teijiro Isokawa, Nobuyuki Matsui, Yuzo Ogawa, and Kazuaki Maeda. A new scheme for color night vision by quaternion neural network. In Proceedings of the 2nd International Conference on Autonomous Robots and Agents, volume 1315, 2004.
 Matsui et al. (2004) Nobuyuki Matsui, Teijiro Isokawa, Hiromi Kusamichi, Ferdinand Peper, and Haruhiko Nishimura. Quaternion neural network with geometrical operators. Journal of Intelligent & Fuzzy Systems, 15(3, 4):149–164, 2004.
 Medsker & Jain (2001) Larry R. Medsker and Lakhmi J. Jain. Recurrent neural networks. Design and Applications, 5, 2001.
 Minemoto et al. (2017) Toshifumi Minemoto, Teijiro Isokawa, Haruhiko Nishimura, and Nobuyuki Matsui. Feed forward neural network with random quaternionic neurons. Signal Processing, 136:59–68, 2017.
 Mohri et al. (2002) Mehryar Mohri, Fernando Pereira, and Michael Riley. Weighted finite-state transducers in speech recognition. Computer Speech and Language, 16(1):69–88, 2002. ISSN 08852308. doi: https://doi.org/10.1006/csla.2001.0184. URL http://www.sciencedirect.com/science/article/pii/S0885230801901846.

 Morchid (2018) Mohamed Morchid. Parsimonious memory unit for recurrent neural networks with application to natural language processing. Neurocomputing, 314:48–64, 2018.
 Nitta (1995) Tohru Nitta. A quaternary version of the backpropagation algorithm. In Neural Networks, 1995. Proceedings., IEEE International Conference on, volume 5, pp. 2753–2756. IEEE, 1995.
 Parcollet et al. (2016) Titouan Parcollet, Mohamed Morchid, Pierre-Michel Bousquet, Richard Dufour, Georges Linarès, and Renato De Mori. Quaternion neural networks for spoken language understanding. In Spoken Language Technology Workshop (SLT), 2016 IEEE, pp. 362–368. IEEE, 2016.
 Parcollet et al. (2017a) Titouan Parcollet, Morchid Mohamed, and Georges Linarès. Quaternion denoising encoder-decoder for theme identification of telephone conversations. Proc. Interspeech 2017, pp. 3325–3328, 2017a.
 Parcollet et al. (2017b) Titouan Parcollet, Mohamed Morchid, and Georges Linares. Deep quaternion neural networks for spoken language understanding. In Automatic Speech Recognition and Understanding Workshop (ASRU), 2017 IEEE, pp. 504–511. IEEE, 2017b.
 Parcollet et al. (2018) Titouan Parcollet, Ying Zhang, Mohamed Morchid, Chiheb Trabelsi, Georges Linarès, Renato de Mori, and Yoshua Bengio. Quaternion convolutional neural networks for end-to-end automatic speech recognition. In Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018., pp. 22–26, 2018. doi: 10.21437/Interspeech.20181898. URL https://doi.org/10.21437/Interspeech.20181898.

Pei & Cheng (1999)
SooChang Pei and ChingMin Cheng.
Color image processing by using binary quaternionmomentpreserving thresholding technique.
IEEE Transactions on Image Processing, 8(5):614–628, 1999.  Povey et al. (2011) Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely. The kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, December 2011. IEEE Catalog No.: CFP11SRWUSB.
 Povey et al. (2016) Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, and Sanjeev Khudanpur. Purely sequencetrained neural networks for asr based on latticefree mmi. In Interspeech, pp. 2751–2755, 2016.

Ravanelli et al. (2018a)
Mirco Ravanelli, Philemon Brakel, Maurizio Omologo, and Yoshua Bengio.
Light gated recurrent units for speech recognition.
IEEE Transactions on Emerging Topics in Computational Intelligence, 2(2):92–102, 2018a.  Ravanelli et al. (2018b) Mirco Ravanelli, Titouan Parcollet, and Yoshua Bengio. The pytorchkaldi speech recognition toolkit. arXiv preprint arXiv:1811.07453, 2018b.
 Sabour et al. (2017) Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing between capsules. arXiv preprint arXiv:1710.09829v2, 2017.
 Sangwine (1996) Stephen John Sangwine. Fourier transforms of colour images using quaternion or hypercomplex, numbers. Electronics letters, 32(21):1979–1980, 1996.
 Song & Yam (1998) Jingyan Song and Yeung Yam. Complex recurrent neural network for computing the inverse and pseudoinverse of the complex matrix. Applied mathematics and computation, 93(2–3):195–205, 1998.
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 Sutskever et al. (2013) Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pp. 1139–1147, 2013.
 Tokuda et al. (2003) Keiichi Tokuda, Heiga Zen, and Tadashi Kitamura. Trajectory modeling based on HMMs with the explicit relationship between static and dynamic features. In Eighth European Conference on Speech Communication and Technology, 2003.
 Trabelsi et al. (2017) Chiheb Trabelsi, Olexa Bilaniuk, Dmitriy Serdyuk, Sandeep Subramanian, João Felipe Santos, Soroush Mehri, Negar Rostamzadeh, Yoshua Bengio, and Christopher J Pal. Deep complex networks. arXiv preprint arXiv:1705.09792, 2017.
 Tripathi (2016) Bipin Kumar Tripathi. High Dimensional Neurocomputing. Springer, 2016.
 Tygert et al. (2016) Mark Tygert, Joan Bruna, Soumith Chintala, Yann LeCun, Serkan Piantino, and Arthur Szlam. A mathematical motivation for complex-valued convolutional networks. Neural computation, 28(5):815–825, 2016.
 Wisdom et al. (2016) Scott Wisdom, Thomas Powers, John Hershey, Jonathan Le Roux, and Les Atlas. Full-capacity unitary recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 4880–4888, 2016.
 Xu et al. (2017) D Xu, L Zhang, and H Zhang. Learning algorithms in quaternion neural networks using GHR calculus. Neural Network World, 27(3):271, 2017.
6 Appendix
6.1 Wall Street Journal experiments and computational complexity
This section validates the scaling of the proposed QLSTMs to a bigger and more realistic corpus, with a speech recognition task on the Wall Street Journal (WSJ) dataset. Finally, it discusses the impact of the quaternion algebra in terms of computational complexity.
6.1.1 Speech recognition with the Wall Street Journal corpus
We evaluate both QLSTMs and LSTMs on a larger and more realistic corpus to validate the scaling of the observed TIMIT results (Section 4.2). Acoustic input features are described in Section 4.1, and are extracted on both the 14 hour subset ‘train-si84’ and the full 81 hour dataset ‘train-si284’ of the Wall Street Journal (WSJ) corpus. The ‘test-dev93’ development set is employed for validation, while ‘test-eval92’ composes the testing set. Model architectures are fixed with respect to the best results observed with the TIMIT corpus (Section 4.2). Therefore, both QLSTMs and LSTMs contain four bidirectional layers of internal dimension of size . Then, an additional layer of internal size is added before the output layer. The only change to the training procedure compared to the TIMIT experiments concerns the model optimizer, which is set to Adam (Kingma & Ba, 2014) instead of RMSPROP. Results are from a folds average.
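Table 3: Word error rates (WER, %) obtained with the 14 hour (WSJ14) and 81 hour (WSJ81) training sets, and number of model parameters.
Models   WSJ14 Dev.   WSJ14 Test   WSJ81 Dev.   WSJ81 Test   Params
LSTM     11.2         7.2          7.4          4.5          53.7M
QLSTM    10.9         6.9          7.2          4.3          18.7M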
The results reported in Table 3 compare favorably with equivalent architectures (Graves et al., 2013a) (WER of on ‘test-dev93’), and are competitive with state-of-the-art and much more complex models based on better engineered features (Chan & Lane, 2015) (WER of with the 81 hours of training data, and on ‘test-eval92’). According to Table 3, QLSTMs outperform LSTMs in all training conditions (14 hours and 81 hours) on both the validation and testing sets. Moreover, QLSTMs need about 2.9 times fewer neural parameters than LSTMs (18.7M against 53.7M) to achieve such performances. This experiment shows that QLSTMs scale well to larger and more realistic speech datasets and remain more parameter-efficient than real-valued LSTMs.
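The comparison drawn from Table 3 can be recomputed with a few lines of arithmetic. The sketch below only reuses the values printed in the table; the dictionary layout and the helper name `relative_reduction` are illustrative choices, not part of the original experiments.

```python
# Values copied from Table 3 (WER in %, parameters in millions).
results = {
    "LSTM": {"wsj14_dev": 11.2, "wsj14_test": 7.2,
             "wsj81_dev": 7.4, "wsj81_test": 4.5, "params_m": 53.7},
    "QLSTM": {"wsj14_dev": 10.9, "wsj14_test": 6.9,
              "wsj81_dev": 7.2, "wsj81_test": 4.3, "params_m": 18.7},
}


def relative_reduction(baseline, value):
    """Relative reduction of `value` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


# How many times fewer parameters the QLSTM uses.
param_ratio = results["LSTM"]["params_m"] / results["QLSTM"]["params_m"]

# Relative WER reduction of the QLSTM over the LSTM for each condition.
conditions = ("wsj14_dev", "wsj14_test", "wsj81_dev", "wsj81_test")
wer_gains = {
    name: relative_reduction(results["LSTM"][name], results["QLSTM"][name])
    for name in conditions
}

print(f"parameter ratio: {param_ratio:.2f}")
for name, gain in wer_gains.items():
    print(f"{name}: {gain:.1f}% relative WER reduction")
```

The relative WER reductions are small compared with the reduction in parameters, which is consistent with reading the result mainly as a gain in parameter efficiency.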
6.1.2 Notes on computational complexity
A computational complexity of $O(n^2)$, with $n$ the number of hidden states, has been reported by Morchid (2018) for real-valued LSTMs. QLSTMs just involve 4 times larger matrices during computations. Therefore, the computational complexity remains unchanged and equal to $O(n^2)$. Nonetheless, due to the Hamilton product, a single forward propagation between two quaternion neurons uses 28 operations (16 multiplications and 12 additions), compared to a single one for two real-valued neurons, implying a longer training time (up to times slower). However, this slowdown could be alleviated with a properly engineered cuDNN kernel for the Hamilton product, which would help QNNs to be more efficient than real-valued ones. A well-adapted CUDA kernel would allow QNNs to perform more computations with fewer parameters, and therefore with fewer memory copy operations from the CPU to the GPU.
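The operation count of a single Hamilton product can be checked directly. The following is a minimal sketch, assuming the standard definition of the Hamilton product on quaternions stored as `(r, i, j, k)` tuples; the `Counted` wrapper and the function names are hypothetical helpers introduced only to count scalar operations.

```python
class Counted:
    """Float wrapper that counts scalar multiplications and additions/subtractions."""
    mults = 0
    adds = 0

    def __init__(self, value):
        self.value = float(value)

    def __mul__(self, other):
        Counted.mults += 1
        return Counted(self.value * other.value)

    def __add__(self, other):
        Counted.adds += 1
        return Counted(self.value + other.value)

    def __sub__(self, other):
        Counted.adds += 1
        return Counted(self.value - other.value)


def hamilton_product(a, b):
    """Hamilton product a (x) b of two quaternions given as (r, i, j, k) tuples."""
    ar, ai, aj, ak = a
    br, bi, bj, bk = b
    return (
        ar * br - ai * bi - aj * bj - ak * bk,  # real part
        ar * bi + ai * br + aj * bk - ak * bj,  # i component
        ar * bj - ai * bk + aj * br + ak * bi,  # j component
        ar * bk + ai * bj - aj * bi + ak * br,  # k component
    )


def count_operations():
    """Multiply the basis elements i and j and report the scalar operation counts."""
    Counted.mults = 0
    Counted.adds = 0
    unit_i = tuple(Counted(v) for v in (0, 1, 0, 0))
    unit_j = tuple(Counted(v) for v in (0, 0, 1, 0))
    result = hamilton_product(unit_i, unit_j)
    return [c.value for c in result], Counted.mults, Counted.adds


product, mults, adds = count_operations()
print(product, mults, adds, mults + adds)
```

Each of the four output components needs four products and three additions or subtractions, which gives the 16 multiplications and 12 additions mentioned above; the example also recovers the identity $\mathbf{i}\mathbf{j} = \mathbf{k}$.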
6.2 Parameters initialization
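Let us recall that a generated quaternion weight $w$ from a weight matrix $W$ has a polar form defined as:
$$w = |w| e^{q_{imag}^{\triangleleft} \theta} = |w| \left(\cos(\theta) + q_{imag}^{\triangleleft} \sin(\theta)\right), \tag{25}$$
with $q_{imag}^{\triangleleft} = 0 + x\mathbf{i} + y\mathbf{j} + z\mathbf{k}$ a purely imaginary and normalized quaternion. Therefore, $w$ can be computed following:
$$w_r = \varphi \cos(\theta), \quad w_i = \varphi\, x \sin(\theta), \quad w_j = \varphi\, y \sin(\theta), \quad w_k = \varphi\, z \sin(\theta). \tag{26}$$
Here, $\varphi$ is a randomly generated variable whose distribution depends on the variance of the quaternion weight and on the selected initialization criterion. The initialization process follows Glorot & Bengio (2010) and He et al. (2015) to derive the variance of the quaternion-valued weight parameters. Indeed, the variance of $W$ has to be investigated:
$$Var(W) = E(|W|^2) - |E(W)|^2. \tag{27}$$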
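$|E(W)|^2$ is equal to $0$ since the weight distribution is symmetric around $0$. Nonetheless, the value of $Var(W) = E(|W|^2)$ is not trivial in the case of quaternion-valued matrices. Indeed, $|W|$ follows a Chi-distribution with four degrees of freedom (DOFs), and $E(|W|^2)$ is expressed and computed as follows:
$$E(|W|^2) = \int_0^{\infty} x^2 f(x)\, dx, \tag{28}$$
where $f(x)$ is the probability density function with four DOFs. A four-dimensional vector $X = \{A, B, C, D\}$ is considered to evaluate the density function $f(x)$. $X$ has components that are normally distributed, centered at zero, and independent. Then, $A$, $B$, $C$ and $D$ have density functions:
$$f_A(x;\sigma) = f_B(x;\sigma) = f_C(x;\sigma) = f_D(x;\sigma) = \frac{e^{-x^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}}. \tag{29}$$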
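The four-dimensional vector $X$ has a length $L$ defined as $L = \sqrt{A^2 + B^2 + C^2 + D^2}$, with a cumulative distribution function $F_L(x;\sigma)$ in the 4-sphere (n-sphere with $n = 4$) $S_x$:
$$F_L(x;\sigma) = \iiiint_{S_x} f_A(a;\sigma) f_B(b;\sigma) f_C(c;\sigma) f_D(d;\sigma)\, dS_x, \tag{30}$$
where $S_x = \{(a,b,c,d) : \sqrt{a^2+b^2+c^2+d^2} < x\}$ and $dS_x = da\, db\, dc\, dd$. The polar representations of the coordinates of $X$ in a 4-dimensional space are defined to compute $dS_x$:
$$a = \rho\cos\theta, \quad b = \rho\sin\theta\cos\phi, \quad c = \rho\sin\theta\sin\phi\cos\psi, \quad d = \rho\sin\theta\sin\phi\sin\psi,$$
where $\rho$ is the magnitude ($\rho = \sqrt{a^2+b^2+c^2+d^2}$) and $\theta$, $\phi$, and $\psi$ are the phases with $0 \le \theta \le \pi$, $0 \le \phi \le \pi$ and $0 \le \psi \le 2\pi$. Then, $dS_x$ is evaluated with the Jacobian $J_f$ of the change of variables $f : (\rho,\theta,\phi,\psi) \mapsto (a,b,c,d)$, defined as:
$$J_f = \frac{\partial(a,b,c,d)}{\partial(\rho,\theta,\phi,\psi)} = \begin{vmatrix}
\cos\theta & -\rho\sin\theta & 0 & 0 \\
\sin\theta\cos\phi & \rho\cos\theta\cos\phi & -\rho\sin\theta\sin\phi & 0 \\
\sin\theta\sin\phi\cos\psi & \rho\cos\theta\sin\phi\cos\psi & \rho\sin\theta\cos\phi\cos\psi & -\rho\sin\theta\sin\phi\sin\psi \\
\sin\theta\sin\phi\sin\psi & \rho\cos\theta\sin\phi\sin\psi & \rho\sin\theta\cos\phi\sin\psi & \rho\sin\theta\sin\phi\cos\psi
\end{vmatrix}.$$
And,
$$J_f = \rho^3 \sin^2\theta \sin\phi. \tag{31}$$
Therefore, by the Jacobian $J_f$, we have the polar form:
$$da\, db\, dc\, dd = \rho^3 \sin^2\theta \sin\phi\; d\rho\, d\theta\, d\phi\, d\psi. \tag{32}$$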
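The Jacobian determinant can be sanity-checked numerically. The sketch below assumes the usual hyperspherical parametrization $a = \rho\cos\theta$, $b = \rho\sin\theta\cos\phi$, $c = \rho\sin\theta\sin\phi\cos\psi$, $d = \rho\sin\theta\sin\phi\sin\psi$, builds the Jacobian by central finite differences, and compares its determinant with the closed form $\rho^3\sin^2\theta\sin\phi$. The function names and the evaluation point are arbitrary illustrative choices.

```python
import math


def to_cartesian(rho, theta, phi, psi):
    """Map hyperspherical coordinates to Cartesian coordinates (a, b, c, d)."""
    return (
        rho * math.cos(theta),
        rho * math.sin(theta) * math.cos(phi),
        rho * math.sin(theta) * math.sin(phi) * math.cos(psi),
        rho * math.sin(theta) * math.sin(phi) * math.sin(psi),
    )


def numerical_jacobian(point, eps=1e-6):
    """Central finite-difference Jacobian (rows: outputs, columns: inputs)."""
    columns = []
    for idx in range(4):
        plus, minus = list(point), list(point)
        plus[idx] += eps
        minus[idx] -= eps
        f_plus, f_minus = to_cartesian(*plus), to_cartesian(*minus)
        columns.append([(p - m) / (2 * eps) for p, m in zip(f_plus, f_minus)])
    return [[columns[c][r] for c in range(4)] for r in range(4)]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


point = (1.7, 0.9, 1.2, 2.5)  # arbitrary (rho, theta, phi, psi)
numeric = determinant(numerical_jacobian(point))
rho, theta, phi, _ = point
closed_form = rho ** 3 * math.sin(theta) ** 2 * math.sin(phi)
print(numeric, closed_form)
```

The two values agree up to the finite-difference error, and the determinant does not depend on $\psi$, as the closed form predicts.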
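Then, writing Eq. (30) in polar coordinates, we obtain:
$$F_L(x;\sigma) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^4 \iiiint_{S_x} e^{-(a^2+b^2+c^2+d^2)/2\sigma^2}\, dS_x = \frac{1}{4\pi^2\sigma^4} \int_0^{2\pi}\!\!\int_0^{\pi}\!\!\int_0^{\pi}\!\!\int_0^{x} e^{-\rho^2/2\sigma^2} \rho^3 \sin^2\theta \sin\phi\; d\rho\, d\theta\, d\phi\, d\psi.$$
The angular integrals contribute $\int_0^{2\pi} d\psi = 2\pi$, $\int_0^{\pi} \sin\phi\, d\phi = 2$ and $\int_0^{\pi} \sin^2\theta\, d\theta = \pi/2$. Then,
$$F_L(x;\sigma) = \frac{1}{4\pi^2\sigma^4} \cdot 2\pi \cdot 2 \cdot \frac{\pi}{2} \int_0^{x} \rho^3 e^{-\rho^2/2\sigma^2}\, d\rho = \frac{1}{2\sigma^4} \int_0^{x} \rho^3 e^{-\rho^2/2\sigma^2}\, d\rho. \tag{33}$$
The probability density function for $L$ is the derivative of its cumulative distribution function, which by the fundamental theorem of calculus is:
$$f_L(x;\sigma) = \frac{d}{dx} F_L(x;\sigma) = \frac{1}{2\sigma^4}\, x^3 e^{-x^2/2\sigma^2}. \tag{34}$$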
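As a quick numerical check of this density, the sketch below integrates $f_L(x;\sigma) = x^3 e^{-x^2/2\sigma^2} / (2\sigma^4)$ with a composite Simpson rule. It verifies that the density integrates to one and compares a partial integral with the closed form $F_L(x;\sigma) = 1 - (1 + u)e^{-u}$, $u = x^2/2\sigma^2$, which follows from the substitution $u = \rho^2/2\sigma^2$ and is an addition of this example rather than a formula from the text. The value of `sigma` and the function names are arbitrary.

```python
import math


def chi4_pdf(x, sigma):
    """Density of the length of a 4-D zero-mean Gaussian vector (std sigma per component)."""
    return x ** 3 * math.exp(-x * x / (2 * sigma ** 2)) / (2 * sigma ** 4)


def chi4_cdf(x, sigma):
    """Closed-form CDF obtained with the substitution u = rho^2 / (2 sigma^2)."""
    u = x * x / (2 * sigma ** 2)
    return 1.0 - (1.0 + u) * math.exp(-u)


def simpson(func, a, b, steps=2000):
    """Composite Simpson rule on [a, b]; `steps` must be even."""
    h = (b - a) / steps
    total = func(a) + func(b)
    for n in range(1, steps):
        total += (4 if n % 2 else 2) * func(a + n * h)
    return total * h / 3


sigma = 0.5
# The tail beyond 12 sigma is negligible, so this approximates the full integral.
total_mass = simpson(lambda x: chi4_pdf(x, sigma), 0.0, 12 * sigma)
# Partial integral up to x = 1, to be compared with the closed-form CDF.
partial = simpson(lambda x: chi4_pdf(x, sigma), 0.0, 1.0)
print(total_mass, partial, chi4_cdf(1.0, sigma))
```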
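The expectation of the squared magnitude becomes:
$$E(|W|^2) = \int_0^{\infty} x^2 f(x)\, dx = \int_0^{\infty} x^2\, \frac{1}{2\sigma^4}\, x^3 e^{-x^2/2\sigma^2}\, dx = \frac{1}{2\sigma^4} \int_0^{\infty} x^5 e^{-x^2/2\sigma^2}\, dx.$$
With integration by parts we obtain:
$$E(|W|^2) = \frac{1}{2\sigma^4} \left( \left[ -\sigma^2 x^4 e^{-x^2/2\sigma^2} \right]_0^{\infty} + \int_0^{\infty} 4\sigma^2 x^3 e^{-x^2/2\sigma^2}\, dx \right). \tag{35}$$
The expectation $E(|W|^2)$ is the sum of two terms. The first one:
$$\left[ -\sigma^2 x^4 e^{-x^2/2\sigma^2} \right]_0^{\infty} = \lim_{x \to +\infty} -\sigma^2 x^4 e^{-x^2/2\sigma^2} - \lim_{x \to 0} -\sigma^2 x^4 e^{-x^2/2\sigma^2} = \lim_{x \to +\infty} -\sigma^2 x^4 e^{-x^2/2\sigma^2}.$$
Based on L’Hôpital’s rule, the undetermined limit becomes:
$$\lim_{x \to +\infty} -\sigma^2 x^4 e^{-x^2/2\sigma^2} = -\sigma^2 \lim_{x \to +\infty} \frac{x^4}{e^{x^2/2\sigma^2}} = -\sigma^2 \lim_{x \to +\infty} \frac{4\sigma^2 x^2}{e^{x^2/2\sigma^2}} = -\sigma^2 \lim_{x \to +\infty} \frac{8\sigma^4}{e^{x^2/2\sigma^2}} = 0, \tag{36}$$
since the exponential grows faster than any polynomial. The second term is calculated in the same way (integration by parts) and becomes, from Eq. (35):
$$\frac{1}{2\sigma^4} \int_0^{\infty} 4\sigma^2 x^3 e^{-x^2/2\sigma^2}\, dx = \frac{2}{\sigma^2} \left( \left[ -\sigma^2 x^2 e^{-x^2/2\sigma^2} \right]_0^{\infty} + \int_0^{\infty} 2\sigma^2 x e^{-x^2/2\sigma^2}\, dx \right).$$
The limit of the first term is equal to $0$ with the same method as in Eq. (36). Therefore, the expectation is:
$$E(|W|^2) = \frac{2}{\sigma^2} \int_0^{\infty} 2\sigma^2 x e^{-x^2/2\sigma^2}\, dx = 4 \left[ -\sigma^2 e^{-x^2/2\sigma^2} \right]_0^{\infty} = 4\sigma^2. \tag{38}$$
And finally the variance is:
$$Var(W) = 4\sigma^2. \tag{39}$$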
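The result $Var(W) = 4\sigma^2$ fixes $\sigma$ once an initialization criterion is chosen: equating $4\sigma^2$ with the Glorot variance $2/(n_{in}+n_{out})$ gives $\sigma = 1/\sqrt{2(n_{in}+n_{out})}$, and with the He variance $2/n_{in}$ gives $\sigma = 1/\sqrt{2 n_{in}}$. The sketch below is one possible way to turn this into a sampler, under stated assumptions: the magnitude is drawn as the length of four independent $N(0,\sigma^2)$ variables (a Chi-distribution with four DOFs), the imaginary axis is a random unit vector, and $\theta$ is uniform in $[-\pi, \pi]$. These sampling choices and all function names are assumptions of this example, not a description of the authors' implementation.

```python
import math
import random


def glorot_sigma(n_in, n_out):
    """Solve 4 * sigma^2 = 2 / (n_in + n_out) for sigma."""
    return 1.0 / math.sqrt(2.0 * (n_in + n_out))


def he_sigma(n_in):
    """Solve 4 * sigma^2 = 2 / n_in for sigma."""
    return 1.0 / math.sqrt(2.0 * n_in)


def sample_quaternion_weight(sigma, rng):
    """Draw one quaternion weight (r, i, j, k) from its polar form."""
    # Assumed magnitude: Chi-distributed with four DOFs and scale sigma.
    magnitude = math.sqrt(sum(rng.gauss(0.0, sigma) ** 2 for _ in range(4)))
    # Assumed axis: random purely imaginary unit quaternion (x, y, z).
    axis = [rng.gauss(0.0, 1.0) for _ in range(3)]
    norm = math.sqrt(sum(c * c for c in axis))
    x, y, z = (c / norm for c in axis)
    # Assumed phase: uniform in [-pi, pi].
    theta = rng.uniform(-math.pi, math.pi)
    return (
        magnitude * math.cos(theta),
        magnitude * x * math.sin(theta),
        magnitude * y * math.sin(theta),
        magnitude * z * math.sin(theta),
    )


def empirical_second_moment(sigma, samples, seed=0):
    """Monte Carlo estimate of E(|w|^2) over `samples` generated weights."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        w = sample_quaternion_weight(sigma, rng)
        total += sum(c * c for c in w)
    return total / samples


sigma = glorot_sigma(64, 64)
moment = empirical_second_moment(sigma, 50000)
print(sigma, moment, 4 * sigma ** 2)
```

Because the axis is normalized, $|w|$ equals the sampled magnitude whatever the phase, so the empirical second moment should approach $4\sigma^2$, here $2/(n_{in}+n_{out})$.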
6.3 Quaternion backpropagation through time
Let us recall the forward equations and parameters needed to derive the complete quaternion backpropagation through time (QBPTT) algorithm.
6.3.1 Recall of the forward phase
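Let $x_t$ be the input vector at timestep $t$, $h_t$ the hidden state, and $W_{hh}$, $W_{hx}$ and $W_{hy}$ the hidden state, input and output weight matrices respectively. Finally, $b_h$ is the bias vector of the hidden states, and $p_t$, $y_t$ are the output and the expected target vectors.
$$h_t = \alpha(h_t^{preact}), \tag{40}$$
with,
$$h_t^{preact} = W_{hh} \otimes h_{t-1} + W_{hx} \otimes x_t + b_h, \tag{41}$$
and $\alpha$ is the quaternion split activation function (Xu et al., 2017) of a quaternion $Q = r + x\mathbf{i} + y\mathbf{j} + z\mathbf{k}$ defined as:
$$\alpha(Q) = f(r) + f(x)\mathbf{i} + f(y)\mathbf{j} + f(z)\mathbf{k}, \tag{42}$$
with $f$ corresponding to any standard activation function. The output vector $p_t$ can be computed as:
$$p_t = \beta(p_t^{preact}), \tag{43}$$
with
$$p_t^{preact} = W_{hy} \otimes h_t, \tag{44}$$
and $\beta$ any split activation function. Finally, the objective function is a real-valued loss function applied component-wise. The gradient with respect to the MSE loss $E_t$ is expressed for each weight matrix as $\partial E_t / \partial W_{hy}$, $\partial E_t / \partial W_{hh}$, $\partial E_t / \partial W_{hx}$, and for the bias vector as $\partial E_t / \partial b_h$. In the real-valued space, the dynamic of the loss is only investigated based on all previously connected neurons. In this respect, QBPTT differs from BPTT because the loss must also be derived with respect to each component of a quaternion neural parameter, making it bi-level. This could act as a regularizer during the training process.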
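The forward phase can be illustrated with a small pure-Python sketch. It assumes the forward equations $h_t = \alpha(W_{hh} \otimes h_{t-1} + W_{hx} \otimes x_t + b_h)$ and $p_t = \beta(W_{hy} \otimes h_t)$, with quaternions stored as `(r, i, j, k)` tuples, $\otimes$ the Hamilton product, and `tanh` as an arbitrary choice for both split activations. All function and variable names are illustrative and do not come from the authors' code.

```python
import math


def hamilton_product(a, b):
    """Hamilton product a (x) b of two quaternions given as (r, i, j, k) tuples."""
    ar, ai, aj, ak = a
    br, bi, bj, bk = b
    return (
        ar * br - ai * bi - aj * bj - ak * bk,
        ar * bi + ai * br + aj * bk - ak * bj,
        ar * bj - ai * bk + aj * br + ak * bi,
        ar * bk + ai * bj - aj * bi + ak * br,
    )


def q_add(a, b):
    """Component-wise sum of two quaternions."""
    return tuple(x + y for x, y in zip(a, b))


def split_activation(q, f):
    """Split activation: apply the real-valued function f to each component."""
    return tuple(f(c) for c in q)


def q_matvec(matrix, vector):
    """Quaternion matrix-vector product where each term is a Hamilton product."""
    out = []
    for row in matrix:
        acc = (0.0, 0.0, 0.0, 0.0)
        for w, v in zip(row, vector):
            acc = q_add(acc, hamilton_product(w, v))
        out.append(acc)
    return out


def qrnn_step(x_t, h_prev, W_hh, W_hx, W_hy, b_h, alpha=math.tanh, beta=math.tanh):
    """One forward step: returns the hidden state h_t and the output p_t."""
    recurrent = q_matvec(W_hh, h_prev)
    inputs = q_matvec(W_hx, x_t)
    preact = [q_add(q_add(r, i), b) for r, i, b in zip(recurrent, inputs, b_h)]
    h_t = [split_activation(q, alpha) for q in preact]
    p_t = [split_activation(q, beta) for q in q_matvec(W_hy, h_t)]
    return h_t, p_t


# Toy example: one quaternion input, one hidden unit, one output unit.
zero = (0.0, 0.0, 0.0, 0.0)
one = (1.0, 0.0, 0.0, 0.0)  # multiplicative identity of the Hamilton product
x_t = [(0.5, -0.5, 0.0, 1.0)]
h_t, p_t = qrnn_step(x_t, [zero], W_hh=[[zero]], W_hx=[[one]], W_hy=[[one]], b_h=[zero])
print(h_t, p_t)
```

With identity input and output weights and a zero recurrent weight, the hidden state reduces to the split activation of the input, which makes the component-wise nature of $\alpha$ and $\beta$ easy to see.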
6.3.2 Output weight matrix
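The weight matrix $W_{hy}$ is used only in the computation of $p_t$. It is therefore straightforward to compute $\partial E_t / \partial W_{hy}$:
$$\frac{\partial E_t}{\partial W_{hy}} = \frac{\partial E_t}{\partial W_{hy}^{r}} + \mathbf{i}\frac{\partial E_t}{\partial W_{hy}^{i}} + \mathbf{j}\frac{\partial E_t}{\partial W_{hy}^{j}} + \mathbf{k}\frac{\partial E_t}{\partial W_{hy}^{k}}. \tag{45}$$
Each quaternion component is then derived following the chain rule:
$$\frac{\partial E_t}{\partial W_{hy}^{r}} = \frac{\partial E_t}{\partial p_t^{r}}\frac{\partial p_t^{r}}{\partial W_{hy}^{r}} + \frac{\partial E_t}{\partial p_t^{i}}\frac{\partial p_t^{i}}{\partial W_{hy}^{r}} + \frac{\partial E_t}{\partial p_t^{j}}\frac{\partial p_t^{j}}{\partial W_{hy}^{r}} + \frac{\partial E_t}{\partial p_t^{k}}\frac{\partial p_t^{k}}{\partial W_{hy}^{r}}, \tag{46}$$
$$\frac{\partial E_t}{\partial W_{hy}^{i}} = \frac{\partial E_t}{\partial p_t^{r}}\frac{\partial p_t^{r}}{\partial W_{hy}^{i}} + \frac{\partial E_t}{\partial p_t^{i}}\frac{\partial p_t^{i}}{\partial W_{hy}^{i}} + \frac{\partial E_t}{\partial p_t^{j}}\frac{\partial p_t^{j}}{\partial W_{hy}^{i}} + \frac{\partial E_t}{\partial p_t^{k}}\frac{\partial p_t^{k}}{\partial W_{hy}^{i}}. \tag{47}$$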