1 Introduction
State-of-the-art speech recognition models are based on deep neural networks [1] with weight matrices whose dimensions are in the order of thousands. We have shown that such models can be deployed offline on mobile devices [2]. Decentralizing the training of these models so that it runs on-device can improve personalization and security. However, the advanced optimization techniques used to train these models require additional memory proportional to the number of model parameters. Therefore, one of the major obstacles to achieving high-accuracy on-device models is the memory available on devices for training. Previous explorations to reduce training memory included training only part of the model and/or splitting the gradient computation into multiple steps [3, 4].
In this paper, we propose using low-rank gradient approximation to reduce the training memory needed for advanced optimization techniques. Note that there are already methods in the literature that use low-rank structure to reduce model size [5, 6, 7]. Our proposal does not apply the low-rank approximation to the weight matrices but rather to the gradients, so as to retain the full modeling power of the model. The approach we take is less computationally expensive than Singular Value Decomposition (SVD) [8] and, again, does not constrain the model parameters, since the approximation is used exclusively as a vehicle for gradient computation.

The remainder of the paper is organized as follows. Section 2 describes several optimization techniques and their associated training memory. Section 3 presents the proposed low-rank gradient optimization method. Section 4 analyses the effects of low-rank gradient approximation on training speed and convergence. Section 5 presents the experimental results on a personalization task for on-device speech recognition.
2 Parameter Optimization
Deep neural networks are optimized by minimizing a loss function, which is a nonlinear function of the model parameters. This is done iteratively by updating the parameters in a direction that reduces the loss function. Let us denote the loss by $\mathcal{L}(W)$, where $W$ is the weight matrix to be updated. The update formula can be written as:

$W_{t+1} = W_t + \Delta W_t$   (1)

where $W_t$ and $\Delta W_t$ are the weight matrix and the corresponding update at training step $t$. For gradient descent optimization, the update direction is given by the negative of the gradient of the loss with respect to the weight matrix:

$\Delta W_t = -\eta \, \nabla \mathcal{L}(W_t)$   (2)

where $\eta$ is the learning rate and $\nabla \mathcal{L}(W_t)$ is the gradient at $W_t$, which can be computed using error back-propagation [9]. There are more advanced optimization techniques that compute a better update direction to improve training convergence. For example, the update direction for momentum optimization [10] is computed recursively as follows:

$\Delta W_t = \gamma \, \Delta W_{t-1} - \eta \, \nabla \mathcal{L}(W_t)$   (3)

where $\gamma$ is the momentum coefficient. Additional memory is required to save $\Delta W_{t-1}$ (same size as $W$) for the subsequent training step, doubling the memory relative to the model size. For Adam optimization [11], the update direction is given by:

$\Delta W_t = -\eta \, M_t \oslash \left( \sqrt{V_t} + \epsilon \right)$   (4)

where $\beta_1$, $\beta_2$ and $\epsilon$ are scalar parameters. The first and second moment terms, $M_t$ and $V_t$, are computed recursively as:

$M_t = \beta_1 M_{t-1} + (1 - \beta_1) \, \nabla \mathcal{L}(W_t)$   (5)

$V_t = \beta_2 V_{t-1} + (1 - \beta_2) \, \nabla \mathcal{L}(W_t) \odot \nabla \mathcal{L}(W_t)$   (6)

The symbols $\odot$ and $\oslash$ denote the element-wise multiplication and division operators. Two additional terms, $M_t$ and $V_t$, must be stored, which results in a memory requirement that is 3 times the size of the original model.
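To make the memory overhead concrete, the sketch below (illustrative only, in NumPy; the variable names are ours) tracks the extra per-weight state that each optimizer keeps alongside $W$: none for gradient descent, one matrix for momentum, and two matrices for Adam.

```python
import numpy as np

def sgd_step(W, grad, lr=0.01):
    # Plain gradient descent: no extra state beyond W itself.
    return W - lr * grad

def momentum_step(W, grad, state, lr=0.01, gamma=0.9):
    # Momentum keeps one extra matrix (the previous update), doubling memory.
    state["delta"] = gamma * state.get("delta", np.zeros_like(W)) - lr * grad
    return W + state["delta"]

def adam_step(W, grad, state, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    # Adam keeps two extra matrices (first and second moments): 3x model size.
    state["M"] = b1 * state.get("M", np.zeros_like(W)) + (1 - b1) * grad
    state["V"] = b2 * state.get("V", np.zeros_like(W)) + (1 - b2) * grad * grad
    return W - lr * state["M"] / (np.sqrt(state["V"]) + eps)
```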
3 Low-rank Gradient Approximation
As shown in the previous section, advanced optimization techniques require more memory to store additional terms. To reduce the total amount of memory required, we propose using low-rank gradient approximation. Although a low-rank approximation can be obtained using Singular Value Decomposition (SVD) [8], applying SVD to the gradients at each training step is computationally expensive. Instead, we propose to reparameterize the weight matrix into two parts:

$W = \hat{W} + AB$   (7)

where $\hat{W}$ is an unconstrained matrix with the same size as $W$, and $AB$ is a low-rank matrix of rank $r$. If $W$ is an $m \times n$ matrix, then $A$ and $B$ are matrices of sizes $m \times r$ and $r \times n$, respectively ($r \ll m$, $r \ll n$). With the reparameterization in Eq. 7, we can reduce training memory by keeping $\hat{W}$ fixed and updating only $A$ and $B$. However, this leads to a low-rank model, where the model parameter space is constrained to be low rank. In order to keep the model parameters unconstrained, we treat $W$ as the actual model parameters and use $A$ and $B$ only for the purpose of gradient computation. Therefore, the update of $W$ is constrained to be low-rank by those of $A$ and $B$ such that the effective gradient of $W$ is given by:

$\tilde{\nabla} \mathcal{L}(W) = \nabla \mathcal{L}(A) \, B + A \, \nabla \mathcal{L}(B)$   (8)

The gradients of $A$ and $B$ can be computed from the gradient of $W$ as follows:

$\nabla \mathcal{L}(A) = \nabla \mathcal{L}(W) \, B^\top$   (9)

$\nabla \mathcal{L}(B) = A^\top \nabla \mathcal{L}(W)$   (10)

By substituting Eq. 9 and 10 into Eq. 8, we can rewrite the effective gradient of $W$ as:

$\tilde{\nabla} \mathcal{L}(W) = \nabla \mathcal{L}(W) \, B^\top B + A A^\top \nabla \mathcal{L}(W)$   (11)

where $B^\top B$ and $A A^\top$ are the low-rank projections of the rows and columns of $\nabla \mathcal{L}(W)$, respectively.
It is useful to note that we can compute $\nabla \mathcal{L}(W)$ from the original model and then compute the gradients for $A$ and $B$ using Eq. 9 and 10. The reparameterization in Eq. 7 need not be applied explicitly to the model (i.e., there is no need to modify the model's computational graph). Instead, it can be applied by post-processing the gradient. This makes it easy to apply low-rank gradient training to existing models.
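As a concrete illustration, a minimal NumPy sketch of this post-processing step might look as follows (the function names are ours; it assumes the full gradient of $W$ has already been computed by back-propagation):

```python
import numpy as np

def lowrank_gradients(grad_W, A, B):
    """Project the full gradient of W onto A and B (Eq. 9 and 10)."""
    grad_A = grad_W @ B.T        # shape (m, r)
    grad_B = A.T @ grad_W        # shape (r, n)
    return grad_A, grad_B

def effective_gradient(grad_W, A, B):
    """Effective low-rank gradient of W (Eq. 11)."""
    grad_A, grad_B = lowrank_gradients(grad_W, A, B)
    # Equals grad_W @ B.T @ B + A @ A.T @ grad_W.
    return grad_A @ B + A @ grad_B
```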
For the case of gradient descent, the update direction is given by the negative of the gradient scaled by the learning rate (Eq. 2). From Eq. 8, we get

$\Delta W = -\eta \left( \nabla B^\top B + A A^\top \nabla \right)$   (12)

and the corresponding change in the loss function (ignoring the higher-order terms) is

$\Delta \mathcal{L} \approx \mathrm{tr}\left( \nabla^\top \Delta W \right) = -\eta \left[ \mathrm{tr}\left( \nabla^\top \nabla B^\top B \right) + \mathrm{tr}\left( \nabla^\top A A^\top \nabla \right) \right]$   (13)

where we define $\nabla = \nabla \mathcal{L}(W)$ for clarity. Using the cyclic invariance of the trace, the terms inside the traces can be expressed as positive semi-definite matrices, $\mathrm{tr}(B \nabla^\top \nabla B^\top)$ and $\mathrm{tr}(A^\top \nabla \nabla^\top A)$. This results in a non-positive change to the loss ($\Delta \mathcal{L} \leq 0$), since $\eta > 0$ and the trace of a positive semi-definite matrix is non-negative. Note that in the unrestricted case (without low-rank projection), the change in loss is given by:

$\Delta \mathcal{L} \approx -\eta \, \mathrm{tr}\left( \nabla^\top \nabla \right)$   (14)
In the special case where the columns of $A$ and the rows of $B$ are orthonormal, the projection matrices $A A^\top$ and $B^\top B$ have eigenvalues equal to 1 or 0; in fact, they are rank-$r$ matrices with exactly $r$ eigenvalues equal to 1. A smaller $r$ will result in a smaller trace term in Eq. 13, and therefore a smaller reduction in loss. As a result, we expect the low-rank approximation to slow down training convergence.
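A quick numerical check (illustrative only, with arbitrary shapes and factors of our own choosing) confirms that the projected update never increases the first-order estimate of the loss, regardless of how $A$ and $B$ are chosen:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, eta = 64, 48, 8, 0.1

grad = rng.normal(size=(m, n))       # stand-in for the gradient of W
A = rng.normal(size=(m, r))
B = rng.normal(size=(r, n))

delta_W = -eta * (grad @ B.T @ B + A @ A.T @ grad)  # Eq. 12
delta_L = np.trace(grad.T @ delta_W)                # first-order loss change (Eq. 13)
print(delta_L <= 0)                                 # True: the change is non-positive
```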
3.1 Random Gradient Projection
From Eq. 9 and 10, if $A$ and $B$ are zero matrices, their gradients will also be zero. Therefore, we need to assign non-zero values to $A$ and $B$ at each training step so that they can be updated. Ideally, we would choose $A$ and $B$ to maximize the magnitude of $\Delta \mathcal{L}$ in Eq. 13 using SVD. However, this would be computationally expensive. Instead, we assign them random values. It can be shown that by drawing random values from a zero-mean normal distribution with standard deviations of $1/\sqrt{m}$ and $1/\sqrt{n}$ for $A$ and $B$, respectively, $A^\top A$ and $B B^\top$ are close to an identity matrix ($A$ and $B$ are approximately orthogonal). Furthermore, by comparing Eq. 13 and 14, it is desirable to choose $A$ and $B$ such that the eigenvalues of $A A^\top$ and $B^\top B$ are close to $\tfrac{1}{2}$. This can be accomplished by drawing the values of $A$ and $B$ from zero-mean normal distributions with standard deviations of $1/\sqrt{2m}$ and $1/\sqrt{2n}$, respectively.
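The following short NumPy check (ours, with arbitrary dimensions) illustrates the effect of this scaling: with standard deviations $1/\sqrt{2m}$ and $1/\sqrt{2n}$, the non-zero eigenvalues of $A A^\top$ and $B^\top B$ concentrate around one half.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 1000, 800, 50

# Random projection factors, redrawn at every training step.
A = rng.normal(scale=1.0 / np.sqrt(2 * m), size=(m, r))
B = rng.normal(scale=1.0 / np.sqrt(2 * n), size=(r, n))

print(np.linalg.eigvalsh(A.T @ A).mean())   # ~0.5: non-zero eigenvalues of A A^T
print(np.linalg.eigvalsh(B @ B.T).mean())   # ~0.5: non-zero eigenvalues of B^T B
```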
3.2 Implementations

The low-rank gradient approximation method described above can be implemented by adding new variables $A$ and $B$ for each weight matrix $W$ in the model. Note that constraining the gradient of $W$ to be low-rank (using Eq. 11) does not necessarily yield a low-rank momentum term (e.g., Eq. 6). Instead, it is easier to keep track of the momentum terms by updating $A$ and $B$ separately and then using the updated $A$ and $B$ to update $W$ via Eq. 7. If $A$ and $B$ are updated by $\Delta A$ and $\Delta B$ respectively, the effective update of $W$ is given by

$W_{t+1} = \hat{W} + (A + \Delta A)(B + \Delta B)$   (15)

$\Delta W = \Delta A \, B + A \, \Delta B + \Delta A \, \Delta B$   (16)

Note the additional second-order term ($\Delta A \, \Delta B$) in Eq. 16 compared to Eq. 12. This way, we are able to combine low-rank gradient approximation with existing advanced optimization techniques. In fact, $\Delta W$ in Eq. 16 can be rewritten as:

$\Delta W = (A + \Delta A)(B + \Delta B) - AB$   (17)

That is, the update direction is given by the difference between the new and old low-rank matrices, $(A + \Delta A)(B + \Delta B)$ and $AB$.
The algorithm for computing the low-rank gradients is shown in Algorithm 1. For each training step, we first assign random values to $A$ and $B$ (lines 3 and 4). Next, we compute the gradients for $A$ and $B$ (lines 6 and 7) and update $A$ and $B$ using standard optimization techniques, such as gradient descent, momentum, or Adam (lines 9 or 10). Finally, in line 12, we update $W$ using Eq. 17.
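Algorithm 1 is not reproduced here; the following Python-style sketch (our own, with hypothetical helper names such as `optimizer_step`) summarizes the per-step procedure it describes under the assumptions stated in this section:

```python
import numpy as np

def lowrank_training_step(W, grad_W, opt_state, r, optimizer_step, rng):
    """One training step with low-rank gradient approximation (cf. Algorithm 1)."""
    m, n = W.shape

    # Lines 3-4: draw fresh random projection factors at every step.
    A = rng.normal(scale=1.0 / np.sqrt(2 * m), size=(m, r))
    B = rng.normal(scale=1.0 / np.sqrt(2 * n), size=(r, n))

    # Lines 6-7: project the full gradient onto A and B (Eq. 9 and 10).
    grad_A = grad_W @ B.T
    grad_B = A.T @ grad_W

    # Lines 9-10: update A and B with any standard optimizer (SGD/momentum/Adam),
    # whose state is kept at the size of A and B rather than W.
    A_new = optimizer_step(A, grad_A, opt_state, "A")
    B_new = optimizer_step(B, grad_B, opt_state, "B")

    # Line 12: apply the difference of the low-rank matrices to W (Eq. 17).
    return W + (A_new @ B_new - A @ B)
```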
4 Analysis
We set up a simple problem to analyze and understand the behaviour of the proposed low-rank gradient method. The goal is to learn a matrix $X$ to match a target matrix $Y$. The following mean squared error loss function is used:

$\mathcal{L} = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \left( \sigma(x_{ij}) - \sigma(y_{ij}) \right)^2$   (18)
where $X$ and $Y$ are matrices of size $m \times n$. The $(i,j)$-th elements of $X$ and $Y$ are denoted by $x_{ij}$ and $y_{ij}$, respectively. We use the element-wise nonlinearity $\sigma(\cdot)$ to introduce nonlinearity into the loss. We compared different optimization methods and low-rank projection methods: none means that there is no low-rank gradient approximation, random refers to the case where $A$ and $B$ are randomly set at each training step, and svd means that $A$ and $B$ are estimated by approximating the gradient of $X$ using SVD. We performed 50,000 training steps for each configuration.
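For reference, a minimal NumPy version of this toy objective and its gradient might look as follows (a sketch assuming a sigmoid for $\sigma$; the exact nonlinearity used in our experiments is not reproduced here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(X, Y):
    """Mean squared error between sigma(X) and sigma(Y) (Eq. 18) and its gradient."""
    m, n = X.shape
    diff = sigmoid(X) - sigmoid(Y)
    loss = np.mean(diff ** 2)
    # d/dX of mean(diff^2): chain rule through the sigmoid.
    grad = 2.0 / (m * n) * diff * sigmoid(X) * (1.0 - sigmoid(X))
    return loss, grad
```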
Table 1: Loss and training time after 50,000 training steps.

Optimization Method | Projection Method | Loss    | Training time (seconds)
--------------------|-------------------|---------|------------------------
Gradient Descent    | none              | 0.00510 | 22.4
Gradient Descent    | random            | 1.73146 | 24.0
Gradient Descent    | svd               | 0.44033 | 329.2
Momentum            | none              | 0.00059 | 20.2
Momentum            | random            | 1.72933 | 29.5
Momentum            | svd               | 0.31221 | 347.2
Adam                | none              | 0.00026 | 28.0
Adam                | random            | 0.00008 | 33.2
Adam                | svd               | 0.00028 | 317.6
Table 1 shows the loss and training time after 50,000 training steps. In general, the svd projection yields a much better loss value than the random projection across the different optimization methods, except for Adam optimization, where all projection methods converged to a loss value below 0.0003. On the other hand, the random method is only slightly slower than the standard (none) method, while the svd method takes an order of magnitude longer to train (due to the need to compute an SVD at every training step).
5 Experimental Results
We collected a dataset, which we call WikiNames [4], to evaluate the performance of speech personalization algorithms. The text prompts are sentences extracted from English Wikipedia pages that contain repeated occurrences of politician and artist names that are unique and difficult to recognize (we selected them by synthesizing speech for these names and verifying that our baseline recognizer misrecognizes them).
The dataset aggregates speech data from 100 participants. Each participant provided 50 utterances (on average 4.6 minutes) of training data and 20 utterances (on average 1.9 minutes) of test data. The prompts for each user covered five names, each with 10 training utterances and 4 test utterances, with each name potentially appearing multiple times per utterance. The dataset includes accented and disfluent speech.
We used the WikiNames dataset for personalization experiments. The baseline ASR model is a recurrent neural network transducer (RNN-T) [12] as described in [2]. The models were trained using the efficient implementation [13] in TensorFlow [14]. We measured the success of the modified model using the word error rate (WER) metric as well as the name recall rate [4], defined as:

$\text{recall} = \dfrac{\text{retrieved} \cap \text{relevant}}{\text{relevant}}$   (19)

where retrieved is the number of times names are present in the hypotheses and relevant refers to the number of times names appear in the reference transcripts; retrieved ∩ relevant indicates the number of relevant names that are correctly retrieved.
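As an illustration, one way to compute this recall over a set of hypothesis/reference pairs (a sketch with our own function name and counting convention; the paper's exact scoring script is not shown) is:

```python
def name_recall(hypotheses, references, names):
    """Fraction of name occurrences in the references that are also in the hypotheses."""
    relevant = 0
    retrieved_relevant = 0
    for hyp, ref in zip(hypotheses, references):
        for name in names:
            count_ref = ref.lower().count(name.lower())
            count_hyp = hyp.lower().count(name.lower())
            relevant += count_ref
            retrieved_relevant += min(count_ref, count_hyp)
    return retrieved_relevant / relevant if relevant else 0.0
```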
In addition to tracking quality metrics, we also quantify the impact of the algorithms on training memory by running on-device benchmarks. Comparisons were made between different parameterization ranks, different optimizers, and the full-rank baseline model.
5.1 Memory Benchmark
The low-rank gradient model saves a significant amount of memory. Figure 1 shows the training memory with low-rank gradient projection versus the baseline (full-rank) models, for both the momentum and Adam optimization methods. We adjusted the rank of the gradient projection matrix across experiments to observe the impact on memory. Figure 1 shows that, with the momentum optimizer, the low-rank model uses less memory than the full-rank model for a projection of rank 100; any projection of a lower rank saves even more memory. Similarly, with the Adam optimizer, the modified model saves training memory for projections of up to rank 200. Additionally, the graph illustrates that the training memory increases approximately linearly with the rank. Furthermore, at ranks 100 and 150, low-rank gradient projection with Adam optimization consumes about the same memory as full-rank momentum optimization.
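To see why such crossover ranks arise, consider a back-of-the-envelope count of the optimizer-related memory per weight matrix (our own simplifying assumptions; it ignores activations, gradient buffers, and other overheads that the on-device benchmark also includes, so it does not reproduce the exact crossover points in Figure 1): full-rank Adam stores roughly $3mn$ values ($W$, $M_t$, $V_t$), whereas the low-rank scheme stores $W$ plus Adam state only for $A$ and $B$, i.e. about $mn + 3r(m+n)$ values.

```python
def fullrank_adam_floats(m, n):
    # W plus first and second moment matrices of the same size.
    return 3 * m * n

def lowrank_adam_floats(m, n, r):
    # W plus A, B and their Adam moments (illustrative accounting only).
    return m * n + 3 * r * (m + n)

m, n = 2048, 2048  # hypothetical weight matrix dimensions
for r in (50, 100, 150, 200):
    print(r, lowrank_adam_floats(m, n, r), "<", fullrank_adam_floats(m, n))
```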
5.2 Speech Recognition Performance
The results in Figures 2 and 3 show how the speech recognition quality varies with increasing training steps for different settings. Figure 2 compares the WER for the momentum and Adam optimization methods with and without low-rank projection. Comparing the low-rank and full-rank models, Figure 2 shows that the low-rank models converge more slowly for both momentum and Adam optimization. The latter achieved better performance, indicating that we are able to retain the benefit of Adam optimization by using it to update $A$ and $B$.
Figure 3 shows that the word error rate decreases faster and to a lower value as the rank of the gradient approximation is increased. Training the model using Adam with a gradient projection matrix of rank 150 reached a word error rate of 47.1%, while the baseline model converges to a word error rate of about 43.8%.
Similarly, Figure 4 shows that the name recall rate increases faster and to a higher value with the higher-rank models, as expected.
6 Summary
The experiments detailed above sought to explore an opportunity to save training memory for deep neural network models, such as those used for speech recognition. Approximating the gradient computation using low-rank parameters saves memory up to a rank of about 100 and 200 for momentum and Adam optimization, respectively. These results are promising.
To observe the impact of the new training method on the effectiveness of the model, we ran experiments on WikiNames, a dataset with accented speech and difficult-to-recognize names. The most important metrics from these experiments are the word error rate and the recall of the names in the dataset. We compared how models of different ranks and different optimizers trained, and how their training compared to the baseline model. For the model using the momentum optimizer, the rank did not impact training significantly; furthermore, the low-rank momentum model performed worse than the baseline momentum model. Predictably, the low-rank model with the Adam optimizer performed much better, and the rank had an observable impact on training.
Using a low-rank approximation of the gradient computation for deep neural network models provides an opportunity to save memory without a significant increase in error rate or decrease in recall rate. This opportunity is most promising for on-device training with more advanced optimizers, like Adam, that traditionally use multiple high-dimensional terms for the gradient computation.
References
 [1] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
 [2] Yanzhang He, Tara N Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang, et al., “Streaming end-to-end speech recognition for mobile devices,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6381–6385.
 [3] Khe Chai Sim, Petr Zadrazil, and Françoise Beaufays, “An investigation into on-device personalization of end-to-end automatic speech recognition models,” in Interspeech, 2019.
 [4] Khe Chai Sim, Françoise Beaufays, Arnaud Benard, Dhruv Guliani, Andreas Kabel, Nikhil Khare, Tamar Lucassen, Petr Zadrazil, Harry Zhang, Leif Johnson, Giovanni Motta, and Lillian Zhou, “Personalization of end-to-end speech recognition on mobile devices for named entities,” to appear in ASRU, 2019.
 [5] Jian Xue, Jinyu Li, Dong Yu, Mike Seltzer, and Yifan Gong, “Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network,” in Proc. ICASSP. IEEE, 2014, pp. 6359–6363.
 [6] Yong Zhao, Jinyu Li, and Yifan Gong, “Low-rank plus diagonal adaptation for deep neural networks,” in Proc. ICASSP. IEEE, 2016, pp. 5005–5009.
 [7] Lahiru Samarakoon and Khe Chai Sim, “Factorized hidden layer adaptation for deep neural network based acoustic modeling,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 12, pp. 2241–2250, 2016.
 [8] JC Nash, “The singular-value decomposition and its use to solve least-squares problems,” Compact Numerical Methods for Computers: Linear Algebra and Function Minimisation, pp. 30–48, 1990.
 [9] James L McClelland, David E Rumelhart, and the PDP Research Group, Parallel Distributed Processing, vol. 2, MIT Press, Cambridge, MA, 1987.
 [10] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton, “On the importance of initialization and momentum in deep learning,” in International Conference on Machine Learning, 2013, pp. 1139–1147.
 [11] Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings, 2015.
 [12] Alex Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.
 [13] Tom Bagby, Kanishka Rao, and Khe Chai Sim, “Efficient implementation of recurrent neural network transducer in TensorFlow,” in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 506–512.
 [14] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Józefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Gordon Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda B. Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng, “TensorFlow: Large-scale machine learning on heterogeneous distributed systems,” CoRR, vol. abs/1603.04467, 2016.