Low-rank Gradient Approximation For Memory-Efficient On-device Training of Deep Neural Network

01/24/2020
by   Mary Gooneratne, et al.
Google
Duke University

Training machine learning models on mobile devices has the potential of improving both privacy and accuracy of the models. However, one of the major obstacles to achieving this goal is the memory limitation of mobile devices. Reducing training memory enables models with high-dimensional weight matrices, like automatic speech recognition (ASR) models, to be trained on-device. In this paper, we propose approximating the gradient matrices of deep neural networks using a low-rank parameterization as an avenue to save training memory. The low-rank gradient approximation enables more advanced, memory-intensive optimization techniques to be run on device. Our experimental results show that we can reduce the training memory by about 33.0% for Adam optimization. It uses comparable memory to momentum optimization and achieves a 4.5% relative lower word error rate on a speech personalization task.

1 Introduction

State-of-the-art speech-recognition models are based on deep neural networks [1] with weight matrices whose dimensions are in the order of thousands. We have shown that such models can be deployed offline on mobile devices [2]. Decentralizing the training of these models so that it happens on-device can improve personalization and security. However, the advanced optimization techniques used to train these models require additional memory proportional to the number of model parameters. Therefore, one of the major obstacles to achieving high-accuracy on-device models is the memory available on the device for training. Previous explorations to reduce training memory include training only part of the model and/or splitting the gradient computation into multiple steps [3, 4].

In this paper, we propose using low-rank gradient approximation to reduce the training memory needed for advanced optimization techniques. Note that there are already methods in the literature that use low-rank structure to reduce model size [5, 6, 7]. Our proposal does not apply the low-rank approximation to the weight matrices but rather to the gradients, so as to retain the full modeling power of the model. The approach we take is less computationally expensive than Singular Value Decomposition (SVD) [8] and, again, does not constrain the model parameters, since the approximation is used exclusively as a vehicle for gradient computation.

The remainder of the paper is organized as follows. Section 2 describes several optimization techniques and their associated training memory. Section 3 presents the proposed low-rank gradient optimization method. Section 4 analyses the effects of low-rank gradient approximation on training speed and convergence. Section 5 presents the experimental results on a personalization task for on-device speech recognition.

2 Parameter Optimization

Deep neural networks are optimized by minimizing a loss function, which is a nonlinear function of the model parameters. This is done iteratively by updating the parameters in a direction that reduces the loss function. Let's denote the loss by $\mathcal{L}(W)$, where $W$ is the weight matrix to be updated. The update formula can be computed as:

$W_{t+1} = W_t + \Delta W_t$    (1)

where $W_t$ and $\Delta W_t$ are the weight matrix and the corresponding update at training step $t$. For gradient descent optimization, the update direction is given by the negative of the gradient of the loss with respect to the weight matrix:

$\Delta W_t = -\eta\, G_t$    (2)

where $\eta$ is the learning rate and $G_t = \nabla_W \mathcal{L}(W_t)$ is the gradient at $W_t$, which can be computed using error back propagation [9]. There are more advanced optimization techniques that compute a better update direction to improve training convergence. For example, the update direction for momentum optimization [10] is recursively computed as follows:

$\Delta W_t = \gamma\, \Delta W_{t-1} - \eta\, G_t$    (3)

where $\gamma$ is the momentum coefficient. Additional memory is required to save $\Delta W_{t-1}$ (same size as $W$) for the subsequent training step (doubling the model size). For Adam optimization [11], the update direction is given by:

$\Delta W_t = -\eta\, M_t \oslash \big(\sqrt{V_t} + \epsilon\big)$    (4)

where $\eta$, $\beta_1$, $\beta_2$ and $\epsilon$ are scalar parameters. The first and second momentum terms, $M_t$ and $V_t$, are computed recursively as:

$M_t = \beta_1 M_{t-1} + (1-\beta_1)\, G_t$    (5)
$V_t = \beta_2 V_{t-1} + (1-\beta_2)\, G_t \odot G_t$    (6)

The symbols $\odot$ and $\oslash$ denote the element-wise multiplication and division operators, respectively. Two additional terms, $M_t$ and $V_t$, are introduced, which results in a memory requirement that is 3 times the size of the original model.
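To make the memory accounting concrete, the following sketch (using NumPy and a hypothetical 1000 x 1000 weight matrix; not code from the paper) counts the extra optimizer state each method keeps per weight matrix.

import numpy as np

# Hypothetical weight matrix; the shape is illustrative, not from the paper.
m, n = 1000, 1000
W = np.zeros((m, n), dtype=np.float32)

# Plain gradient descent keeps no persistent optimizer state beyond W itself.
sgd_extra = 0

# Momentum keeps the previous update (one slot variable, same shape as W).
momentum_extra = W.size

# Adam keeps first- and second-moment accumulators (two slots, each shaped like W).
adam_extra = 2 * W.size

print("extra floats per weight matrix:",
      "sgd =", sgd_extra,
      "momentum =", momentum_extra,
      "adam =", adam_extra)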

3 Low-rank Gradient Approximation

As shown in the previous section, advanced optimization techniques require more memory to store additional terms. To reduce the total amount of memory required, we propose using low-rank gradient approximation. Although low-rank approximation can be achieved using Singular Value Decomposition (SVD) [8], applying SVD to the gradients for each training step is computationally expensive. Instead, we propose to re-parameterize the weight matrix $W$ into two parts:

$W = \hat{W} + A B$    (7)

where $\hat{W}$ is an unconstrained matrix with the same size as $W$, and $AB$ is a low-rank matrix of rank $r$. If $W$ is an $m \times n$ matrix, then $A$ and $B$ are matrices of sizes $m \times r$ and $r \times n$, respectively ($r \ll m$, $r \ll n$). With the re-parameterization in Eq. 7, we can reduce training memory by keeping $\hat{W}$ fixed and updating only $A$ and $B$. However, this leads to a low-rank model, where the model parameter space is constrained to be low rank. In order to keep the model parameters unconstrained, we treat $W$ as the actual model parameters, and use $A$ and $B$ only for the purpose of gradient computation. Therefore, the update of $W$ is constrained to be low-rank by those of $A$ and $B$ such that the effective gradient of $W$ is given by:

$\tilde{G}_W = G_A\, B + A\, G_B$    (8)

The gradients of $A$ and $B$, denoted $G_A$ and $G_B$, can be computed from the gradient of $W$, denoted $G_W$, as follows:

$G_A = G_W B^\top$    (9)
$G_B = A^\top G_W$    (10)

By substituting Eq. 9 and 10 into Eq. 8, we can rewrite the effective gradient of $W$ as:

$\tilde{G}_W = G_W B^\top B + A A^\top G_W$    (11)

where $B^\top B$ and $A A^\top$ are the low-rank projections of the rows and columns of $G_W$, respectively.

It is useful to note that we can compute $G_W$ from the original model and then compute the gradients for $A$ and $B$ using Eq. 9 and 10. The re-parameterization in Eq. 7 need not be explicitly applied to the model (i.e. there is no need to modify the model's computational graph). Instead, it can be applied by post-processing the gradient. This makes it easy to apply low-rank gradient training to existing models.
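As a concrete illustration of Eqs. 8-11, the sketch below (NumPy; shapes and variable names are illustrative assumptions) post-processes a gradient $G_W$ that has already been computed by back-propagation, without touching the model's computational graph.

import numpy as np

m, n, r = 64, 32, 4
rng = np.random.default_rng(0)

G_W = rng.standard_normal((m, n))              # gradient of the loss w.r.t. W
A = rng.normal(0.0, 1.0 / np.sqrt(m), (m, r))  # low-rank factors (see Sec. 3.1)
B = rng.normal(0.0, 1.0 / np.sqrt(n), (r, n))

G_A = G_W @ B.T        # Eq. 9: gradient w.r.t. A
G_B = A.T @ G_W        # Eq. 10: gradient w.r.t. B

# Eq. 11: effective gradient of W, obtained purely by post-processing G_W.
G_eff = G_W @ (B.T @ B) + (A @ A.T) @ G_W

# The effective gradient has rank at most 2r.
assert np.linalg.matrix_rank(G_eff) <= 2 * r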

For the case of gradient descent, the update direction is given by the negative of the gradient scaled by the learning rate (Eq. 2). From Eq. 8, we get

$\Delta W = -\eta\, \tilde{G}_W = -\eta \big( G_W B^\top B + A A^\top G_W \big)$    (12)

and the corresponding change in the loss function (ignoring the higher-order terms):

$\Delta\mathcal{L} \approx \operatorname{tr}\!\big(G^\top \Delta W\big) = -\eta \Big[ \operatorname{tr}\!\big(G^\top G\, B^\top B\big) + \operatorname{tr}\!\big(G^\top A A^\top G\big) \Big]$    (13)

where we define $G \equiv G_W$ for clarity. With the cyclic invariance of the trace, we can express the terms inside the trace as positive semi-definite matrices: $\operatorname{tr}(G^\top G\, B^\top B) = \operatorname{tr}\big((G B^\top)^\top (G B^\top)\big)$ and $\operatorname{tr}(G^\top A A^\top G) = \operatorname{tr}\big((A^\top G)^\top (A^\top G)\big)$. This results in a non-positive change to the loss ($\Delta\mathcal{L} \leq 0$), as $\eta > 0$ and the trace of a positive semi-definite matrix is non-negative. Note that in the unrestricted case (without low-rank projection), the change in loss is given by:

$\Delta\mathcal{L} \approx -\eta\, \operatorname{tr}\!\big(G^\top G\big)$    (14)

In the special case where $A$ and $B$ are orthogonal ($A^\top A = B B^\top = I_r$), the projection matrices $A A^\top$ and $B^\top B$ have eigenvalues equal to 1 or 0. In fact, they are rank-$r$ projection matrices with exactly $r$ eigenvalues equal to 1. A smaller $r$ will result in a smaller trace term in Eq. 13, and therefore a smaller reduction in loss. As a result, we expect low-rank approximation to slow down training convergence.
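A quick numerical check of Eqs. 13 and 14 (a sketch under the same notation; the learning rate of 0.1 and the matrix sizes are arbitrary assumptions) confirms that the first-order loss change is non-positive and that its magnitude approaches the full-rank value as the rank grows.

import numpy as np

def delta_loss(G, A, B, lr=0.1):
    # Eq. 13: -lr * tr(G^T (G B^T B + A A^T G)).
    G_eff = G @ (B.T @ B) + (A @ A.T) @ G
    return -lr * np.trace(G.T @ G_eff)

rng = np.random.default_rng(0)
m, n = 100, 80
G = rng.standard_normal((m, n))

for r in (5, 20, 60):
    A = rng.normal(0.0, 1.0 / np.sqrt(m), (m, r))
    B = rng.normal(0.0, 1.0 / np.sqrt(n), (r, n))
    print("rank", r, "first-order delta loss:", delta_loss(G, A, B))  # always <= 0

print("full rank (Eq. 14):", -0.1 * np.trace(G.T @ G))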

3.1 Random Gradient Projection

From Eq. 9 and 10, if $A$ and $B$ are zero matrices, their gradients will also be zero. Therefore, we need to assign non-zero values to $A$ and $B$ at each training step so that they can be updated. Ideally, we want to choose $A$ and $B$ to maximize the magnitude of $\Delta\mathcal{L}$ in Eq. 13, which can be done using SVD. However, this is computationally expensive. Instead, we assign them random values. It can be shown that by drawing random values from a zero-mean normal distribution with standard deviations of $1/\sqrt{m}$ and $1/\sqrt{n}$ for $A$ and $B$, respectively, $A^\top A$ and $B B^\top$ are close to an identity matrix ($A$ and $B$ are approximately orthogonal). Furthermore, by comparing Eq. 13 and 14, it is desirable to choose $A$ and $B$ such that the eigenvalues of $A A^\top$ and $B^\top B$ are close to 1. This can be accomplished by drawing the values of $A$ and $B$ from $\mathcal{N}(0, 1/m)$ and $\mathcal{N}(0, 1/n)$, respectively.
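The sketch below draws $A$ and $B$ as described (assuming entries from $\mathcal{N}(0, 1/m)$ and $\mathcal{N}(0, 1/n)$) and checks numerically that $A^\top A$ and $B B^\top$ are close to the $r \times r$ identity; the sizes used here are arbitrary.

import numpy as np

m, n, r = 2000, 1500, 50
rng = np.random.default_rng(0)

A = rng.normal(0.0, 1.0 / np.sqrt(m), (m, r))
B = rng.normal(0.0, 1.0 / np.sqrt(n), (r, n))

I_r = np.eye(r)
print("max |A^T A - I| =", np.abs(A.T @ A - I_r).max())  # small, roughly O(1/sqrt(m))
print("max |B B^T - I| =", np.abs(B @ B.T - I_r).max())  # small, roughly O(1/sqrt(n))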

3.2 Implementations

The low-rank gradient approximation method described above can be implemented by adding new variables $A$ and $B$ for each weight matrix $W$ in the model. Note that constraining the gradient of $W$ to be low-rank (using Eq. 11) does not necessarily yield a low-rank momentum term (e.g. Eq. 6). Instead, it is easier to keep track of the momentum terms by updating $A$ and $B$ separately and using the updated $A$ and $B$ to update $W$ using Eq. 7. If $A$ and $B$ are updated by $\Delta A$ and $\Delta B$, respectively, the effective update of $W$ is given by:

$W + \Delta W = \hat{W} + (A + \Delta A)(B + \Delta B)$    (15)
$\Delta W = \Delta A\, B + A\, \Delta B + \Delta A\, \Delta B$    (16)

Note the additional second-order term ($\Delta A\, \Delta B$) in Eq. 16 compared to Eq. 12. This way, we are able to combine low-rank gradient approximation with existing advanced optimization techniques. In fact, $\Delta W$ in Eq. 16 can be rewritten as:

$\Delta W = (A + \Delta A)(B + \Delta B) - A B$    (17)

That is, the update direction is given by the difference between the new low-rank matrix, $(A + \Delta A)(B + \Delta B)$, and the old one, $AB$.

1: procedure LowRankUpdate($W$, $G_W$)
2:      # Randomize $A$ and $B$.
3:     $A \sim \mathcal{N}(0, 1/m)$
4:     $B \sim \mathcal{N}(0, 1/n)$
5:      # Compute gradients.
6:     $G_A = G_W B^\top$ (using Eq. 9)
7:     $G_B = A^\top G_W$ (using Eq. 10)
8:      # Update $A$ and $B$.
9:     $A' = A + \Delta A$, with $\Delta A$ computed from $G_A$ by the chosen optimizer
10:     $B' = B + \Delta B$, with $\Delta B$ computed from $G_B$ by the chosen optimizer
11:      # Update $W$.
12:     $W \leftarrow W + A'B' - AB$ (using Eq. 17)
Algorithm 1 Low-rank Gradient Approximation Algorithm

The algorithm for computing the low-rank gradients is shown in Algorithm 1. For each training step, we first assign random values to $A$ and $B$ (lines 3 and 4). Next, we compute the gradients for $A$ and $B$ (lines 6 and 7) and update $A$ and $B$ using standard optimization techniques, such as gradient descent, momentum, or Adam (lines 9 and 10). Finally, in line 12, we update $W$ using Eq. 17.
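A minimal NumPy sketch of Algorithm 1 for a single weight matrix is given below. It uses plain gradient descent for the updates of $A$ and $B$ (lines 9 and 10 of the algorithm); momentum or Adam could be substituted there. The grad_fn callback and the toy quadratic loss in the usage example are hypothetical stand-ins, not the paper's implementation.

import numpy as np

def low_rank_update(W, grad_fn, rank, lr, rng):
    m, n = W.shape
    # Randomize A and B (Algorithm 1, lines 3-4).
    A = rng.normal(0.0, 1.0 / np.sqrt(m), (m, rank))
    B = rng.normal(0.0, 1.0 / np.sqrt(n), (rank, n))

    G_W = grad_fn(W)        # gradient of the loss w.r.t. W
    G_A = G_W @ B.T         # Eq. 9  (line 6)
    G_B = A.T @ G_W         # Eq. 10 (line 7)

    # Optimizer step on A and B (lines 9-10); plain gradient descent here.
    A_new = A - lr * G_A
    B_new = B - lr * G_B

    # Update W by the difference of new and old low-rank terms (Eq. 17, line 12).
    return W + (A_new @ B_new - A @ B)

# Toy usage: a few steps on L(W) = 0.5 * ||W - Y||_F^2, whose gradient is W - Y.
rng = np.random.default_rng(0)
Y = rng.standard_normal((20, 10))
W = np.zeros((20, 10))
for _ in range(100):
    W = low_rank_update(W, lambda W_: W_ - Y, rank=4, lr=0.1, rng=rng)
print("final loss:", 0.5 * np.sum((W - Y) ** 2))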

4 Analysis

We set up a simple problem to analyze and understand the behaviour of the proposed low-rank gradient method. The goal is to learn a matrix $X$ to match a target matrix, $Y$. The following mean squared error loss function is used:

$\mathcal{L}(X) = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \big( \phi(x_{ij}) - y_{ij} \big)^2$    (18)

where $X$ and $Y$ are matrices of size $m \times n$. The $(i,j)$-th elements of $X$ and $Y$ are denoted by $x_{ij}$ and $y_{ij}$, respectively. We use the non-linear function $\phi(\cdot)$ to introduce non-linearity into the objective. We compared different optimization methods and low-rank projection methods: none means that there is no low-rank gradient approximation, random refers to the case where $A$ and $B$ are randomly set at each training step, and svd means that $A$ and $B$ are estimated by approximating the gradient of $X$ using SVD. We performed 50,000 training steps for each configuration.
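For reference, a small sketch of this kind of setup is shown below. The non-linearity (tanh), matrix size, rank, learning rate and step count are assumptions for illustration and do not reproduce the paper's exact settings; it compares the none and random projection methods under plain gradient descent.

import numpy as np

m, n, r, lr, steps = 50, 50, 5, 10.0, 10000
rng = np.random.default_rng(0)
Y = rng.uniform(-0.9, 0.9, (m, n))   # target matrix

def loss_and_grad(X):
    E = np.tanh(X) - Y               # element-wise error (tanh is the assumed phi)
    loss = np.mean(E ** 2)           # mean squared error as in Eq. 18
    grad = 2.0 * E * (1.0 - np.tanh(X) ** 2) / (m * n)
    return loss, grad

X_none, X_random = np.zeros((m, n)), np.zeros((m, n))
for _ in range(steps):
    _, G = loss_and_grad(X_none)
    X_none -= lr * G                                     # projection = none
    _, G = loss_and_grad(X_random)
    A = rng.normal(0.0, 1.0 / np.sqrt(m), (m, r))
    B = rng.normal(0.0, 1.0 / np.sqrt(n), (r, n))
    X_random -= lr * (G @ (B.T @ B) + (A @ A.T) @ G)     # projection = random

print("none:  ", loss_and_grad(X_none)[0])
print("random:", loss_and_grad(X_random)[0])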

Optimization Method   Projection Method   Loss      Training time (seconds)
Gradient Descent      none                0.00510   22.4
                      random              1.73146   24.0
                      svd                 0.44033   329.2
Momentum              none                0.00059   20.2
                      random              1.72933   29.5
                      svd                 0.31221   347.2
Adam                  none                0.00026   28.0
                      random              0.00008   33.2
                      svd                 0.00028   317.6

Table 1: Comparing loss and training time after 50,000 training steps for different optimization and projection methods.

Table 1 shows the loss and training time after 50,000 training steps. In general, svd projection yields a much better loss value than random projection across the different optimization methods, except for Adam optimization, where all projection methods converged to a loss value of less than 0.0003. On the other hand, the random method is only slightly slower than the standard (none) method, while the svd method takes an order of magnitude longer to train (due to the need to compute an SVD at every training step).

5 Experimental Results

We collected a dataset we called Wiki-Names [4] to evaluate the performance of speech personalization algorithms. The text prompts are sentences extracted from English Wikipedia pages that contain repeated occurrences of politician and artist names that are unique and difficult to recognize (we selected them by synthesizing speech for these names and verifying that our baseline recognizer recognizes them incorrectly).

The dataset aggregates speech data from 100 participants. Each participant provided 50 utterances (on average 4.6 minutes) of training data and 20 utterances (on average 1.9 minutes) of test data. The prompts for each user covered five names, each with 10 training utterances and 4 test utterances, with each name potentially appearing multiple times per utterance. The dataset includes accented and disfluent speech.

We used the Wiki-Names dataset for personalization experiments. The baseline ASR model is a recurrent neural network transducer (RNN-T) [12] as described in [2]. The models were trained using the efficient implementation [13] in TensorFlow [14]. We measured the success of the modified model using the word error rate (WER) metric as well as the name recall rate [4], as described below:

$\text{recall} = \dfrac{\text{retrieved} \cap \text{relevant}}{\text{relevant}}$    (19)

where retrieved is the number of times names are present in the hypotheses and relevant refers to the number of times names appear in the reference transcripts. retrieved $\cap$ relevant indicates the number of relevant names that are correctly retrieved.
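A small illustration of the recall computation in Eq. 19 is given below; the per-occurrence counting convention and the example name lists are assumptions for this sketch.

from collections import Counter

def name_recall(reference_names, hypothesis_names):
    # relevant: name occurrences in the references; retrieved: in the hypotheses.
    relevant = Counter(reference_names)
    retrieved = Counter(hypothesis_names)
    # retrieved-and-relevant: relevant occurrences that were correctly recognized.
    hits = sum(min(count, retrieved[name]) for name, count in relevant.items())
    return hits / max(1, sum(relevant.values()))

# Example: 3 relevant name occurrences, 2 of them retrieved -> recall = 2/3.
print(name_recall(["NameA", "NameA", "NameB"], ["NameA", "NameA", "NameC"]))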

In addition to tracking quality metrics, we also quantify the impact of the algorithms on training memory by running on-device benchmarks. Comparisons were made between different parameterization ranks, different optimizers, and the full-rank baseline model.

5.1 Memory Benchmark

Figure 1: Comparison of peak training memory (in megabytes) used for Momentum and Adam optimizers with different ranks.

The low-rank gradient model saved a significant amount of memory. Figure 1 shows the training memory with low-rank gradient projection versus the baseline (full-rank) models, for both the momentum and Adam optimization methods. We adjusted the rank of the gradient projection matrix across experiments to observe the impact on memory. Figure 1 shows that, with the momentum optimizer, the low-rank model uses less memory than the full-rank model for a projection of rank 100, and any projection of a lower rank saves even more memory. Similarly, with the Adam optimizer, the modified model saves training memory for projections of rank up to 200. Additionally, the graph illustrates that the training memory increases roughly linearly with rank. Furthermore, at ranks 100 and 150, low-rank gradient projection with Adam optimization consumes about the same memory as full-rank momentum optimization.
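A rough accounting (a sketch under assumed layer sizes; the measured on-device numbers in Figure 1 also include activations and runtime overhead) shows why the optimizer memory grows linearly with the rank: the Adam slot variables are kept for $A$ ($m \times r$) and $B$ ($r \times n$) rather than for $W$ ($m \times n$).

# Hypothetical layer shape; the actual model's layer sizes are not reproduced here.
m, n = 2048, 2048
full_rank_slots = 2 * m * n                # Adam: two slot variables shaped like W
for r in (50, 100, 150, 200):
    low_rank_slots = 2 * r * (m + n)       # Adam: two slot variables each for A and B
    print("rank", r, "slot-memory ratio:", round(low_rank_slots / full_rank_slots, 3))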

5.2 Speech Recognition Performance

Figure 2: Comparison of word error rate performance for Momentum and Adam optimization with and without low-rank approximation.

The results in Figures 2 and 3 show how the speech recognition quality varies with increasing training steps for different settings. Figure 2 compares the WER for the momentum and Adam optimization methods with and without low-rank projection. Comparing the low-rank and full-rank models, Figure 2 shows that the low-rank models converge more slowly for both momentum and Adam optimization. The Adam-based models achieved better performance, indicating that we are able to take advantage of the benefits of Adam optimization by using it to update $A$ and $B$.

Figure 3: Comparison of word error rate performance for Adam optimization with different ranks.

Figure 3 shows that the word error rate decreases faster and to a lower final value as the rank of the gradient approximation is increased. Training the model using Adam with a gradient projection matrix of rank 150 reached a word error rate of 47.1%, while the baseline model converges to a word error rate of about 43.8%.

Figure 4: Comparison of name recall rate performance for Adam optimization with different ranks.

Similarly, Figure 4 shows that the name recall rate increases faster and to a higher value for the higher-rank models, as expected.

6 Summary

The experiments detailed above sought to explore an opportunity to save training memory for deep neural network models, such as those used for speech recognition. Approximating the gradient computation using low-rank parameters saves memory for ranks up to about 100 and 200 for momentum and Adam optimization, respectively. These results are promising.

To observe the impact of the new training method on the effectiveness of the model, we ran experiments on Wiki-Names, a dataset with accented speech and difficult-to-recognize names. The most important metrics from these experiments are the word error rate and the recall of the names in the dataset. We compared how models of different ranks and different optimizers trained, and how their training compared to the baseline model. For the model using the momentum optimizer, the rank did not impact training significantly, and the low-rank momentum model performed worse than the baseline momentum model. Predictably, the low-rank model with the Adam optimizer performed much better, and the rank had an observable impact on its training.

Using a low-rank approximation of the gradient computation for deep neural network models provides an opportunity to save memory without a significant increase in error rate or decrease in recall rate. This opportunity is most promising for on-device training with more advanced optimizers, like Adam, that traditionally use multiple high-dimensional parameters for gradient computation.

References