Scaling Back-propagation by Parallel Scan Algorithm

07/23/2019 ∙ by Shang Wang, et al. ∙ University of Toronto

In an era when the performance of a single compute device plateaus, software must be designed to scale on a massively parallel system for better runtime performance. However, the commonly used back-propagation (BP) algorithm imposes a strong sequential dependency in the process of gradient computation. Under model parallelism, BP has a theoretical step complexity of Θ(n), which hinders its scalability in a parallel computing environment, where n represents the number of compute devices into which a model is partitioned. In this work, we restructure such dependency and reformulate BP into a scan operation which is scaled by our modified version of the Blelloch scan algorithm. Our algorithm is able to achieve a theoretical step complexity of Θ(log n). We perform an in-depth performance analysis and identify the challenges of deploying our algorithm in a practical setting, along with a variety of approaches to tackle such challenges. We demonstrate the scalability benefits of our algorithm in the use case of retraining pruned networks.


1 Introduction

The training of deep learning models demands more and more compute resources as the models become more powerful and complex with an increasing number of layers in recent years Krizhevsky et al. (2012); Szegedy et al. (2015); Simonyan and Zisserman (2015); He et al. (2015); Huang et al. (2016). For example, ResNet can have more than a thousand layers He et al. (2016) and take days to train on state-of-the-art GPUs Coleman et al. (2017). When the performance of a single compute device plateaus Esmaeilzadeh et al. (2011), training has to be designed to scale on a massively parallel system. Model parallelism Krizhevsky (2014); Harlap et al. (2018); Huang et al. (2018) is one approach to scale training by partitioning a model and distributing its parts among multiple devices. This is the common approach when a model cannot fit into one device due to memory constraints (caused by, for example, a deep network architecture, a large batch size, or a high input resolution Rhu et al. (2016); Zhu et al. (2018)). However, training under model parallelism does not scale well with the number of devices. The fundamental reason for this scalability issue is that the back-propagation algorithm (BP) Rumelhart et al. (1988) imposes a strong sequential dependency between layers during the gradient computation, which effectively allows at most one device to be utilized at any given point in time. This dependency prevents BP from being efficiently deployed in a distributed computing environment. Therefore, we seek to address this fundamental limitation by restructuring the dependency so that BP can be scaled by parallel algorithms. In summary, we make the following major contributions:

  • We reformulate BP as a scan Blelloch (1990) operation, and modify the Blelloch scan algorithm Blelloch (1990) to efficiently scale BP in a distributed computing environment. Our method has a theoretical step complexity of Θ(log n), where n represents the number of devices into which a model is partitioned, compared to Θ(n) for the naïve implementation of model parallelism.

  • We perform an in-depth runtime analysis of our algorithm, identify the challenges in applying it in a practical setting, and propose a new and efficient approach to tackle these challenges using sparse Jacobian matrices. We then develop routines to efficiently generate transposed Jacobians in sparse format for various operators.

  • By projecting the performance of our method, we demonstrate that the retraining of pruned networks Han et al. (2015); See et al. (2016); He et al. (2017) is a practical use case where our method can achieve better scaling compared to the original BP.

2 Background and Prior Work

To increase the utilization of hardware resources under model parallelism, prior works (e.g., PipeDream Harlap et al. (2018) and GPipe Huang et al. (2018)) propose to pipeline the computation of the forward and backward passes across devices. However, these solutions are not “silver bullets” for scalability, for the following reasons. First, both PipeDream Harlap et al. (2018) and GPipe Huang et al. (2018) require storing multiple versions of the weights and activations for all mini-batches that enter the pipeline. Thus, the memory consumption grows linearly with the length of the pipeline. As a result, the maximum number of devices that can be supported is limited by the memory capacity of a single device (such as the GPU global memory). Second, if the parameter updates are not fully synchronized, as proposed in PipeDream Harlap et al. (2018), staleness is introduced. Although Harlap et al. argue that the staleness produced by their method does not affect the update step of a vanilla SGD optimizer Harlap et al. (2018), such an argument no longer holds when other techniques commonly used in first-order optimizers (such as the momentum in Adam Kingma and Ba (2015)) are combined. Otherwise, if the gradient updates are fully synchronized, as proposed in GPipe Huang et al. (2018), the “bubble of idleness” between the forward and backward passes increases linearly with the length of the pipeline, thus linearly reducing the hardware utilization and defeating the original purpose of pipelining.

Our approach fundamentally differs from these key prior works Harlap et al. (2018); Huang et al. (2018) in the following ways. First, instead of following the dependency of BP, we reformulate BP so that scaling is achieved via the Blelloch scan algorithm Blelloch (1990), which is designed for parallelism. Second, the original BP is reconstructed exactly, so that estimation errors such as staleness do not exist; therefore, our method is agnostic to the exact first-order optimizer being used. Third, our approach becomes more advantageous as the number of devices increases, instead of facing diminishing returns or hitting scalability limits.

3 Problem Formulation and Proposed Method

Figure 1: A visualization of the formulation in Section 3 on convolutional neural networks. Different parts of the model can be distributed to different devices (workers).

We conceptualize a model as a vector function f composed of sub-functions f_i:

x_i = f_i(x_{i−1}; θ_i),    i ∈ {1, ..., n}

where θ_i are the parameters of the model. The model is evaluated by an objective function ℓ(x_n), where x_0 is the initial input to the model. This formulation on convolutional neural networks can be visualized in Figure 1.

To train the model f, a first-order optimizer requires the gradients ∇_{θ_i}ℓ, which are derived from the gradients ∇_{x_i}ℓ:

∇_{θ_i}ℓ = (∂x_i/∂θ_i)^T ∇_{x_i}ℓ        (1)

where ∂x_i/∂θ_i is the Jacobian matrix of the output of f_i with respect to its parameters θ_i. To derive ∇_{x_{i−1}}ℓ given ∇_{x_i}ℓ, BP Rumelhart et al. (1988) solves the following recursive equation, from i = n down to i = 1:

∇_{x_{i−1}}ℓ = (∂x_i/∂x_{i−1})^T ∇_{x_i}ℓ        (2)

where ∂x_i/∂x_{i−1} is the Jacobian matrix of the output of f_i with respect to its input x_{i−1}. Equation 1 itself does not have a dependency along i; therefore, the computation of ∇_{θ_i}ℓ can be parallelized if the ∇_{x_i}ℓ are available. However, Equation 2 imposes a strong sequential dependency along i, where the computation of ∇_{x_{i−1}}ℓ cannot begin until the computation of ∇_{x_i}ℓ finishes, and therefore hinders scalability when multiple workers (as an abstraction of devices) are available.
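To make Equations 1 and 2 concrete, the following minimal NumPy sketch (an illustration only, not our implementation) traces the recursion on a toy linear model x_i = W_i x_{i−1}, whose Jacobians ∂x_i/∂x_{i−1} are simply the weight matrices W_i; all names (dims, W, grad_x) are illustrative.

# A minimal NumPy sketch of Equations 1 and 2 on a toy linear model (for illustration only).
import numpy as np

rng = np.random.default_rng(0)
dims = [8, 6, 5, 3]                                    # toy sizes of x_0 .. x_3 (n = 3)
W = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(3)]

# Forward pass: x_i = W_i x_{i-1}; objective l = 0.5 * ||x_3||^2, so grad_{x_3} = x_3.
x = [rng.standard_normal(dims[0])]
for Wi in W:
    x.append(Wi @ x[-1])
grad_x = {3: x[3]}

# Equation 2: the sequential recursion grad_{x_{i-1}} = (dx_i/dx_{i-1})^T grad_{x_i}.
for i in range(3, 0, -1):
    grad_x[i - 1] = W[i - 1].T @ grad_x[i]

# Equation 1: grad_{theta_i} = (dx_i/dtheta_i)^T grad_{x_i}; for this toy model it
# arranges into dl/dW_i = grad_{x_i} x_{i-1}^T.
grad_W = [np.outer(grad_x[i], x[i - 1]) for i in range(1, 4)]

# Sanity check: grad_{x_0} equals the directly composed gradient (W_3 W_2 W_1)^T x_3.
assert np.allclose(grad_x[0], (W[2] @ W[1] @ W[0]).T @ x[3])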

3.1 Back-propagation as a Scan Operation

We define a binary, associative and non-commutative operator ⋄:

A ⋄ B = B A

whose identity value is the identity matrix I, where A can be either a matrix or a vector and B is a matrix. Using the operator ⋄, we can reformulate Equation 2 as calculating the following array:

[∇_{x_n}ℓ, ∇_{x_{n−1}}ℓ, ..., ∇_{x_1}ℓ, ∇_{x_0}ℓ]        (3)

Based on the definition proposed by Blelloch Blelloch (1990), Equation 3 can be interpreted as a scan (all-prefix-sums) operation of ⋄ on the input array:

[∇_{x_n}ℓ, (∂x_n/∂x_{n−1})^T, (∂x_{n−1}/∂x_{n−2})^T, ..., (∂x_1/∂x_0)^T]        (4)
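Continuing the toy sketch above (and reusing its np, W, x and grad_x), the snippet below expresses the operator ⋄ as a helper op(a, b) = b·a and checks that a prefix scan over the input array of Equation 4 reproduces the gradients of Equation 3; this is only a sequential illustration of the reformulation.

from itertools import accumulate

def op(a, b):
    # The operator of Section 3.1: a ⋄ b := b·a (binary, associative, non-commutative).
    return b @ a

# Input array of Equation 4: [grad_{x_3}, (dx_3/dx_2)^T, (dx_2/dx_1)^T, (dx_1/dx_0)^T].
inputs = [x[3]] + [W[i].T for i in (2, 1, 0)]

# Its prefix scan is the array of Equation 3: [grad_{x_3}, grad_{x_2}, grad_{x_1}, grad_{x_0}].
prefixes = list(accumulate(inputs, op))
for i, g in zip((3, 2, 1, 0), prefixes):
    assert np.allclose(g, grad_x[i])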

3.2 Scaling Back-propagation with the Blelloch Scan Algorithm

1: Input: the array a of Equation 4, padded with (symbolic) identity matrices I so that its length is 2^k
2: Output: the gradients of Equation 3, needed for Equation 1; computed in-place in a
3: for d ← 0 to k − 1 do    ▷ Up-sweep Phase
4:       for all i ← 0 to 2^k − 1 by 2^(d+1) do in parallel
5:             l ← i + 2^d − 1;  r ← i + 2^(d+1) − 1
6:             a[r] ← a[l] ⋄ a[r]
7:       end for
8: end for
9: a[2^k − 1] ← I
10: for d ← k − 1 down to 0 do    ▷ Down-sweep Phase
11:       for all i ← 0 to 2^k − 1 by 2^(d+1) do in parallel
12:             l ← i + 2^d − 1;  r ← i + 2^(d+1) − 1
13:             t ← a[l];  a[l] ← a[r]
14:             a[r] ← a[r] ⋄ t
15:             ▷ Modification: the left and right operands of ⋄ on line 14 are reversed; the original Blelloch scan computes t ⋄ a[r] here.
16:       end for
17: end for
Algorithm 1 Modified Blelloch Scan Algorithm

Figure 2: Applying our algorithm to the convolution layers of VGG-11 Simonyan and Zisserman (2015). Blue, orange, and green squares represent transposed Jacobian matrices, gradient vectors, and symbolic identity matrices respectively. Blue solid lines, orange solid lines, and yellow dashed lines represent matrix-matrix multiplications, matrix-vector multiplications, and logical data movements (that do not have to be performed explicitly) respectively. All of the inputs, outputs and intermediate results are marked with tags that correspond to the plots in Section 4.3.

We can parallelize the computation of Equation 3 on multiple workers with the Blelloch scan algorithm Blelloch (1990), formally described in Algorithm 1. The algorithm contains two phases: up-sweep and down-sweep. Figure 2 visualizes this algorithm applied to the convolution layers of VGG-11 Simonyan and Zisserman (2015), with levels L0-L4 as the up-sweep and levels L5-L10 as the down-sweep. Only the up-sweep phase contains matrix-matrix multiplications. Due to the non-commutative property of the operator ⋄, we have to reverse the operands of ⋄ during the down-sweep phase. This modification is reflected on line 14 of Algorithm 1 and visualized in Figure 4.
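The following single-process Python sketch of Algorithm 1 illustrates its control flow only (it is not a distributed implementation) and reuses op, inputs and grad_x from the previous snippets; the identity I is kept symbolic (None), mirroring the symbolic identity matrices of Figure 2, and padding the input to a power-of-two length is our assumption.

import math

def apply_op(a, b):
    # a ⋄ b with a symbolic identity: I ⋄ b = b and a ⋄ I = a.
    if a is None:
        return b
    if b is None:
        return a
    return op(a, b)

def modified_blelloch_scan(array):
    # Pad with symbolic identities to the smallest power of two that leaves one extra slot.
    size = 1 << math.ceil(math.log2(len(array) + 1))
    a = list(array) + [None] * (size - len(array))
    for d in range(int(math.log2(size))):                  # up-sweep phase
        for i in range(0, size, 2 ** (d + 1)):             # "for all ... in parallel" on real hardware
            l, r = i + 2 ** d - 1, i + 2 ** (d + 1) - 1
            a[r] = apply_op(a[l], a[r])
    a[size - 1] = None                                     # identity at the root
    for d in reversed(range(int(math.log2(size)))):        # down-sweep phase
        for i in range(0, size, 2 ** (d + 1)):
            l, r = i + 2 ** d - 1, i + 2 ** (d + 1) - 1
            t = a[l]
            a[l] = a[r]
            a[r] = apply_op(a[r], t)                       # modification: operands of ⋄ reversed
    return a                                               # a[k] = array[0] ⋄ ... ⋄ array[k-1]

out = modified_blelloch_scan(inputs)                       # exclusive scan of Equation 4's array
for k, i in enumerate((3, 2, 1, 0), start=1):              # out[1..4] = grad_{x_3} .. grad_{x_0}
    assert np.allclose(out[k], grad_x[i])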

3.3 Jacobian Matrices in Sparse Format

A full Jacobian matrix of an operator can be too expensive to generate, store, and process. In fact, the Jacobian matrix of the first convolution operator in VGG-11 Simonyan and Zisserman (2015) processing a 32×32 image can occupy 768 MB of memory if it is stored as a full matrix, which is prohibitively large. Fortunately, the Jacobian matrices of certain operators (such as convolution, ReLU, and max-pooling) can be very sparse, as shown in Figure 5. In our implementation, the transposed Jacobian matrices are represented in the Compressed Sparse Row (CSR) Saad (2003) format due to the existing support for CSR matrix multiplication routines in sparse matrix libraries such as cuSPARSE Corporation (2018a). Representing the data contained in the same Jacobian of the aforementioned convolution operator in CSR shrinks the memory consumption down to only 6.5 MB. Two types of zeros appear in an operator’s Jacobian: guaranteed zeros that are input-invariant (e.g., zeros outside of the diagonal of the ReLU’s Jacobian), and possible zeros that are input-dependent (e.g., zeros on the diagonal of the ReLU’s Jacobian). For any operator, the distribution pattern of guaranteed zeros (named the sparsity pattern for brevity) in the Jacobian is determined by the model architecture and known ahead of time. Thus, this information can be used to build more efficient sparse matrix multiplication routines than the generic ones in conventional sparse matrix libraries (e.g., cuSPARSE). The sparsity of guaranteed zeros (defined as their fraction over all elements in a matrix) for various operators is listed in Table 1.

Figure 3: Up-sweep phase: a[r] ← a[l] ⋄ a[r].
Figure 4: Down-sweep phase: t ← a[l]; a[l] ← a[r]; a[r] ← a[r] ⋄ t.
(a) Convolution. (b) Max-pooling. (c) ReLU.
Figure 5: Transposed Jacobian for various operators. Yellow dots represent locations of non-zero elements in the matrix.
Operator        Example sparsity of guaranteed zeros
Convolution     0.99157
ReLU            0.99998
Max-pooling     0.99994

  • The examples are the sparsity of the first convolution, ReLU and max-pooling operators of VGG-11 Simonyan and Zisserman (2015) operating on 32×32 images.

  • The convolution entry is an approximation that holds when the input height and width are much greater than the padding size.

Table 1: The sparsity of guaranteed zeros for various operators.

3.4 Generating Jacobian Matrix in CSR Analytically

To avoid impractically inefficient methods of generating the Jacobian of an operator one column at a time (either numerically or via automatic differentiation Paszke et al. (2017)), we develop analytical routines to generate the transposed Jacobian directly into the CSR format.

Convolution

For the transposed Jacobian of a convolution operator with a 3 × 3 filter and a padding size of 1 on all sides of the height and width, our methods of generating the CSR indptr, indices and data arrays Varoquaux et al. ([n. d.]) are formally described in Algorithm 2, Algorithm 3 and Algorithm 4 respectively.

1:input channels , output channels , input height , input width
2:
3:for all to do in parallel
4:       
5:       
6:       if  then
7:             
8:       else if  then
9:             
10:       else
11:             
12:       end if
13:end for
Algorithm 2 Compute the CSR indptr array for the transposed Jacobian of a convolution.
1:input channels , output channels , input height , input width , indptr computed from Algorithm 2
2:
3:for all to do in parallel
4:       
5:       
6:       for all to do in parallel
7:             for all to do in parallel
8:                    
9:             end for
10:       end for
11:       if  or  then
12:             
13:              if ; otherwise
14:             for all to do in parallel
15:                    
16:             end for
17:       else
18:             
19:       end if
20:       
21:end for
Algorithm 3 Compute the CSR indices array for the transposed Jacobian of a convolution.
1:input channels , output channels , input height , input width , filter weights, indptr computed from Algorithm 2
2:
3:for all to do in parallel
4:       
5:       
6:        if ; if ; otherwise
7:       
8:       Fix corner cases when or .
9:end for
Algorithm 4 Compute the CSR data array for the transposed Jacobian of a convolution.
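As a simple reference (not a reproduction of Algorithms 2-4 and not optimized), the sketch below builds the transposed Jacobian of a stride-1, “same”-padded convolution in sparse form by enumerating output positions and filter taps into COO triplets and converting to CSR with SciPy; all names and sizes are illustrative assumptions.

import numpy as np
from scipy.sparse import coo_matrix

def conv2d_transposed_jacobian_csr(weight, h, w, pad=1):
    # weight: (c_out, c_in, k, k) filter; stride 1 with "same" padding, so the output
    # spatial size equals the input size h x w (e.g., a 3x3 filter with pad=1).
    c_out, c_in, k, _ = weight.shape
    rows, cols, vals = [], [], []
    for co in range(c_out):
        for i in range(h):
            for j in range(w):
                out_idx = (co * h + i) * w + j                       # flattened output index
                for ci in range(c_in):
                    for di in range(k):
                        for dj in range(k):
                            ii, jj = i + di - pad, j + dj - pad      # corresponding input position
                            if 0 <= ii < h and 0 <= jj < w:
                                rows.append((ci * h + ii) * w + jj)  # transposed: input index as row
                                cols.append(out_idx)
                                vals.append(weight[co, ci, di, dj])
    jt = coo_matrix((vals, (rows, cols)), shape=(c_in * h * w, c_out * h * w))
    return jt.tocsr()

weights = np.random.randn(4, 3, 3, 3).astype(np.float32)             # toy 3-to-4 channel, 3x3 filter
jt = conv2d_transposed_jacobian_csr(weights, h=8, w=8)
print(jt.shape, jt.nnz)                                              # the sparsity pattern depends only on the shapes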

ReLU

Our methods of generating the CSR indptr, indices and data arrays Varoquaux et al. ([n. d.]) for the transposed Jacobian of a ReLU operator are formally described in Algorithm 5, Algorithm 6 and Algorithm 7 respectively.

1: Input: size m of the (flattened) input tensor
2: Output: the CSR indptr array of length m + 1
3: for all i ← 0 to m do in parallel
4:       indptr[i] ← i
5: end for
Algorithm 5 Compute the CSR indptr array for the transposed Jacobian of ReLU.
1: Input: size m of the (flattened) input tensor
2: Output: the CSR indices array of length m
3: for all i ← 0 to m − 1 do in parallel
4:       indices[i] ← i
5: end for
Algorithm 6 Compute the CSR indices array for the transposed Jacobian of ReLU.
1: Input: the (flattened) input tensor x, and its size m
2: Output: the CSR data array of length m
3: for all i ← 0 to m − 1 do in parallel
4:       if x[i] > 0 then
5:             data[i] ← 1
6:       else
7:             data[i] ← 0
8:       end if
9: end for
Algorithm 7 Compute the CSR data array for the transposed Jacobian of ReLU.
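A compact SciPy equivalent of Algorithms 5-7 is sketched below (an illustration, not the GPU routines used in our prototype); the possible zeros on the diagonal are stored explicitly so that indptr and indices remain input-invariant.

import numpy as np
from scipy.sparse import csr_matrix

def relu_transposed_jacobian_csr(x):
    # The transposed Jacobian of ReLU is diagonal; only the data array depends on the input.
    x = x.reshape(-1)                        # flattened input tensor of size m
    m = x.size
    indptr = np.arange(m + 1)                # Algorithm 5: indptr[i] = i
    indices = np.arange(m)                   # Algorithm 6: indices[i] = i
    data = (x > 0).astype(x.dtype)           # Algorithm 7: 1 where x[i] > 0, else an explicit (possible) zero
    return csr_matrix((data, indices, indptr), shape=(m, m))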

Max-pooling

Assuming the stride size and the window size are the same, and we can access a tensor (named as

pool_indices for brevity) which specifies the indices of the elements in the input tensor that are “pooled” for the output tensor (documented in pyt ([n. d.])), our methods of generating the CSR indptr, indices and data arrays Varoquaux et al. ([n. d.]) are formally described in Algorithm 8, Algorithm 9 and Algorithm 10 respectively.

1: Input: pool_indices, input height h, input width w, output height h_out, output width w_out, output channels c
2: Output: the CSR indptr array of length c·h·w + 1, and an auxiliary array mapping of length c·h·w
3: for all i ← 0 to c·h·w − 1 do in parallel
4:       mapping[i] ← −1
5: end for
6: for all k ← 0 to c − 1 do in parallel
7:       for all i ← 0 to h_out − 1 do in parallel
8:             for all j ← 0 to w_out − 1 do in parallel
9:                    q ← (k·h_out + i)·w_out + j                  ▷ flattened output index
10:                    p ← k·h·w + pool_indices[k, i, j]           ▷ flattened input index of the pooled element
11:                    mapping[p] ← q
12:             end for
13:       end for
14: end for
15: indptr[0] ← 0
16: for i ← 1 to c·h·w do
17:       indptr[i] ← indptr[i − 1]
18:       if mapping[i − 1] ≠ −1 then
19:             indptr[i] ← indptr[i] + 1
20:       end if
21: end for
22: return indptr, mapping
Algorithm 8 Compute the CSR indptr array for the transposed Jacobian of max-pooling.
1: Input: mapping computed from Algorithm 8, input height h, input width w, output height h_out, output width w_out, output channels c
2: Output: the CSR indices array (one entry per non-zero)
3: j ← 0
4: for i ← 0 to c·h·w − 1 do
5:       if mapping[i] = −1 then
6:             continue
7:       end if
8:       indices[j] ← mapping[i]
9:       j ← j + 1
10: end for
Algorithm 9 Compute the CSR indices array for the transposed Jacobian of max-pooling.
1: Input: output channels c, output height h_out, output width w_out
2: Output: the CSR data array of length c·h_out·w_out
3: for all i ← 0 to c·h_out·w_out − 1 do in parallel
4:       data[i] ← 1
5: end for
Algorithm 10 Compute the CSR data array for the transposed Jacobian of max-pooling.
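The sketch below builds the same transposed Jacobian with SciPy (an illustration, not the GPU routines used in our prototype), assuming pool_indices follows the PyTorch convention of per-channel flattened spatial indices, as returned by max_pool2d(..., return_indices=True).

import numpy as np
from scipy.sparse import coo_matrix

def maxpool_transposed_jacobian_csr(pool_indices, in_h, in_w):
    # pool_indices: (c, out_h, out_w) integer array; returns a (c*in_h*in_w, c*out_h*out_w) CSR matrix.
    c, out_h, out_w = pool_indices.shape
    out_size = c * out_h * out_w
    cols = np.arange(out_size)                               # one non-zero per output element (column)
    chan = np.repeat(np.arange(c), out_h * out_w)            # channel of each flattened output element
    rows = chan * (in_h * in_w) + pool_indices.reshape(-1)   # global flattened index of the pooled input
    data = np.ones(out_size, dtype=np.float32)               # every stored non-zero is 1 (Algorithm 10)
    return coo_matrix((data, (rows, cols)), shape=(c * in_h * in_w, out_size)).tocsr()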

4 Evaluation

We evaluate our method against two baselines: (1) the stronger baseline of PyTorch Autograd Paszke et al. (2017), a widely adopted implementation of BP; and (2) the weaker baseline of a linear scan, which emulates BP by explicitly multiplying the transposed Jacobian (stored in CSR) with the gradient (as shown in Equation 2) via sparse matrix multiplication routines.

4.1 Convergence

Theoretically, our algorithm is a reconstruction of BP rather than an estimation, and is hence expected to reproduce exactly the same outputs. However, in practice, numerical differences could be introduced due to the change in the order of matrix multiplications. We apply our algorithm to train LeNet-5 Lecun et al. (1998) on CIFAR-10 Krizhevsky (2009) to demonstrate that such numerical differences do not affect model convergence. We use a mini-batch size of 256 and the SGD Qian (1999) optimizer with a learning rate of 0.001 and a momentum of 0.9. The result plotted in Figure 6 shows that our algorithm has negligible impact on convergence compared to the original BP.

(a) Training loss per iteration. (b) Test loss per iteration.
Figure 6: Training and test loss per iteration (sampled at every 30th iteration for clarity) for training LeNet-5 Lecun et al. (1998) on CIFAR-10 Krizhevsky (2009). The orange solid lines represent training via the PyTorch Autograd Paszke et al. (2017) baseline, while the blue dashed lines represent training via our method. Both runs are seeded with the same value. The orange lines overlap with the blue lines in both plots.

4.2 Complexity Analysis

We leverage the following definitions to quantify the complexity of a parallel algorithm: (1) step complexity (S), which evaluates the number of steps on the critical path; (2) per-step complexity (C), which evaluates the runtime of a single step; and (3) work complexity (W), which evaluates the total number of steps executed by all workers. Assuming the system can be conceptualized as a parallel random-access machine (PRAM) Kruskal et al. (1990) with a sufficiently large number of workers, and that the size of the input array in Equation 4 is n + 1, the step and work complexity of our algorithm can be derived as:

S_ours = Θ(log n)        (5)
W_ours = Θ(n)            (6)

compared to S = W = Θ(n) for the linear scan (which has the same step and work complexity as BP). Therefore, in an ideal scenario with an unbounded number of workers and unit per-step complexity, our algorithm reduces the latency from Θ(n) to Θ(log n). If, however, we account for the difference in per-step complexity between our algorithm (C_ours, dominated by matrix-matrix multiplications) and the baseline (C_BP, dominated by matrix-vector multiplications), our algorithm has a latency of Θ(log n) · C_ours compared to Θ(n) · C_BP for the baseline. There are two approaches to make our algorithm achieve a lower latency and better scaling than the baseline. First, we can reduce C_ours, which is reflected in leveraging the CSR sparse matrix multiplication routines analyzed in Section 4.3. Second, even without a lower C_ours, our algorithm can still outperform the baseline whenever n/log n exceeds the ratio C_ours/C_BP; this can occur when n grows larger than the dimensions of the Jacobian matrices. Possible use cases include computing the inverse kinematics Marschner and Shirley (2016) of a long skeleton (such as simulating the movement of a snake) found in 3D animation.
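As a back-of-the-envelope illustration (using the idealized latency model above with hypothetical per-step cost ratios C_ours/C_BP), the snippet below compares the two latencies for a range of n.

import math

def latency_ratio(n, c_ratio):
    # Idealized model: ours ~ 2*log2(n) * C_ours versus baseline ~ n * C_BP, with c_ratio = C_ours / C_BP.
    return (2 * math.log2(n) * c_ratio) / n

for n in (64, 256, 1024, 4096):
    for c_ratio in (1, 10, 100):                    # hypothetical per-step cost ratios
        print(f"n={n:5d}  C_ours/C_BP={c_ratio:4d}  ours/baseline={latency_ratio(n, c_ratio):.2f}")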

4.3 Microbenchmark Analysis

We prototype our algorithm with cuSPARSE Corporation (2018a) on an RTX 2070 GPU (Turing architecture) to perform sparse matrix-matrix multiplications (csrgemm), and apply it to the convolution layers of VGG-11 Simonyan and Zisserman (2015) (visualized in Figure 2) for training on CIFAR-10 Krizhevsky (2009). The latency of every step (named the per-step latency for brevity) is derived by averaging the measurements via CUDA Events Corporation ([n. d.]) from 50 trials after 5 warm-up runs. Each trial performs the backward pass for a single iteration of a single image sample. The goal is to reduce the per-step latency of our method below the maximum per-step latency of the baselines, which is measured under the same settings. Figure 7(a) shows the corresponding result. We can observe that many steps of sparse matrix-matrix multiplications meet the weaker baseline of the linear scan; however, a few have significantly higher per-step latencies.

Since the sparsity pattern can be determined ahead of time from the model architecture, the values in the CSR indptr and indices arrays Varoquaux et al. ([n. d.]) stay the same across batches and iterations. Therefore, the process of joining the indptr and indices arrays in a generic csrgemm routine Bank and Douglas (1993) could be omitted, and functions that transform the input data arrays to the output data array for each step can be generated and compiled a priori before the training begins, potentially reaching better performance than cuSPARSE Corporation (2018a). Considering the possibility of such a better implementation, we project the lower bound of the per-step latency by measuring the number of floating-point operations (FLOP) performed on the CSR data arrays for each step and dividing it by the theoretical FP32 throughput of the RTX 2070 Corporation (2018b). The result is shown in Figure 7(b), from which we can observe that most steps meet the stronger baseline of PyTorch Autograd Paszke et al. (2017); however, three matrix-matrix multiplications have latencies significantly higher than the others. For every matrix-matrix multiplication, the number of non-zero elements (normalized by the size of the matrix) of the left and right operand matrices is plotted in Figure 8. We can observe that the sparsity helps reduce the per-step latency as expected, and that the three performance anomalies are caused by the loss of sparsity in the operands.

From Algorithm 4, we observe that the values in the Jacobian of a convolution operator only depend on the filter weights. Thus, pruning the weights can lead to a higher sparsity in the Jacobian, which would reduce the per-step latency of the sparse matrix-matrix multiplications. Nevertheless, despite recent advances in network pruning algorithms Han et al. (2015); See et al. (2016); He et al. (2017), there is no widely adopted software or hardware platform that can capitalize on pruning, as most techniques are evaluated through “masking simulation”, which leads to the same (if not worse, due to masking) latency and memory usage. Therefore, we identify the retraining of pruned networks as a strong use case that can benefit from our algorithm. We demonstrate such impact through an experiment on VGG-11 Simonyan and Zisserman (2015): training on CIFAR-10 Krizhevsky (2009); pruning away a large fraction of the weights in all convolution and linear operators using the technique proposed by See et al. (2016); and retraining the pruned network. We choose the pruning percentage so that a similar validation accuracy can be reached after retraining for the same number of epochs as training. The lower-bound projection on the per-step latency is derived via the same method as above. Based on the result shown in Figure 7(c), we can observe that all of the steps reach (with one anomaly that comes close to) the stronger baseline of PyTorch Autograd Paszke et al. (2017).

(a) Measuring on a full VGG-11 Simonyan and Zisserman (2015). (b) Projecting on a full VGG-11. (c) Projecting on a pruned VGG-11.
Figure 7: Measuring or projecting the latency for each step. The orange and blue circles represent matrix-vector and matrix-matrix multiplications respectively. A filled circle indicates that the step is on the critical path. The red and blue dashed lines represent the maximum per-step latency of the linear scan and PyTorch Autograd Paszke et al. (2017) baselines respectively. The x-axis represents the theoretical complexity of the matrix-matrix or matrix-vector multiplications of the step if the transposed Jacobians were full matrices. Each circle is annotated with the tag of the product corresponding to Figure 2.
Figure 8: The number of non-zeros in the (left, right) operands for all matrix-matrix multiplications and the corresponding projected latencies.
(a) The full model. (b) The pruned model.
Figure 9: Sparsity of all transposed Jacobians (inputs and intermediate results) in the up-sweep phase. Filled circles represent the transposed Jacobians of convolutions. The x-axis represents the matrix tags corresponding to Figure 2.

5 Discussion

The runtime overhead added to the forward pass by generating the transposed Jacobian in CSR is not a primary concern because: for input-dependent Jacobians (such as those of ReLU and max-pooling), the number of non-zero elements is at most the input size, which leads to extremely high sparsity and negligible overhead; and for input-invariant Jacobians (such as those of convolution), the generation can be performed at any time during an iteration and thus removed from the critical path, and the runtime can be further amortized by a larger batch size.

The sparsity loss in the transposed Jacobian matrices comes from two possible sources. First, based on Table 1, when the ratio of the filter size to the input spatial size (h × w) increases for convolutions, the sparsity of the Jacobian reduces. This can occur when the activation passes through max-pooling, where the height and width are reduced by the stride factor (halved in the case of VGG-11 Simonyan and Zisserman (2015)). This phenomenon is demonstrated in Figure 9(a) with the same experiment as in Section 4.3. This issue is less limiting when retraining pruned networks (Figure 9(b)). Second, after the multiplication of two sparse matrices, the product matrix can have more non-zero elements than the operand matrices. Therefore, the sparsity can reduce as the up-sweep phase progresses to deeper levels. This can be observed in Figure 9(b) by comparing the later levels with the previous ones. This challenge can be mitigated by balancing the number of levels that the up-sweep phase progresses through against the sparsity of the product matrices at each level.
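The second effect can be reproduced with synthetic random matrices (not actual VGG-11 Jacobians) in a few lines of SciPy.

import scipy.sparse as sp

a = sp.random(4096, 4096, density=0.01, format="csr", random_state=0)
b = sp.random(4096, 4096, density=0.01, format="csr", random_state=1)
p = b @ a                                           # one application of the operator a ⋄ b = b·a

for name, m in (("a", a), ("b", b), ("b·a", p)):
    print(f"{name}: nnz={m.nnz}, density={m.nnz / (m.shape[0] * m.shape[1]):.4f}")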

Another approach to reducing the per-step complexity of sparse matrix-matrix multiplication is to use Monte Carlo Approximate Matrix Multiplication Drineas et al. (2006). However, a trade-off between the convergence rate and the per-step latency has to be made, as a lower per-step latency comes with an inaccurate matrix product, which leads to noticeable errors in the resulting gradients.

6 Conclusion

This paper explores a new direction to scale BP. We reformulate BP into a scan operation, which we scale with our modified version of the Blelloch scan algorithm. While our method achieves superior scalability in terms of step complexity, its per-step complexity increases due to multiplications of transposed Jacobian matrices during the up-sweep phase, as opposed to only multiplications of transposed Jacobians with gradient vectors in the baseline. We develop methods to efficiently generate the transposed Jacobians of various operators directly in CSR, and we leverage the CSR matrix multiplication routine (csrgemm) to reduce the per-step complexity. While csrgemm is an accessible tool for prototyping, we believe a more efficient implementation can be achieved since the sparsity pattern of the Jacobian is static and known ahead of time. We demonstrate that the retraining of pruned networks can potentially benefit from our algorithm. Future work includes investigating custom sparse matrix formats so that the per-step complexity can be reduced even further. We hope that our work will inspire radically new ideas and designs to improve distributed DNN training.

References