The training of deep learning models demands ever more compute resources as models have become more powerful and complex, with an increasing number of layers in recent years Krizhevsky et al. (2012); Szegedy et al. (2015); Simonyan and Zisserman (2015); He et al. (2015); Huang et al. (2016). For example, ResNet can have more than a thousand layers He et al. (2016) and take days to train on state-of-the-art GPUs Coleman et al. (2017). As the performance of a single compute device plateaus Esmaeilzadeh et al. (2011), training has to be designed to scale on massively parallel systems. Model parallelism Krizhevsky (2014); Harlap et al. (2018); Huang et al. (2018) is one approach to scale training by partitioning a model and distributing its parts among multiple devices. This is the common approach when a model cannot fit into one device due to memory constraints (caused by, for example, a deep network architecture, a large batch size, or a high input resolution Rhu et al. (2016); Zhu et al. (2018)). However, training under model parallelism does not scale well with the number of devices. The fundamental reason for this scalability issue is that the back-propagation algorithm (BP) Rumelhart et al. (1988) imposes a strong sequential dependency between layers during the gradient computation, which effectively allows at most one device to be utilized at any given point in time. This dependency prevents BP from being efficiently deployed into a distributed computing environment. Therefore, we seek to address this fundamental limitation by restructuring the dependency so that BP can be scaled by parallel algorithms. In summary, we make the following major contributions:
We reformulate BP as a scan Blelloch (1990) operation, and modify the Blelloch scan algorithm Blelloch (1990) to efficiently scale BP in a distributed computing environment. Our method has a theoretical step complexity of $\Theta(\log n)$, where $n$ represents the number of devices into which a model is partitioned, compared to $\Theta(n)$ for the naïve implementation of model parallelism.
We perform an in-depth runtime analysis of our algorithm, identify the challenges in applying it in a practical setting, and propose a new and efficient approach to tackle these challenges using sparse Jacobian matrices. We then develop routines to efficiently generate transposed Jacobians in sparse format for various operators.
2 Background and Prior Work
To increase the utilization of hardware resources with model parallelism, prior works (e.g., PipeDream Harlap et al. (2018) and GPipe Huang et al. (2018)) propose to pipeline the computation of the forward and backward passes across devices. However, these solutions are not “silver bullets” to scalability for the following reasons. First, both PipeDream Harlap et al. (2018) and GPipe Huang et al. (2018) require storing multiple versions of weights and activations for all mini-batches that enter the pipeline. Thus, the memory consumption grows linearly with the length of the pipeline. As a result, the maximum number of devices that can be supported is limited by the memory capacity of a single device (such as the GPU global memory). Second, if the parameter updates are not fully synchronized, as proposed in PipeDream Harlap et al. (2018), staleness is introduced. Although Harlap et al. argue that the staleness produced by their method does not affect the update step for a vanilla SGD optimizer Harlap et al. (2018), this argument breaks down when combined with other techniques commonly used in first-order optimizers (such as the momentum term in Adam Kingma and Ba (2015)). Otherwise, if the gradient updates are fully synchronized, as proposed in GPipe Huang et al. (2018), the “bubble of idleness” between the forward and backward passes grows linearly with the length of the pipeline, thus linearly reducing hardware utilization and defeating the original purpose of pipelining.
Our approach fundamentally differs from these key prior works Harlap et al. (2018); Huang et al. (2018) in the following ways. First, instead of following the dependency of BP, we reformulate BP so that scaling is achieved via the Blelloch scan algorithm Blelloch (1990), which is designed for parallelism. Second, the original BP is reconstructed exactly, so estimation errors such as staleness do not exist; therefore, our method is agnostic to the exact first-order optimizer being used. Third, our approach becomes more advantageous as the number of devices increases, rather than suffering diminishing returns or hitting scalability limits.
3 Problem Formulation and Proposed Method
We conceptualize a model as a vector function $f$ composed of $n$ sub-functions $f^{(i)}$:

$$x^{(i)} = f^{(i)}(x^{(i-1)}; \theta^{(i)}), \quad i = 1, \ldots, n$$

where $\theta^{(i)}$ are the parameters of the model. The model is evaluated by an objective function $\ell(x^{(n)})$, where $x^{(0)}$ is the initial input to the model. This formulation on convolutional neural networks can be visualized in Figure 1.
To train the model $f$, a first-order optimizer requires the gradients $\nabla_{\theta^{(i)}} \ell$, which are derived from the gradients $\nabla_{x^{(i)}} \ell$:

$$\nabla_{\theta^{(i)}} \ell = \left(\frac{\partial x^{(i)}}{\partial \theta^{(i)}}\right)^{T} \nabla_{x^{(i)}} \ell \quad (1)$$

where $\frac{\partial x^{(i)}}{\partial \theta^{(i)}}$ is the Jacobian matrix of the output of $f^{(i)}$ with respect to its parameters $\theta^{(i)}$. To derive $\nabla_{x^{(i)}} \ell$ given $\nabla_{x^{(n)}} \ell$, BP Rumelhart et al. (1988) solves the following recursive equation, from $i = n$ down to $i = 1$:

$$\nabla_{x^{(i-1)}} \ell = \left(\frac{\partial x^{(i)}}{\partial x^{(i-1)}}\right)^{T} \nabla_{x^{(i)}} \ell \quad (2)$$

where $\frac{\partial x^{(i)}}{\partial x^{(i-1)}}$ is the Jacobian matrix of the output of $f^{(i)}$ with respect to its input $x^{(i-1)}$. Equation 1 itself has no dependency along $i$; therefore, the computation of $\nabla_{\theta^{(i)}} \ell$ can be parallelized if all $\nabla_{x^{(i)}} \ell$ are available. However, Equation 2 imposes a strong sequential dependency along $i$: the computation of $\nabla_{x^{(i-1)}} \ell$ cannot begin until the computation of $\nabla_{x^{(i)}} \ell$ finishes. This dependency hinders scalability when multiple workers (as an abstraction of devices) are available.
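The sequential chain of Equation 2 can be sketched in a few lines. This is a toy emulation with random stand-in transposed Jacobians (hypothetical shapes, not from a real model), showing that each step consumes the previous step's result, so at most one worker can be busy at a time:

```python
import numpy as np

# Toy 3-layer model: J[i] stands in for (dx^(i+1)/dx^(i))^T, which autodiff
# would normally produce. Values and shapes are arbitrary placeholders.
rng = np.random.default_rng(0)
dims = [4, 3, 3, 2]  # sizes of x^(0), x^(1), x^(2), x^(3)
J = [rng.standard_normal((dims[i], dims[i + 1])) for i in range(3)]
grad = rng.standard_normal(dims[-1])  # gradient of the loss w.r.t. x^(3)

grads = [grad]
for i in reversed(range(3)):  # from i = n down to 1: strictly sequential
    grads.append(J[i] @ grads[-1])  # grad w.r.t. x^(i) needs grad w.r.t. x^(i+1)
# grads[k] is the gradient w.r.t. x^(3-k)
```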
3.1 Back-propagation as a Scan Operation
We define a binary, associative, and non-commutative operator

$$A \diamond B = BA \quad (3)$$

whose identity value is the identity matrix $I$, where $A$ can be either a matrix or a vector and $B$ is a matrix. Using the operator $\diamond$, we can reformulate Equation 2 as a linear scan with $\diamond$ over the following array, whose prefix results are exactly the gradients $\nabla_{x^{(i)}} \ell$:

$$\left[\nabla_{x^{(n)}} \ell,\ \left(\frac{\partial x^{(n)}}{\partial x^{(n-1)}}\right)^{T},\ \ldots,\ \left(\frac{\partial x^{(1)}}{\partial x^{(0)}}\right)^{T}\right] \quad (4)$$
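The reformulation above can be sketched as follows, assuming the operator takes the form "apply the second operand on the left". The toy shapes and values are placeholders; the point is that an inclusive linear scan over [gradient, transposed Jacobians deepest-first] reproduces exactly the gradients BP computes:

```python
import numpy as np

def op(a, b):
    # Non-commutative operator: the second operand is applied on the left.
    # a may be a vector or a matrix; b is a matrix.
    return b @ a

rng = np.random.default_rng(1)
grad = rng.standard_normal(2)  # first array element: gradient w.r.t. x^(n)
Jt = [rng.standard_normal((3, 2)),  # remaining elements: transposed Jacobians,
      rng.standard_normal((3, 3)),  # ordered from layer n down to layer 1
      rng.standard_normal((4, 3))]

prefixes = [grad]
for j in Jt:  # inclusive scan over [grad, Jt[0], Jt[1], Jt[2]]
    prefixes.append(op(prefixes[-1], j))
# Each prefix is the gradient w.r.t. one intermediate output;
# the identity value of op is the identity matrix: op(x, I) == x.
```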
3.2 Scaling Back-propagation with the Blelloch Scan Algorithm
We can parallelize the computation of Equation 4 on multiple workers with the Blelloch scan algorithm Blelloch (1990), formally described in Algorithm 1. The algorithm contains two phases: up-sweep and down-sweep. Figure 2 visualizes this algorithm applied to the convolution layers of VGG-11 Simonyan and Zisserman (2015), with levels L0-L4 as the up-sweep and levels L5-L10 as the down-sweep. Only the up-sweep phase contains matrix-matrix multiplications. Due to the non-commutativity of the operator $\diamond$, we have to reverse the operands of $\diamond$ during the down-sweep phase. This modification is reflected in line 13 of Algorithm 1 and visualized in Figure 5.
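As a sketch (not the paper's exact Algorithm 1), a sequential emulation of the modified Blelloch scan for a non-commutative operator might look like the following. The inner loops over i are the parts that would run across workers in parallel, and the operand reversal in the down-sweep is the modification described above; a string-concatenation stand-in for the operator makes the ordering easy to check:

```python
def blelloch_scan(arr, op, identity):
    """Exclusive Blelloch scan for an associative, non-commutative op.
    Sequential emulation: each inner loop would run across workers in
    parallel. Assumes len(arr) is a power of two."""
    a = list(arr)
    n = len(a)
    levels = n.bit_length() - 1
    # Up-sweep: reduce pairs up the tree (the only matrix-matrix steps).
    for d in range(levels):
        for i in range(0, n, 2 ** (d + 1)):
            left, right = i + 2 ** d - 1, i + 2 ** (d + 1) - 1
            a[right] = op(a[left], a[right])
    a[n - 1] = identity
    # Down-sweep: push prefixes back down; operands of op are reversed
    # relative to the classic scan because op is non-commutative.
    for d in reversed(range(levels)):
        for i in range(0, n, 2 ** (d + 1)):
            left, right = i + 2 ** d - 1, i + 2 ** (d + 1) - 1
            t = a[left]
            a[left] = a[right]
            a[right] = op(a[right], t)
    return a

# Stand-in operator mirroring "apply the second operand on the left":
compose = lambda a, b: b + a
out = blelloch_scan(["g", "J1", "J2", "J3"], compose, "")
```

Each output slot holds the composition of all earlier inputs, e.g. `out[3]` composes "g", then "J1", then "J2".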
3.3 Jacobian Matrices in Sparse Format
A full Jacobian matrix of $f^{(i)}$ can be too expensive to generate, store, and process. In fact, the Jacobian matrix of the first convolution operator in VGG-11 Simonyan and Zisserman (2015) processing a 3 × 32 × 32 image occupies 768 MB of memory if stored as a full matrix, which is prohibitively large. Fortunately, the Jacobian matrices of certain operators (such as convolution, ReLU, and max-pooling) can be very sparse, as shown in Figure 5. In our implementation, the transposed Jacobian matrices are represented in the Compressed Sparse Row (CSR) Saad (2003) format due to the existing support for CSR matrix multiplication routines in sparse matrix libraries such as cuSPARSE Corporation (2018a). Representing the same Jacobian of the aforementioned convolution operator in CSR shrinks the memory consumption down to only 6.5 MB. Two types of zeros appear in an operator’s Jacobian: guaranteed zeros, which are input-invariant (e.g., zeros outside of the diagonal of ReLU’s Jacobian), and possible zeros, which are input-dependent (e.g., zeros on the diagonal of ReLU’s Jacobian). For any operator, the distribution pattern of the guaranteed zeros (named the sparsity pattern for brevity) in the Jacobian is determined by the model architecture and known ahead of time. Thus, this information can be used to build more efficient sparse matrix multiplication routines than the generic ones in conventional sparse matrix libraries (e.g., cuSPARSE). The sparsity of guaranteed zeros (defined as their fraction over all elements in a matrix) for various operators is listed in Table 1.
| Operator | Filter/Kernel Size | Input Size | Output Size | Sparsity | Examples † |
The examples of sparsity for the first convolution, ReLU, and max-pooling operators of VGG-11 Simonyan and Zisserman (2015) operating on 3 × 32 × 32 images are shown in the last column of the table.
† The sparsity for convolution is an approximation that holds when the input height and width are much greater than the padding size.
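The dense-vs-CSR memory gap can be reproduced with a generic sparse library. The sketch below (using SciPy as a stand-in for cuSPARSE, with a hypothetical 4096-element input) builds the diagonal transposed Jacobian of ReLU both densely and in CSR, and compares footprints:

```python
import numpy as np
from scipy import sparse

# ReLU's (transposed) Jacobian is diagonal: a one wherever the input was
# positive, a (possible) zero elsewhere. Input values are arbitrary.
x = np.random.default_rng(2).standard_normal(4096)
J_dense = np.diag((x > 0).astype(np.float32))  # 4096 * 4096 * 4 bytes = 64 MiB
J_csr = sparse.csr_matrix(J_dense)

dense_bytes = J_dense.nbytes
csr_bytes = J_csr.data.nbytes + J_csr.indices.nbytes + J_csr.indptr.nbytes
# csr_bytes is orders of magnitude smaller than dense_bytes
```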
3.4 Generating Jacobian Matrices in CSR Analytically
To avoid the impractically inefficient approach of generating the Jacobian of an operator one column at a time (either numerically or via automatic differentiation Paszke et al. (2017)), we develop analytical routines that generate the transposed Jacobian directly in the CSR format.
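As an illustration of the idea (not one of the paper's algorithms), the transposed Jacobian of ReLU can be emitted directly in CSR: row i has a single non-zero, a one at column i, iff x[i] > 0, so the indptr/indices/data arrays can be written out analytically without ever materializing a dense matrix or probing it column by column:

```python
import numpy as np
from scipy import sparse

def relu_transposed_jacobian_csr(x):
    # Hypothetical helper, illustrative of the analytical-generation idea.
    mask = x > 0
    data = np.ones(int(mask.sum()), dtype=x.dtype)   # all non-zeros are ones
    indices = np.flatnonzero(mask)                   # column of each non-zero
    indptr = np.concatenate(([0], np.cumsum(mask)))  # row start offsets
    return sparse.csr_matrix((data, indices, indptr), shape=(x.size, x.size))
```

Multiplying this matrix with an upstream gradient reproduces ReLU's backward pass.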
For the transposed Jacobian of a convolution operator with a 3 × 3 filter and a padding size of 1 on all sides of the height and width, our methods of generating the CSR indptr, indices, and data arrays Varoquaux et al. ([n. d.]) are formally described in Algorithm 2, Algorithm 3, and Algorithm 4, respectively.
For the transposed Jacobian of a max-pooling operator, assuming that the stride size and the window size are the same, and that we can access a tensor (named pool_indices for brevity) which specifies the indices of the elements in the input tensor that are “pooled” into the output tensor (documented in pyt ([n. d.])), our methods of generating the CSR indptr, indices, and data arrays Varoquaux et al. ([n. d.]) are formally described in Algorithm 8, Algorithm 9, and Algorithm 10, respectively.
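A sketch of the max-pooling case under the same assumptions, with pool_indices as described above (e.g., what PyTorch's MaxPool2d returns with return_indices=True): output element k is copied from input element pool_indices[k], so column k of the transposed Jacobian (of shape input_size × output_size) has a single one at row pool_indices[k]. For brevity this goes through SciPy's coordinate constructor rather than writing the indptr/indices/data arrays directly as Algorithms 8-10 do:

```python
import numpy as np
from scipy import sparse

def maxpool_transposed_jacobian(pool_indices, input_size):
    # Illustrative helper: one non-zero per output element.
    rows = np.asarray(pool_indices).ravel()  # argmax position of each output
    cols = np.arange(rows.size)              # one column per output element
    data = np.ones(rows.size, dtype=np.float32)
    return sparse.csr_matrix((data, (rows, cols)),
                             shape=(input_size, rows.size))
```

Multiplying by an upstream gradient scatters each gradient entry back to its argmax position, which is exactly max-pooling's backward pass.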
4 Evaluation

We evaluate our method against two baselines: (1) the stronger baseline of PyTorch Autograd Paszke et al. (2017), a widely adopted implementation of BP; and (2) the weaker baseline of a linear scan, which emulates BP by explicitly multiplying the transposed Jacobian in CSR with the gradient (as shown in Equation 2) via sparse matrix multiplication routines.
4.1 Convergence

Theoretically, our algorithm is an exact reconstruction of BP rather than an estimation, and is hence expected to reproduce exactly the same outputs. In practice, however, numerical differences could be introduced by the change in the order of matrix multiplications. We apply our algorithm to train LeNet-5 Lecun et al. (1998) on CIFAR-10 Krizhevsky (2009) to demonstrate that such numerical differences do not affect model convergence. We use a mini-batch size of 256 and the SGD Qian (1999) optimizer with a learning rate of 0.001 and a momentum of 0.9. The result plotted in Figure 6 shows that our algorithm has a negligible impact on convergence compared to the original BP.
4.2 Complexity Analysis
We leverage the following definitions to quantify the complexity of a parallel algorithm: (1) step complexity ($S$), which evaluates the number of steps on the critical path; (2) per-step complexity ($C$), which evaluates the runtime of a single step; and (3) work complexity ($W$), which evaluates the total number of steps executed by all workers. Assuming the system can be conceptualized as a parallel random-access machine (PRAM) Kruskal et al. (1990), the number of workers is $p$, and the size of the input array in Equation 4 is $n$, the step and work complexity of our algorithm can be derived as:

$$S = \Theta(\log n), \quad W = \Theta(n)$$

compared to $S = \Theta(n)$, $W = \Theta(n)$ for the linear scan (which has the same step and work complexity as BP). Therefore, in an ideal scenario with an unbounded number of workers and unit per-step complexity, our algorithm reduces the latency from $\Theta(n)$ to $\Theta(\log n)$. If, however, we account for the difference in per-step complexity between our algorithm ($C_1$, dominated by matrix-matrix multiplications) and the baseline ($C_2$, dominated by matrix-vector multiplications), our algorithm has a latency of $\Theta(C_1 \log n)$ compared to $\Theta(C_2 n)$ for the baseline. There are two approaches to make our algorithm achieve a lower latency and better scaling than the baseline. First, we can reduce $C_1$, which is reflected in leveraging the CSR sparse matrix multiplication routines analyzed in Section 4.3. Second, even without a lower $C_1$, our algorithm can still outperform the baseline when $C_2 n > C_1 \log n$. This can occur when $n$ grows larger than the dimension of the gradients $\nabla_{x^{(i)}} \ell$. Possible use cases include computing the inverse kinematics Marschner and Shirley (2016) of a long skeleton (such as simulating the movement of a snake) found in 3D animation.
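The trade-off can be made concrete with a back-of-the-envelope model, where C1 and C2 are hypothetical per-step costs (matrix-matrix for the scan, matrix-vector for the linear baseline) and n is the array length:

```python
from math import ceil, log2

def scan_latency(n, c1):
    # Critical path of the Blelloch scan: up-sweep plus down-sweep,
    # each about log2(n) levels, at matrix-matrix cost c1 per step.
    return 2 * ceil(log2(n)) * c1

def linear_latency(n, c2):
    # Linear scan / BP: n sequential steps at matrix-vector cost c2 each.
    return n * c2

# Even if a matrix-matrix step costs 10x a matrix-vector step, the scan's
# critical path wins once n is large enough.
```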
4.3 Microbenchmark Analysis
We prototype our algorithm with cuSPARSE Corporation (2018a) on an RTX 2070 GPU (Turing architecture) to perform sparse matrix-matrix multiplications (csrgemm), and apply it to the convolution layers of VGG-11 Simonyan and Zisserman (2015) (visualized in Figure 2) for training on CIFAR-10 Krizhevsky (2009). The latency of every step (named the per-step latency for brevity) is derived by averaging measurements taken via CUDA Events Corporation ([n. d.]) over 50 trials after 5 warm-up runs. Each trial performs the backward pass for a single iteration of a single image sample. The goal is to reduce the per-step latency of our method below the maximum per-step latency of the baselines, measured under the same settings. Figure 6(a) shows the corresponding result. We can observe that many steps of sparse matrix-matrix multiplication meet the weaker baseline of the linear scan; however, a few have significantly higher per-step latencies.
Since the sparsity pattern can be determined ahead of time from the model architecture, the values in the CSR indptr and indices arrays Varoquaux et al. ([n. d.]) stay the same across batches and iterations. Therefore, the process of joining the indptr and indices arrays in a generic csrgemm routine Bank and Douglas (1993) could be omitted, and the functions that transform the input data arrays into the output data array for each step could be generated and compiled ahead of time, before training begins, potentially reaching better performance than cuSPARSE Corporation (2018a). Considering the possibility of such a better implementation, we project a lower bound on the per-step latency by measuring the number of floating-point operations (FLOPs) performed on the CSR data arrays for each step and dividing it by the theoretical FP32 throughput of the RTX 2070 Corporation (2018b). The result is shown in Figure 6(b), from which we can observe that most steps meet the stronger baseline of PyTorch Autograd Paszke et al. (2017); however, three matrix-matrix multiplications have latencies significantly higher than the others. For every matrix-matrix multiplication, the numbers of non-zero elements (normalized by the size of the matrix) of the left and right operand matrices are plotted in Figure 9. We can observe that sparsity helps reduce the per-step latency as expected, and that the three performance anomalies are caused by a loss of sparsity in the operands.
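The claim that the symbolic part of a csrgemm can be hoisted out of the training loop is easy to check with a generic library (SciPy below, as a stand-in for cuSPARSE): products of operands that share a sparsity pattern share their indptr/indices, so only the data arrays change from step to step:

```python
import numpy as np
from scipy import sparse

# Fix one sparsity pattern, then fill it with different values.
pattern = sparse.random(64, 64, density=0.05, format="csr", random_state=3)

def with_values(pat, seed):
    # Same indptr/indices, fresh data array (arbitrary values).
    m = pat.copy()
    m.data = np.random.default_rng(seed).standard_normal(m.nnz)
    return m

P1 = (with_values(pattern, 0) @ with_values(pattern, 1)).sorted_indices()
P2 = (with_values(pattern, 2) @ with_values(pattern, 3)).sorted_indices()
# P1 and P2 have identical structure (indptr/indices); only data differs.
```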
From Algorithm 4, we observe that the values in the Jacobian of a convolution operator depend only on the filter weights. Thus, pruning the weights leads to higher sparsity in the Jacobian, which in turn reduces the per-step latency of sparse matrix-matrix multiplications. Nevertheless, despite recent advances in network pruning algorithms Han et al. (2015); See et al. (2016); He et al. (2017), there is no widely adopted software or hardware platform that can capitalize on pruning, as most techniques are evaluated through “masking simulation”, which leads to the same (if not worse, due to masking) latency and memory usage. Therefore, we identify the retraining of pruned networks as a strong use case that can benefit from our algorithm. We demonstrate this impact through an experiment on VGG-11 Simonyan and Zisserman (2015): training on CIFAR-10 Krizhevsky (2009); pruning away a fixed percentage of the weights in all convolution and linear operators using the technique proposed by See et al. See et al. (2016); and retraining the pruned network. We choose the pruning percentage so that a similar validation accuracy can be reached after retraining for the same number of epochs as training. The lower-bound projection on the per-step latency is derived via the same aforementioned method. Based on the result shown in Figure 6(c), we can observe that all of the steps reach (with one anomaly that comes close to) the stronger baseline of PyTorch Autograd Paszke et al. (2017).
The runtime overhead added to the forward pass by generating the transposed Jacobian in CSR is not a primary concern for two reasons. For input-dependent Jacobians (such as those of ReLU and max-pooling), the number of non-zero elements is at most the input size, which leads to extremely high sparsity and negligible overhead. For input-invariant Jacobians (such as that of convolution), the generation can be performed at any time during an iteration and thus removed from the critical path, and its runtime can be further amortized by a larger batch size.
The sparsity loss in the transposed Jacobian matrices comes from two possible sources. First, based on Table 1, when the ratio of the filter size to the input height and width increases for convolutions, the sparsity of the Jacobian decreases. This can occur when the activation passes through max-pooling and its height and width are reduced by the stride factor (halved in the case of VGG-11 Simonyan and Zisserman (2015)). This phenomenon is demonstrated in Figure 8(a) with the same experiment as in Section 4.3, and is less limiting when retraining pruned networks (Figure 8(b)). Second, after the multiplication of two sparse matrices, the product matrix can have more non-zero elements than the operand matrices. Therefore, the sparsity can decrease as the up-sweep phase progresses to deeper levels, which can be observed in Figure 8(b) by comparing the deeper levels with the previous ones. This challenge can be mitigated by balancing the number of levels that the up-sweep phase progresses against the sparsity of the product matrices at each level.
Another approach to reduce the per-step complexity of sparse matrix-matrix multiplication is to use Monte Carlo Approximate Matrix Multiplication Drineas et al. (2006). However, a trade-off between the convergence rate and the per-step latency has to be made, as a lower per-step latency comes with an inaccurate matrix product, which leads to noticeable errors in the resulting gradients.
5 Conclusion

This paper explores a new direction for scaling BP. We reformulate BP as a scan operation, which we scale with our modified version of the Blelloch scan algorithm. While our approach achieves superior scalability in terms of step complexity, its per-step complexity increases due to the multiplications of transposed Jacobian matrices during the up-sweep phase, as opposed to only multiplications of transposed Jacobians with gradient vectors in the baseline. We develop methods to efficiently generate the transposed Jacobians of various operators directly in CSR, and we leverage the CSR matrix multiplication routine (csrgemm) to reduce the per-step complexity. While csrgemm is an accessible tool for prototyping, we believe a more efficient implementation is achievable since the sparsity pattern of the Jacobian is static and known ahead of time. We demonstrate that the retraining of pruned networks can potentially benefit from our algorithm. Future work includes investigating custom sparse matrix formats so that the per-step complexity can be reduced even further. We hope that our work will inspire radically new ideas and designs to improve distributed DNN training.
- pyt ([n. d.]) [n. d.]. TORCH.NN. https://pytorch.org/docs/stable/nn.html#torch.nn.MaxPool2d. ([n. d.]). Accessed: 2019-05-23.
- Bank and Douglas (1993) Randolph E. Bank and Craig C. Douglas. 1993. Sparse matrix multiplication package (SMMP). Advances in Computational Mathematics 1, 1 (01 Feb 1993), 127–137. https://doi.org/10.1007/BF02070824
- Blelloch (1990) Guy E. Blelloch. 1990. Prefix Sums and Their Applications. Technical Report CMU-CS-90-190. School of Computer Science, Carnegie Mellon University.
- Coleman et al. (2017) Cody A. Coleman, Deepak Narayanan, Daniel Kang, Tian Jiao Zhao, Jian Zhang, Luigi Nardi, Peter Bailis, Kunle Olukotun, Christopher Ré, and Matei A. Zaharia. 2017. DAWNBench: An End-to-End Deep Learning Benchmark and Competition.
- Corporation ([n. d.]) NVIDIA Corporation. [n. d.]. 5.5. Event Management. ([n. d.]). https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EVENT.html Accessed: 2019-05-20.
- Corporation (2018a) NVIDIA Corporation. 2018a. cuSPARSE :: CUDA Toolkit Documentation. (2018). https://docs.nvidia.com/cuda/cusparse/index.html [Accessed: 2018-11-06].
- Corporation (2018b) NVIDIA Corporation. 2018b. NVIDIA Turing Architecture Whitepaper. Technical Report WP-09183-001_v01. https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf
- Drineas et al. (2006) Petros Drineas, Ravi Kannan, and Michael W. Mahoney. 2006. Fast Monte Carlo Algorithms for Matrices I: Approximating Matrix Multiplication. SIAM J. Comput. 36, 1 (July 2006), 132–157. https://doi.org/10.1137/S0097539704442684
- Esmaeilzadeh et al. (2011) Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark Silicon and the End of Multicore Scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA ’11). ACM, New York, NY, USA, 365–376. https://doi.org/10.1145/2000064.2000108
- Han et al. (2015) Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both Weights and Connections for Efficient Neural Network. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., 1135–1143. http://papers.nips.cc/paper/5784-learning-both-weights-and-connections-for-efficient-neural-network.pdf
- Harlap et al. (2018) Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. 2018. PipeDream: Fast and Efficient Pipeline Parallel DNN Training. (2018). https://arxiv.org/abs/1806.03377
- He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. CoRR abs/1512.03385 (2015). arXiv:1512.03385 http://arxiv.org/abs/1512.03385
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity Mappings in Deep Residual Networks. CoRR abs/1603.05027 (2016). arXiv:1603.05027 http://arxiv.org/abs/1603.05027
- He et al. (2017) Yihui He, Xiangyu Zhang, and Jian Sun. 2017. Channel Pruning for Accelerating Very Deep Neural Networks. CoRR abs/1707.06168 (2017). arXiv:1707.06168 http://arxiv.org/abs/1707.06168
- Huang et al. (2016) Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. 2016. Densely Connected Convolutional Networks. CoRR abs/1608.06993 (2016). arXiv:1608.06993 http://arxiv.org/abs/1608.06993
- Huang et al. (2018) Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, and Zhifeng Chen. 2018. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. CoRR abs/1811.06965 (2018). arXiv:1811.06965 http://arxiv.org/abs/1811.06965
- Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. http://arxiv.org/abs/1412.6980
- Krizhevsky (2009) Alex Krizhevsky. 2009. Learning multiple layers of features from tiny images. Technical Report.
- Krizhevsky (2014) Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. CoRR abs/1404.5997 (2014). arXiv:1404.5997 http://arxiv.org/abs/1404.5997
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1 (NIPS’12). Curran Associates Inc., USA, 1097–1105. http://dl.acm.org/citation.cfm?id=2999134.2999257
- Kruskal et al. (1990) Clyde P. Kruskal, Larry Rudolph, and Marc Snir. 1990. A complexity theory of efficient parallel algorithms. Theoretical Computer Science 71, 1 (1990), 95 – 132. https://doi.org/10.1016/0304-3975(90)90192-K
- Lecun et al. (1998) Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. In Proceedings of the IEEE. 2278–2324.
- Marschner and Shirley (2016) Steve Marschner and Peter Shirley. 2016. Fundamentals of Computer Graphics, Fourth Edition (4th ed.). A. K. Peters, Ltd., Natick, MA, USA.
- Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic Differentiation in PyTorch. In NIPS Autodiff Workshop.
- Qian (1999) Ning Qian. 1999. On the Momentum Term in Gradient Descent Learning Algorithms. Neural Netw. 12, 1 (Jan. 1999), 145–151. https://doi.org/10.1016/S0893-6080(98)00116-6
- Rhu et al. (2016) Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W. Keckler. 2016. Virtualizing Deep Neural Networks for Memory-Efficient Neural Network Design. CoRR abs/1602.08124 (2016). arXiv:1602.08124 http://arxiv.org/abs/1602.08124
- Rumelhart et al. (1988) David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1988. Neurocomputing: Foundations of Research. MIT Press, Cambridge, MA, USA, Chapter Learning Representations by Back-propagating Errors, 696–699. http://dl.acm.org/citation.cfm?id=65669.104451
- Saad (2003) Y. Saad. 2003. Iterative Methods for Sparse Linear Systems (2nd ed.). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA.
- See et al. (2016) Abigail See, Minh-Thang Luong, and Christopher D. Manning. 2016. Compression of Neural Machine Translation Models via Pruning. CoRR abs/1606.09274 (2016). arXiv:1606.09274 http://arxiv.org/abs/1606.09274
- Simonyan and Zisserman (2015) Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. http://arxiv.org/abs/1409.1556
- Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going Deeper with Convolutions. In Computer Vision and Pattern Recognition (CVPR). http://arxiv.org/abs/1409.4842
- Varoquaux et al. ([n. d.]) Gaël Varoquaux, Emmanuelle Gouillart, and Olav Vahtras. [n. d.]. Compressed Sparse Row Format (CSR). ([n. d.]). https://scipy-lectures.org/advanced/scipy_sparse/csr_matrix.html Accessed: 2019-05-23.
- Zhu et al. (2018) Hongyu Zhu, Mohamed Akrout, Bojian Zheng, Andrew Pelegris, Anand Jayarajan, Amar Phanishayee, Bianca Schroeder, and Gennady Pekhimenko. 2018. Benchmarking and Analyzing Deep Neural Network Training. In 2018 IEEE International Symposium on Workload Characterization, IISWC 2018, Raleigh, NC, USA, September 30 - October 2, 2018. 88–100. https://doi.org/10.1109/IISWC.2018.8573476