1 Introduction
The training of deep learning models demands ever more compute resources as models have grown more powerful and complex, with an increasing number of layers, in recent years Krizhevsky et al. (2012); Szegedy et al. (2015); Simonyan and Zisserman (2015); He et al. (2015); Huang et al. (2016). For example, ResNet can have more than a thousand layers He et al. (2016) and take days to train on state-of-the-art GPUs Coleman et al. (2017). As the performance of a single compute device plateaus Esmaeilzadeh et al. (2011), training has to be designed to scale on massively parallel systems. Model parallelism Krizhevsky (2014); Harlap et al. (2018); Huang et al. (2018) is one approach that scales training by partitioning a model and distributing its parts among multiple devices. It is the common approach when a model cannot fit into one device due to memory constraints (caused by, for example, a deep network architecture, a large batch size, or a high input resolution Rhu et al. (2016); Zhu et al. (2018)). However, training under model parallelism does not scale well with the number of devices. The fundamental reason for this scalability issue is that the back-propagation algorithm (BP) Rumelhart et al. (1988) imposes a strong sequential dependency between layers during the gradient computation, which allows at most one device to be utilized at any given point in time. This dependency prevents BP from being efficiently deployed in a distributed computing environment. We therefore seek to address this fundamental limitation by restructuring the dependency so that BP can be scaled by parallel algorithms. In summary, we make the following major contributions:
We reformulate BP as a scan Blelloch (1990) operation and modify the Blelloch scan algorithm Blelloch (1990) to efficiently scale BP in a distributed computing environment. Our method has a theoretical step complexity of Θ(log n), where n represents the number of devices into which a model is partitioned, compared to the Θ(n) of the naïve implementation of model parallelism.

We perform an in-depth runtime analysis of our algorithm, identify the challenges in applying it in a practical setting, and propose a new and efficient approach that tackles these challenges using sparse Jacobian matrices. We then develop routines that efficiently generate the transposed Jacobians of various operators directly in a sparse format.
2 Background and Prior Work
To increase the utilization of hardware resources with model parallelism, prior works (e.g., PipeDream Harlap et al. (2018) and GPipe Huang et al. (2018)) propose to pipeline the computation of the forward and backward passes across devices. However, these solutions are not "silver bullets" to scalability for the following reasons. First, both PipeDream Harlap et al. (2018) and GPipe Huang et al. (2018) require storing multiple versions of weights and activations for all mini-batches that enter the pipeline. Thus, the memory consumption grows linearly with the length of the pipeline. As a result, the maximum number of devices that can be supported is limited by the memory capacity of a single device (such as the GPU global memory). Second, if the parameter updates are not fully synchronized, as proposed in PipeDream Harlap et al. (2018), staleness is introduced. Although Harlap et al. argue that the staleness produced by their method does not affect the update step of a vanilla SGD optimizer Harlap et al. (2018), this argument no longer holds when pipelining is combined with other techniques commonly used in first-order optimizers (such as the momentum in Adam Kingma and Ba (2015)). Conversely, if the gradient updates are fully synchronized, as proposed in GPipe Huang et al. (2018), the "bubble of idleness" between the forward and backward passes grows linearly with the length of the pipeline, thus linearly reducing hardware utilization and defeating the original purpose of pipelining.
Our approach fundamentally differs from these key prior works Harlap et al. (2018); Huang et al. (2018) in the following ways. First, instead of following the dependency of BP, we reformulate BP so that scaling is achieved via the Blelloch scan algorithm Blelloch (1990), which is designed for parallelism. Second, the original BP is reconstructed exactly, so estimation errors such as staleness do not exist; our method is therefore agnostic to the exact first-order optimizer being used. Third, our approach becomes more advantageous as the number of devices increases, instead of hitting diminishing returns or scalability limits.
3 Problem Formulation and Proposed Method
We conceptualize a model as a vector function f composed of sub-functions f_i: x_i = f_i(x_{i-1}; θ_i), i ∈ {1, ..., n}, where θ_i are the parameters of the model. The model is evaluated by an objective function l(x_n), where x_0 is the initial input to the model. This formulation applied to convolutional neural networks can be visualized in Figure 1.
To train the model f, a first-order optimizer requires the gradients ∇_{θ_i} l, which are derived from the gradients ∇_{x_i} l:

∇_{θ_i} l = (∂x_i/∂θ_i)^T ∇_{x_i} l    (1)

where ∂x_i/∂θ_i is the Jacobian matrix of the output of f_i with respect to its parameters θ_i. To derive ∇_{x_{i-1}} l given ∇_{x_n} l, BP Rumelhart et al. (1988) solves the following recursive equation, from i = n to i = 1:

∇_{x_{i-1}} l = (∂x_i/∂x_{i-1})^T ∇_{x_i} l    (2)

where ∂x_i/∂x_{i-1} is the Jacobian matrix of the output of f_i with respect to its input x_{i-1}. Equation 1 itself has no dependency along i; therefore, the computation of ∇_{θ_i} l can be parallelized if all ∇_{x_i} l are available. However, Equation 2 imposes a strong sequential dependency along i, where the computation of ∇_{x_{i-1}} l cannot begin until the computation of ∇_{x_i} l finishes, and therefore hinders scalability when multiple workers (as an abstraction of devices) are available.
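To make the dependency concrete, the recurrence can be sketched in a few lines of NumPy for a toy linear network where each stage is x_i = W_i x_{i-1} (so each input Jacobian is simply W_i); the names below are illustrative rather than from the paper:

```python
import numpy as np

# A minimal sketch of Equations 1 and 2 for a toy linear network,
# assuming each stage is x_i = W_i @ x_{i-1}. All names here
# (weights, grad_x, grad_theta) are illustrative.

rng = np.random.default_rng(0)
dims = [4, 3, 3, 2]                      # sizes of x_0 .. x_3
weights = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(3)]

# Forward pass: record every activation x_0 .. x_n.
x = [rng.standard_normal(dims[0])]
for W in weights:
    x.append(W @ x[-1])

grad_x = np.ones(dims[-1])               # dl/dx_n for l = sum(x_n)
grad_theta = [None] * 3
for i in reversed(range(3)):             # Equation 2: sequential along i
    # Equation 1: for a linear stage, (dx/dW)^T dl/dx is the outer
    # product of the upstream gradient with the stage's input.
    grad_theta[i] = np.outer(grad_x, x[i])
    # dl/dx_{i-1} = (dx_i/dx_{i-1})^T dl/dx_i
    grad_x = weights[i].T @ grad_x
```

The loop body at step i cannot run before step i+1 finishes, which is exactly the sequential dependency that Section 3.1 restructures.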
3.1 Backpropagation as a Scan Operation
We define a binary, associative, and non-commutative operator A ∘ B = BA, whose identity value is the identity matrix I, where A can be either a matrix or a vector and B is a matrix. Using the operator ∘, we can reformulate Equation 2 as calculating the following array:

[∇_{x_n} l, ∇_{x_{n-1}} l, ..., ∇_{x_0} l]
  = [∇_{x_n} l, ∇_{x_n} l ∘ (∂x_n/∂x_{n-1})^T, ..., ∇_{x_n} l ∘ (∂x_n/∂x_{n-1})^T ∘ ... ∘ (∂x_1/∂x_0)^T]    (3)

Based on the definition proposed by Blelloch Blelloch (1990), Equation 3 can be interpreted as a scan (all-prefix-sums) operation of ∘ on the input array:

[∇_{x_n} l, (∂x_n/∂x_{n-1})^T, (∂x_{n-1}/∂x_{n-2})^T, ..., (∂x_1/∂x_0)^T]    (4)
3.2 Scaling Backpropagation with the Blelloch Scan Algorithm
We can parallelize the computation of Equation 3 across multiple workers with the Blelloch scan algorithm Blelloch (1990), formally described in Algorithm 1. The algorithm contains two phases: up-sweep and down-sweep. Figure 2 visualizes this algorithm applied to the convolution layers of VGG-11 Simonyan and Zisserman (2015), with levels L0–L4 as the up-sweep and levels L5–L10 as the down-sweep. Only the up-sweep phase contains matrix-matrix multiplications. Due to the non-commutativity of the operator ∘, we have to reverse the operands of ∘ during the down-sweep phase. This modification is reflected on line 13 of Algorithm 1 and visualized in Figure 5.
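As a sketch of the modified algorithm, the following Python implements a Blelloch-style exclusive scan for a non-commutative operator, with the operand reversal in the down-sweep marked. This is an illustrative reconstruction (a power-of-two input length is assumed for brevity), not the paper's Algorithm 1 verbatim:

```python
# A sketch of a Blelloch scan adapted for a non-commutative, associative
# operator `op` with identity `identity`. The operand order on the marked
# line is reversed relative to the textbook (commutative) version, which
# is the modification the paper makes for the operator of Equation 3.

def blelloch_scan(arr, op, identity):
    a = list(arr)
    n = len(a)                 # assumed to be a power of two here
    levels = n.bit_length() - 1
    # Up-sweep: the only phase with full "matrix-matrix" combines.
    for d in range(levels):
        step = 2 ** (d + 1)
        for i in range(0, n, step):
            a[i + step - 1] = op(a[i + step // 2 - 1], a[i + step - 1])
    # Down-sweep, with the operands of `op` reversed.
    a[n - 1] = identity
    for d in reversed(range(levels)):
        step = 2 ** (d + 1)
        for i in range(0, n, step):
            left = a[i + step // 2 - 1]
            a[i + step // 2 - 1] = a[i + step - 1]
            a[i + step - 1] = op(a[i + step - 1], left)  # reversed order
    return a  # exclusive scan: a[k] = arr[0] op ... op arr[k-1]
```

Running it with string concatenation (associative but non-commutative) makes the prefix structure visible: `blelloch_scan(["a", "b", "c", "d"], lambda x, y: x + y, "")` yields `["", "a", "ab", "abc"]`. With the operator of Section 3.1 and the array of Equation 4, each prefix is one of the gradients of Equation 3.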
3.3 Jacobian Matrices in Sparse Format
A full Jacobian matrix of f_i can be too expensive to generate, store, and process. In fact, the Jacobian matrix of the first convolution operator in VGG-11 Simonyan and Zisserman (2015) processing a 32 × 32 image occupies 768 MB of memory if it is stored as a full matrix, which is prohibitively large. Fortunately, the Jacobian matrices of certain operators (such as convolution, ReLU, and max-pooling) can be very sparse, as shown in Figure 5. In our implementation, the transposed Jacobian matrices are represented in the Compressed Sparse Row (CSR) Saad (2003) format due to the existing support for CSR matrix multiplication routines in sparse matrix libraries such as cuSPARSE Corporation (2018a). Representing the same Jacobian of the aforementioned convolution operator in CSR shrinks the memory consumption down to only 6.5 MB. Two types of zeros appear in an operator's Jacobian: guaranteed zeros that are input-invariant (e.g., zeros outside of the diagonal of the ReLU's Jacobian); and possible zeros that are input-dependent (e.g., zeros on the diagonal of the ReLU's Jacobian). For any operator, the pattern of the distribution of guaranteed zeros (termed the sparsity pattern for brevity) in the Jacobian is determined by the model architecture and known ahead of time. Thus, this information can be used to build more efficient sparse matrix multiplication routines than the generic ones in conventional sparse matrix libraries (e.g., cuSPARSE). The sparsity of guaranteed zeros (defined as their fraction over all elements in a matrix) for various operators is listed in Table 1.

Table 1: Sparsity of guaranteed zeros for various operators.

Operator      Sparsity (example †)
Convolution   0.99157 ‡
ReLU          0.99998
Max-pooling   0.99994

† Sparsity of the first convolution, ReLU, and max-pooling operators of VGG-11 Simonyan and Zisserman (2015) operating on 32 × 32 images.
‡ Approximation that holds when the input height and width are much greater than the padding size.
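The memory figures above can be sanity-checked with a back-of-envelope calculation, assuming the first convolution of VGG-11 maps a 3×32×32 input to a 64×32×32 output with a 3×3 filter and fp32 storage (these shapes are our assumptions for illustration; the CSR estimate below counts data and column indices at 4 bytes each plus row pointers, so it is a crude upper bound rather than the exact 6.5 MB reported above):

```python
# Back-of-envelope check of the dense-vs-CSR memory figures, assuming
# the first VGG-11 convolution: 3x32x32 input -> 64x32x32 output, 3x3
# filter, fp32 elements. Shapes are illustrative assumptions.

rows = 64 * 32 * 32            # number of output elements
cols = 3 * 32 * 32             # number of input elements
dense_bytes = rows * cols * 4  # full fp32 Jacobian
assert dense_bytes == 768 * 2**20   # 768 MiB, matching the text

# Each output element depends on at most 3*3*3 = 27 inputs (3x3 filter
# over 3 input channels), so almost all entries are guaranteed zeros.
nnz_upper = rows * 27
sparsity_lower = 1 - nnz_upper / (rows * cols)     # ~0.991, cf. Table 1
csr_bytes_upper = 8 * nnz_upper + 4 * (rows + 1)   # data + indices + indptr
print(f"dense: {dense_bytes / 2**20:.0f} MiB, "
      f"CSR upper bound: {csr_bytes_upper / 2**20:.1f} MiB, "
      f"sparsity >= {sparsity_lower:.5f}")
```

The bound ignores border positions with fewer than 27 neighbors, so the true nnz (and hence the CSR footprint) is smaller, consistent with the roughly two-orders-of-magnitude reduction stated above.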
3.4 Generating Jacobian Matrix in CSR Analytically
To avoid the impractically inefficient approach of generating the Jacobian of an operator one column at a time (either numerically or via automatic differentiation Paszke et al. (2017)), we develop analytical routines that generate the transposed Jacobian directly in the CSR format.
Convolution
For the transposed Jacobian of a convolution operator with a 3 × 3 filter and a padding size of 1 on all sides of the height and width, our methods of generating the CSR indptr, indices, and data arrays Varoquaux et al. ([n. d.]) are formally described in Algorithm 2, Algorithm 3, and Algorithm 4, respectively.
ReLU
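Since ReLU acts element-wise, its transposed Jacobian is a diagonal matrix whose entries are 1 where the input is positive and 0 otherwise; the off-diagonal entries are the guaranteed zeros, and the zeros on the diagonal are the input-dependent possible zeros. The CSR arrays can therefore be written down directly, as in this minimal SciPy sketch (the function name is ours, not the paper's):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Illustrative sketch: generate the (transposed) Jacobian of ReLU
# directly in CSR. Array names follow the SciPy CSR convention
# (indptr / indices / data).

def relu_jacobian_csr(x):
    n = x.size
    data = (x > 0).astype(np.float32)          # diagonal entries
    indices = np.arange(n, dtype=np.int32)     # column index per entry
    indptr = np.arange(n + 1, dtype=np.int32)  # one stored entry per row
    return csr_matrix((data, indices, indptr), shape=(n, n))

x = np.array([1.5, -2.0, 0.0, 3.0], dtype=np.float32)
J = relu_jacobian_csr(x)
# Equation 2 for this operator: multiply the transposed Jacobian with
# the upstream gradient (here, a vector of ones).
grad_in = J.T @ np.ones(4, dtype=np.float32)
```

The stored zeros on the diagonal could additionally be dropped (e.g., via `eliminate_zeros()`) to exploit the possible zeros as well.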
Max-pooling
Assuming the stride size and the window size are the same, and that we can access a tensor (named pool_indices for brevity) which specifies the indices of the elements in the input tensor that are "pooled" into the output tensor (documented in pyt ([n. d.])), our methods of generating the CSR indptr, indices, and data arrays Varoquaux et al. ([n. d.]) are formally described in Algorithm 8, Algorithm 9, and Algorithm 10, respectively.

4 Evaluation
We evaluate our method against two baselines: (1) the stronger baseline of PyTorch Autograd Paszke et al. (2017), a widely adopted implementation of BP; and (2) the weaker baseline of a linear scan, which emulates BP by storing the transposed Jacobian in CSR and multiplying it with the gradient (as shown in Equation 2) explicitly via sparse matrix multiplication routines.
4.1 Convergence
Theoretically, our algorithm is a reconstruction of BP rather than an estimation, and is hence expected to reproduce exactly the same outputs. In practice, however, numerical differences can be introduced by the change in the order of matrix multiplications. We apply our algorithm to train LeNet-5 Lecun et al. (1998) on CIFAR-10 Krizhevsky (2009) to demonstrate that such numerical differences do not affect model convergence. We use a mini-batch size of 256 and the SGD Qian (1999) optimizer with a learning rate of 0.001 and a momentum of 0.9. The result plotted in Figure 6 shows that our algorithm has a negligible impact on convergence compared to the original BP.
4.2 Complexity Analysis
We leverage the following definitions to quantify the complexity of a parallel algorithm: (1) step complexity (S), which evaluates the number of steps on the critical path; (2) per-step complexity (C), which evaluates the runtime of a single step; and (3) work complexity (W), which evaluates the total number of steps executed by all workers. Assuming the system can be conceptualized as a parallel random-access machine (PRAM) Kruskal et al. (1990), the number of workers is p, and the size of the input array in Equation 4 is n, the step and work complexity of our algorithm can be derived as:

S = Θ(n/p + log p)    (5)
W = Θ(n)    (6)

compared to S = W = Θ(n) for the linear scan (which has the same step and work complexity as BP). Therefore, in an ideal scenario with an unbounded number of workers and unit per-step complexity, our algorithm reduces the latency from Θ(n) to Θ(log n). If, however, we account for the difference in per-step complexity between our algorithm (C_mm, dominated by matrix-matrix multiplications) and the baseline (C_mv, dominated by matrix-vector multiplications), our algorithm has a latency of Θ(C_mm log n) compared to Θ(C_mv n) in the baseline. There are two approaches to making our algorithm achieve a lower latency and better scaling than the baseline. First, we can reduce C_mm, which is reflected in leveraging the CSR sparse matrix multiplication routines analyzed in Section 4.3. Second, even without a lower C_mm, our algorithm can still outperform the baseline if C_mm/C_mv < n/log n, which can occur when n grows larger than the dimension of x_i. Possible use cases include computing the inverse kinematics Marschner and Shirley (2016) of a long skeleton (such as simulating the movement of a snake) found in 3D animation.
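The crossover between the two latencies can be illustrated with a small calculator, assuming a dense d × d Jacobian so that a matrix-matrix step costs roughly d times a matrix-vector step; all constants below are hypothetical and only the n versus log n shape matters:

```python
import math

# Illustrative latency model under the PRAM-style analysis above,
# assuming C_mm ~ d * C_mv for dense d x d Jacobians. The numbers are
# hypothetical; the point is the n / log n crossover.

def latencies(n, p, d, c_mv=1.0):
    c_mm = d * c_mv
    ours = c_mm * (n / p + math.ceil(math.log2(p)))   # Blelloch scan
    baseline = c_mv * n                               # linear scan / BP
    return ours, baseline

# With enough workers, the scan wins once n / log2(n) exceeds d:
# here ours = 64 * (1 + 12) = 832 steps versus baseline = 4096.
n, d = 4096, 64
ours, base = latencies(n, p=n, d=d)
assert ours < base
```

Shrinking d via sparsity (Section 4.3) shifts the crossover toward smaller n, which is the first of the two approaches described above.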
4.3 Microbenchmark Analysis
We prototype our algorithm with cuSPARSE Corporation (2018a) on an RTX 2070 GPU (Turing architecture) to perform sparse matrix-matrix multiplications (csrgemm) and apply it to the convolution layers of VGG-11 Simonyan and Zisserman (2015) (visualized in Figure 2) for training on CIFAR-10 Krizhevsky (2009). The latency of every step (named the per-step latency for brevity) is derived by averaging measurements via CUDA Events Corporation ([n. d.]) over 50 trials after 5 warm-up runs. Each trial performs the backward pass for a single iteration on a single image sample. The goal is to reduce the per-step latency of our method below the maximum per-step latency of the baselines, which is measured under the same settings. Figure 6(a) shows the corresponding result. We observe that many steps of sparse matrix-matrix multiplication meet the weaker baseline of the linear scan; however, a few have significantly higher per-step latencies.
Since the sparsity pattern can be determined ahead of time from the model architecture, the values in the CSR indptr and indices arrays Varoquaux et al. ([n. d.]) stay the same across batches and iterations. Therefore, the process of joining the indptr and indices arrays in a generic csrgemm routine Bank and Douglas (1993) could be omitted, and functions that transform the input data arrays into the output data array of each step can be generated and compiled before training begins, potentially reaching better performance than cuSPARSE Corporation (2018a). Considering the possibility of such a better implementation, we project the lower bound of the per-step latency by measuring the number of floating-point operations (FLOP) performed on the CSR data arrays for each step and dividing it by the theoretical FP32 throughput of the RTX 2070 Corporation (2018b). The result is shown in Figure 6(b), from which we observe that most steps meet the stronger baseline of PyTorch Autograd Paszke et al. (2017); however, three matrix-matrix multiplications have latencies significantly higher than the others. For every matrix-matrix multiplication, the numbers of non-zero elements (normalized by the size of the matrix) of the left and right operand matrices are plotted in Figure 9. We observe that sparsity helps reduce the per-step latency as expected, and that the three performance anomalies are caused by the loss of sparsity in the operands.
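The FLOP count behind the lower-bound projection can be sketched as follows, assuming csrgemm performs one multiply-add for each pairing of a non-zero of the left operand with a non-zero in the matching row of the right operand; the peak-throughput constant is a placeholder, not a measured value:

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Sketch of the FLOP count used for the lower-bound projection,
# assuming 2 FLOPs (one multiply-add) per contributing non-zero pair.
# Dividing by a device's peak FP32 throughput yields a projected latency.

def csrgemm_flops(A, B):
    A, B = A.tocsr(), B.tocsr()
    row_nnz_B = np.diff(B.indptr)   # nnz in each row of B
    # Each non-zero A[i, k] multiplies the whole of row k of B.
    return 2 * int(row_nnz_B[A.indices].sum())

A = sparse_random(256, 256, density=0.01, random_state=0)
B = sparse_random(256, 256, density=0.01, random_state=1)
flops = csrgemm_flops(A, B)
peak_fp32 = 7.5e12                  # hypothetical peak, FLOP/s
print(f"{flops} FLOPs -> projected {flops / peak_fp32 * 1e6:.4f} us")
```

Because the indptr and indices arrays are fixed across iterations, this count is also fixed, so the projection needs to be computed only once per step of the scan.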
From Algorithm 4, we observe that the values in the Jacobian of a convolution operator depend only on the filter weights. Thus, pruning the weights leads to a higher sparsity in the Jacobian, which reduces the per-step latency of sparse matrix-matrix multiplications. Nevertheless, despite recent advances in network pruning algorithms Han et al. (2015); See et al. (2016); He et al. (2017), there is no widely adopted software or hardware platform that can capitalize on pruning, as most techniques are evaluated through "masking simulation", which leads to the same (if not worse, due to masking) latency and memory usage. Therefore, we identify the retraining of pruned networks as a strong use case that can benefit from our algorithm. We demonstrate this impact through an experiment on VGG-11 Simonyan and Zisserman (2015): training on CIFAR-10 Krizhevsky (2009); pruning away a large fraction of the weights in all convolution and linear operators using the technique proposed by See et al. See et al. (2016); and retraining the pruned network. We choose the pruning percentage so that a similar validation accuracy can be reached after retraining for the same number of epochs as training. The lower-bound projection on the per-step latency is derived via the same method as above. Based on the result shown in Figure 6(c), we observe that all of the steps reach (with one anomaly that is close to) the stronger baseline of PyTorch Autograd Paszke et al. (2017).

5 Discussion
The runtime overhead added to the forward pass by generating the transposed Jacobian in CSR is not a primary concern because: for input-dependent Jacobians (such as those of ReLU and max-pooling), the number of non-zero elements is at most the input size, which leads to extremely high sparsity and negligible overhead; and for input-invariant Jacobians (such as that of convolution), the generation can be performed at any time during an iteration and is thus removed from the critical path, and the runtime can be further amortized by a larger batch size.
The sparsity loss in the transposed Jacobian matrices comes from two possible sources. First, based on Table 1, when the ratio of the filter size to the input height and width increases for convolutions, the sparsity of the Jacobian reduces. This can occur when the activation passes through max-pooling, where the height and width are reduced by the stride factor (halved in the case of VGG-11 Simonyan and Zisserman (2015)). This phenomenon is demonstrated in Figure 8(a) with the same experiment as in Section 4.3. The issue is less limiting when retraining pruned networks (Figure 8(b)). Second, after the multiplication of two sparse matrices, the product matrix can have more non-zero elements than the operand matrices. Therefore, the sparsity can reduce as the up-sweep phase progresses to deeper levels. This can be observed in Figure 8(b) by comparing the deeper levels with the previous ones. This challenge can be mitigated by balancing the number of levels that the up-sweep phase progresses with the sparsity of the product matrices at each level.
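The second source, fill-in, is easy to demonstrate with a tiny SciPy example (the matrices are arbitrary and chosen only to make the effect visible):

```python
import numpy as np
from scipy.sparse import csr_matrix

# A tiny demonstration of fill-in: the product of two sparse matrices
# can contain more non-zeros than either operand, which is why sparsity
# can degrade level by level during the up-sweep.

A = csr_matrix(np.array([[1.0, 0.0],
                         [1.0, 0.0]]))
B = csr_matrix(np.array([[1.0, 1.0],
                         [0.0, 0.0]]))
C = A @ B                      # every entry of C becomes non-zero
print(A.nnz, B.nnz, C.nnz)     # → 2 2 4
```

Each operand is half zeros, yet the product is fully dense, illustrating why the up-sweep may be cut short once the products stop being sparse.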
Another approach to reducing the per-step complexity of sparse matrix-matrix multiplication is to use Monte Carlo approximate matrix multiplication Drineas et al. (2006). However, a trade-off between the convergence rate and the per-step latency has to be made, as a lower per-step latency is coupled with an inaccurate matrix product, which leads to noticeable errors in the resulting gradients.
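For reference, the sampling scheme of Drineas et al. can be sketched as follows: sample c column/row outer products with probability proportional to the product of their norms, and rescale so the estimator stays unbiased (c is the knob trading accuracy against per-step work; this is our illustrative implementation, not code from the paper):

```python
import numpy as np

# Sketch of Monte Carlo approximate matrix multiplication (Drineas et
# al.): approximate A @ B by sampling c outer products of columns of A
# with rows of B. The sample count c is a hypothetical tuning knob.

def approx_matmul(A, B, c, rng):
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = norms / norms.sum()                 # near-optimal sampling probs
    ks = rng.choice(A.shape[1], size=c, p=p)
    C = np.zeros((A.shape[0], B.shape[1]))
    for k in ks:
        C += np.outer(A[:, k], B[k, :]) / (c * p[k])  # unbiased rescale
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 64))
B = rng.standard_normal((64, 8))
err = np.linalg.norm(approx_matmul(A, B, 32, rng) - A @ B)
```

Smaller c lowers the per-step cost but raises err, which is exactly the trade-off between per-step latency and gradient accuracy noted above.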
6 Conclusion
This paper explores a new direction to scale BP. We reformulate BP as a scan operation, which is scaled by our modified version of the Blelloch scan algorithm. While achieving superior scalability in terms of step complexity, the per-step complexity increases due to the multiplications of transposed Jacobian matrices during the up-sweep phase, as opposed to only multiplications of transposed Jacobians with gradient vectors in the baseline. We develop methods to efficiently generate the transposed Jacobians of various operators directly in CSR, and we leverage the CSR matrix multiplication routine (csrgemm) to reduce the per-step complexity. While csrgemm is an accessible tool for prototyping, we believe a more efficient implementation can be achieved since the sparsity pattern of the Jacobian is static and known ahead of time. We demonstrate that the retraining of pruned networks can potentially benefit from our algorithm. Future work includes investigating custom sparse matrix formats so that the per-step complexity can be reduced even further. We hope that our work will inspire radically new ideas and designs to improve distributed DNN training.
References
 pyt ([n. d.]) [n. d.]. TORCH.NN. https://pytorch.org/docs/stable/nn.html#torch.nn.MaxPool2d. ([n. d.]). Accessed: 2019-05-23.
 Bank and Douglas (1993) Randolph E. Bank and Craig C. Douglas. 1993. Sparse matrix multiplication package (SMMP). Advances in Computational Mathematics 1, 1 (01 Feb 1993), 127–137. https://doi.org/10.1007/BF02070824
 Blelloch (1990) Guy E. Blelloch. 1990. Prefix Sums and Their Applications. Technical Report CMU-CS-90-190. School of Computer Science, Carnegie Mellon University.
 Coleman et al. (2017) Cody A. Coleman, Deepak Narayanan, Daniel Kang, Tian Jiao Zhao, Jian Zhang, Luigi Nardi, Peter Bailis, Kunle Olukotun, Christopher Ré, and Matei A. Zaharia. 2017. DAWNBench: An End-to-End Deep Learning Benchmark and Competition.
 Corporation ([n. d.]) NVIDIA Corporation. [n. d.]. 5.5. Event Management. ([n. d.]). https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EVENT.html Accessed: 2019-05-20.
 Corporation (2018a) NVIDIA Corporation. 2018a. cuSPARSE :: CUDA Toolkit Documentation. (2018). https://docs.nvidia.com/cuda/cusparse/index.html Accessed: 2018-11-06.
 Corporation (2018b) NVIDIA Corporation. 2018b. NVIDIA Turing Architecture Whitepaper. Technical Report WP09183001_v01. https://www.nvidia.com/content/dam/enzz/Solutions/designvisualization/technologies/turingarchitecture/NVIDIATuringArchitectureWhitepaper.pdf
 Drineas et al. (2006) Petros Drineas, Ravi Kannan, and Michael W. Mahoney. 2006. Fast Monte Carlo Algorithms for Matrices I: Approximating Matrix Multiplication. SIAM J. Comput. 36, 1 (July 2006), 132–157. https://doi.org/10.1137/S0097539704442684
 Esmaeilzadeh et al. (2011) Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark Silicon and the End of Multicore Scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA ’11). ACM, New York, NY, USA, 365–376. https://doi.org/10.1145/2000064.2000108
 Han et al. (2015) Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both Weights and Connections for Efficient Neural Network. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., 1135–1143. http://papers.nips.cc/paper/5784-learning-both-weights-and-connections-for-efficient-neural-network.pdf
 Harlap et al. (2018) Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. 2018. PipeDream: Fast and Efficient Pipeline Parallel DNN Training. (2018). https://arxiv.org/abs/1806.03377
 He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. CoRR abs/1512.03385 (2015). arXiv:1512.03385 http://arxiv.org/abs/1512.03385
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity Mappings in Deep Residual Networks. CoRR abs/1603.05027 (2016). arXiv:1603.05027 http://arxiv.org/abs/1603.05027
 He et al. (2017) Yihui He, Xiangyu Zhang, and Jian Sun. 2017. Channel Pruning for Accelerating Very Deep Neural Networks. CoRR abs/1707.06168 (2017). arXiv:1707.06168 http://arxiv.org/abs/1707.06168
 Huang et al. (2016) Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. 2016. Densely Connected Convolutional Networks. CoRR abs/1608.06993 (2016). arXiv:1608.06993 http://arxiv.org/abs/1608.06993
 Huang et al. (2018) Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, and Zhifeng Chen. 2018. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. CoRR abs/1811.06965 (2018). arXiv:1811.06965 http://arxiv.org/abs/1811.06965
 Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. http://arxiv.org/abs/1412.6980
 Krizhevsky (2009) Alex Krizhevsky. 2009. Learning multiple layers of features from tiny images. Technical Report.
 Krizhevsky (2014) Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. CoRR abs/1404.5997 (2014). arXiv:1404.5997 http://arxiv.org/abs/1404.5997
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems  Volume 1 (NIPS’12). Curran Associates Inc., USA, 1097–1105. http://dl.acm.org/citation.cfm?id=2999134.2999257
 Kruskal et al. (1990) Clyde P. Kruskal, Larry Rudolph, and Marc Snir. 1990. A complexity theory of efficient parallel algorithms. Theoretical Computer Science 71, 1 (1990), 95–132. https://doi.org/10.1016/0304-3975(90)90192-K
 Lecun et al. (1998) Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. In Proceedings of the IEEE. 2278–2324.
 Marschner and Shirley (2016) Steve Marschner and Peter Shirley. 2016. Fundamentals of Computer Graphics, Fourth Edition (4th ed.). A. K. Peters, Ltd., Natick, MA, USA.
 Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic Differentiation in PyTorch. In NIPS Autodiff Workshop.
 Qian (1999) Ning Qian. 1999. On the Momentum Term in Gradient Descent Learning Algorithms. Neural Netw. 12, 1 (Jan. 1999), 145–151. https://doi.org/10.1016/S0893-6080(98)00116-6
 Rhu et al. (2016) Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W. Keckler. 2016. Virtualizing Deep Neural Networks for MemoryEfficient Neural Network Design. CoRR abs/1602.08124 (2016). arXiv:1602.08124 http://arxiv.org/abs/1602.08124
 Rumelhart et al. (1988) David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1988. Neurocomputing: Foundations of Research. MIT Press, Cambridge, MA, USA, Chapter Learning Representations by Backpropagating Errors, 696–699. http://dl.acm.org/citation.cfm?id=65669.104451
 Saad (2003) Y. Saad. 2003. Iterative Methods for Sparse Linear Systems (2nd ed.). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA.
 See et al. (2016) Abigail See, MinhThang Luong, and Christopher D. Manning. 2016. Compression of Neural Machine Translation Models via Pruning. CoRR abs/1606.09274 (2016). arXiv:1606.09274 http://arxiv.org/abs/1606.09274
 Simonyan and Zisserman (2015) Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. http://arxiv.org/abs/1409.1556
 Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going Deeper with Convolutions. In Computer Vision and Pattern Recognition (CVPR). http://arxiv.org/abs/1409.4842
 Varoquaux et al. ([n. d.]) Gaël Varoquaux, Emmanuelle Gouillart, and Olav Vahtras. [n. d.]. Compressed Sparse Row Format (CSR). ([n. d.]). https://scipy-lectures.org/advanced/scipy_sparse/csr_matrix.html Accessed: 2019-05-23.
 Zhu et al. (2018) Hongyu Zhu, Mohamed Akrout, Bojian Zheng, Andrew Pelegris, Anand Jayarajan, Amar Phanishayee, Bianca Schroeder, and Gennady Pekhimenko. 2018. Benchmarking and Analyzing Deep Neural Network Training. In 2018 IEEE International Symposium on Workload Characterization, IISWC 2018, Raleigh, NC, USA, September 30  October 2, 2018. 88–100. https://doi.org/10.1109/IISWC.2018.8573476