I Introduction
Deep learning [1] has already proven its usability in a variety of applications [2]. To achieve better results or to deal with more complex problems, networks keep growing in scale. As large network structures require substantial computational power, memory throughput, and storage capacity, training neural networks poses a great challenge to the underlying hardware. Since single-processor efficiency has reached the physical limits of the hardware, scaling DNN training over parallel supercomputers becomes a good solution to satisfy the computation and storage requirements.
Sunway TaihuLight [3], a supercomputer that currently ranks first in the world, is powered by SW26010 many-core processors with a total computing capacity of over 100 PFlops. The SW26010 processor is designed with on-chip heterogeneous techniques and provides a peak double-precision performance of 3.02 TFlops. Over 40,000 SW26010 processors are connected through a high-bandwidth customized hierarchical network.
Our previous work [4] has already explored the possibility of developing highly efficient convolution subroutines on SW26010. However, there remain great challenges in scaling the entire DNN training to larger clusters. First, as mainstream DNN frameworks are usually designed for CPU-GPU hybrid systems, straightforward migrations or implementations of these frameworks on the brand-new architecture cannot achieve satisfactory performance. Redesigning a variety of basic DNN layers according to the characteristics of the SW26010 processor is the only way to unleash the potential performance of the supercomputer. Second, parallel training suffers from frequent communication and imbalanced operations among a large number of nodes. A customized communication strategy is necessary to take advantage of the network topology of Sunway TaihuLight. Third, the parallel disk I/O of the input data can also become a bottleneck in large-scale DNN training.
To solve the above challenges and facilitate network training tasks on TaihuLight, we redesign the widely-used Caffe framework and customize a set of routines to best fit the unique heterogeneous architecture of SW26010, and further scale it to a large number of nodes. Our main contributions are as follows:

We point out a set of general principles for designing parallel algorithms that fit the different aspects of the SW26010 hardware characteristics.

A Caffe-based framework for the SW26010 processor, namely swCaffe, is developed. It incorporates a set of optimization strategies and redesigns a variety of DNN layers to squeeze as much performance as possible from the SW26010 processors.

We put forward a parallel synchronous SGD method to scale swCaffe on multiple nodes with highly-efficient parameter synchronization and a parallel I/O strategy.

swCaffe is open-sourced at [5]. It maintains the same interfaces as Caffe but can be deployed more efficiently on the TaihuLight system.
The rest of the paper is organized as follows. In Section II, we describe the Sunway TaihuLight architecture and DNN training methods as background. In Section IV, we describe the principles for parallel algorithm design on SW26010 and the optimization methods of swCaffe for DNN layers based on these principles. In Section V, we present our methodology to scale swCaffe on multiple nodes. In Section VIII, we conclude with a brief discussion of future work.
II Background
The Sunway TaihuLight supercomputer is composed of 40,960 nodes with a total of 10,649,600 cores. The nodes are connected through a customized network.
II-A SW26010 Many-core Processor
The general architecture of the SW26010 is shown in Figure 1. The SW26010 processor includes 4 core-groups (CGs) connected via the network on chip (NoC). Each CG includes one management processing element (MPE), one computing processing element (CPE) cluster with 8×8 CPEs, and one memory controller (MC). The processor connects to outside devices through a system interface (SI).
Each CG has its own memory space (8 GB of DDR3 memory each), which is connected to the MPE and the CPE cluster through the MC. The four core groups connect to four 128-bit DDR3 memory controllers with a theoretical memory bandwidth of 136 GB/s.
The MPE and CPE are both 64-bit RISC cores running at 1.45 GHz with support for 256-bit SIMD instructions. Each MPE has a 32 KB L1 data cache, a 32 KB L1 instruction cache, and a 256 KB L2 cache, while each CPE has a 16 KB instruction cache and a 64 KB local directive memory (LDM), also known as Scratch Pad Memory (SPM), which must be explicitly controlled by the user.
The 8×8 CPEs are able to communicate with each other via register buses. CPEs that fall into the same row or the same column can send messages to each other through the fast register communication mechanism. In one cycle, the registers support up to 256-bit broadcast or P2P communication between two CPEs.
II-B Network Topology of Sunway TaihuLight
The customized network of TaihuLight is divided into 2 levels, namely a fat tree at the top and a supernode network at the bottom. The central switching network is responsible for communication between different supernodes and is designed to use only a quarter of the potential bandwidth instead of being a fully connected network. Each supernode has 256 nodes connected by a high-bandwidth network using a static destination-based strategy as its routing policy. TaihuLight uses FDR 56 Gbps network interface cards (NICs) and provides a 70 TB/s bisection network bandwidth in total. The theoretical bandwidth between any two nodes is 16 GB/s. However, only 12 GB/s is achieved, with a latency at the microsecond level, when nodes communicate through the Message Passing Interface (MPI).
II-C DNN Training Process and Frameworks
Deep learning is used to solve the following optimization problem:
$$\min_{w} f(w) = \frac{1}{N} \sum_{i=1}^{N} f_i(w) \tag{1}$$
where $w$ is the model parameters (or weights) we are looking for; $N$ is the number of samples; $f$ is typically in the form of a DNN; and $f_i(w)$ is the loss function of the $i$-th sample. The stochastic gradient descent (SGD) method is the de facto method for DNN training. A typical implementation of SGD iterates forward-backward propagations. The forward propagation step uses a mini-batch of training data as input to calculate the intermediate activations after each layer, while the backward propagation step uses the intermediate activations to perform gradient computation. The gradients with respect to the model parameters are then applied to the model after each backward propagation step.
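As a minimal sketch of the forward-backward iteration described above, the following NumPy code runs mini-batch SGD on a linear least-squares model. The model, the function name `sgd_train`, the learning rate, and the batch size are illustrative assumptions; none of them is taken from Caffe or swCaffe.

```python
import numpy as np

def sgd_train(x, y, lr=0.1, batch_size=32, epochs=200, seed=0):
    """Mini-batch SGD for a linear least-squares model (illustrative only)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = x.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        order = rng.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            # Forward propagation: predictions for the current mini-batch.
            pred = x[idx] @ w
            # Backward propagation: gradient of the mean squared error.
            grad = 2.0 * x[idx].T @ (pred - y[idx]) / len(idx)
            # Apply the gradient to the model parameters.
            w -= lr * grad
    return w

# Synthetic, noise-free data so the fitted weights can be checked.
rng = np.random.default_rng(1)
x_train = rng.normal(size=(256, 3))
true_w = np.array([1.0, -2.0, 0.5])
y_train = x_train @ true_w
w_fit = sgd_train(x_train, y_train)
```

In a DNN the forward and backward steps span many layers, but the loop structure (sample a mini-batch, propagate forward, propagate backward, update) stays the same.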
Caffe [6] is an open-source software framework used for DNN training. It is written in C++ and widely adopted in research experiments and industry deployments. Caffe implements DNN training with three major components, namely layers, nets, and solvers, corresponding to three optimization levels. Layers implement the algorithms of different neural network layers and relate to algorithm-level optimization targeting different underlying hardware and platforms. The net defines the network structure of a DNN model and implements the forward and backward propagations, so it allows optimizations of one training iteration, such as process parallelization and memory optimizations. Solvers control the network training process and implement the parameter tuning algorithms, such as Stochastic Gradient Descent (SGD). Therefore, optimizations for network training algorithms and the distributed training process should be carried out in the solvers. The original Caffe framework is designed for standalone training on one HPC server and only supports GPU accelerators. In order to efficiently map the framework onto the Sunway TaihuLight supercomputer, we need to refactor or redesign the implementation of the above three components, so as to fit the unique architecture of the processors and to support distributed training over multiple nodes.
III Design and Implementation of DNN Framework on SW26010
We first present principles of parallel algorithm design on SW26010 and then introduce our strategies to redesign the computing kernels of different DNN layers on SW26010 under the guidelines of these principles.
III-A Principles of Parallel Algorithm Design on SW26010
The SW26010 is a brand-new processor, which is totally different from other many-core processors used for DNN training, such as GPUs and Intel Xeon Phi coprocessors. Table I compares different aspects of SW26010, GPU, and KNL. The methodologies for accelerating neural layers on mainstream architectures (GPU, KNL) are not suitable for the SW26010 architecture. Migrating a framework that runs on GPU or KNL to SW26010 in a straightforward way often results in extremely poor performance.
A clear understanding of the advantages and disadvantages of the hardware architecture is of great importance for extracting the potential performance of Sunway TaihuLight. We therefore propose a set of principles as guidelines for designing high-performance applications.
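Table I: Comparison of SW26010, Nvidia K40m, and Intel KNL.

| Specifications | SW26010 | Nvidia K40m | Intel KNL |
| --- | --- | --- | --- |
| Release year | 2014 | 2013 | 2016 |
| Memory bandwidth (GB/s) | 128 | 288 | 475 |
| Single-precision performance (TFlops) | 3.02 | 4.29 | 6.92 |
| Double-precision performance (TFlops) | 3.02 | 1.43 | 3.46 |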
Principle 1: Fully utilize the CPE mesh for computation-intensive tasks. In each CG, the CPE cluster provides a theoretical computing capacity of 742.4 GFlops, while the MPE provides only 11.6 GFlops. So the most important step to improve performance is to offload the computationally intensive kernels to the CPE mesh. Different levels of parallelism can also be carefully exploited within the CPE cluster:

The parallelism among the 64 CPEs is exploited by orchestrating data-independent tasks on each CPE simultaneously.

For each CPE, data-level parallelism can be exploited by using 256-bit vector registers for SIMD operations.

In addition, we can exploit instruction-level parallelism from the two instruction pipelines, the floating-point pipeline and the memory access pipeline. Each pipeline issues its own instructions in order, while independent instructions on different pipelines can be issued out of order.
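The peak figures quoted in Principle 1 can be cross-checked with a few lines of arithmetic. The sketch below only uses numbers stated in the text (742.4 and 11.6 GFlops per CG, 4 CGs, 64 CPEs at 1.45 GHz); the variable names are illustrative.

```python
# Figures quoted in the text.
CPE_CLUSTER_GFLOPS = 742.4   # per core group
MPE_GFLOPS = 11.6            # per core group
CORE_GROUPS = 4
CPES_PER_CLUSTER = 64
CLOCK_GHZ = 1.45

# Peak of the whole processor: four core groups, each with one MPE and one CPE cluster.
processor_peak_tflops = CORE_GROUPS * (CPE_CLUSTER_GFLOPS + MPE_GFLOPS) / 1000.0

# Floating-point operations each CPE must retire per cycle to reach the cluster peak.
flops_per_cycle_per_cpe = CPE_CLUSTER_GFLOPS / (CPES_PER_CLUSTER * CLOCK_GHZ)

# Fraction of a core group's peak that lives in the CPE cluster.
cpe_share = CPE_CLUSTER_GFLOPS / (CPE_CLUSTER_GFLOPS + MPE_GFLOPS)
```

The result is about 3.02 TFlops for the processor, with more than 98% of each CG's peak in the CPE cluster, which is why offloading to the CPE mesh comes first. The 8 flops per cycle per CPE would be consistent with a 256-bit vector of four doubles combined with fused multiply-add, although that breakdown is an assumption and is not stated in the text.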
Principle 2: Always use LDM as the intermediary cache for data movements to and from DDR3 memory. In each CG, the memory controller connects both the MPE and the CPE cluster to the DDR3 memory, which means the MPE and the CPEs share the theoretical memory bandwidth of 32 GB/s. According to the benchmark in Figure 2, the DMA bandwidth saturates at around 28 GB/s for both read and write. However, the Memory-to-MPE and Memory-to-LDM bandwidths differ greatly. The bandwidth of copying data from one DDR3 memory space to another through Memory-to-MPE is only 9.9 GB/s. As a result, it is always preferable to use LDM as the intermediary cache rather than accessing main memory from the CPEs directly.
Principle 3: Increase available memory bandwidth by transferring large data blocks. The limited aggregate memory bandwidth and the high computing power lead to an extremely high flop-per-byte ratio compared with the ratios of 14.90 and 14.56 for K40m and KNL, respectively. To achieve satisfactory DMA bandwidth, we should keep the following points in mind during algorithm design. First, data transfers should be conducted by the 64 CPEs together. Second, fine-grained memory access from the CPE cluster should be avoided as much as possible. The size of the data transferred by each CPE should be larger than 2 KB so that the data transfer time can hide the LDM transfer latency of hundreds of cycles. The data block size for strided access should be at least 256 bytes to achieve satisfactory bandwidth.
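The flop-per-byte ratios for K40m and KNL can be reproduced from Table I by dividing single-precision peak performance by memory bandwidth. The sketch below applies the same formula to SW26010; the resulting value is a derived figure under that assumption and may differ from the number the authors report, for instance if they use measured rather than nominal bandwidth.

```python
# (single-precision peak in TFlops, memory bandwidth in GB/s), taken from Table I.
specs = {
    "SW26010": (3.02, 128.0),
    "K40m": (4.29, 288.0),
    "KNL": (6.92, 475.0),
}

def flop_per_byte(peak_tflops, bandwidth_gbs):
    """Peak floating-point operations available per byte of memory traffic."""
    return peak_tflops * 1000.0 / bandwidth_gbs

ratios = {name: flop_per_byte(*values) for name, values in specs.items()}
```

A higher ratio means a kernel must perform more arithmetic per byte fetched before it stops being bandwidth-bound, which motivates the large, contiguous DMA transfers recommended in Principle 3.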
Principle 4: Reduce memory access by register-level communication among CPEs. Besides increasing the available bandwidth, we can also reduce the amount of data transferred between LDM and memory to improve performance. Register-level communication (RLC), which enables P2P/broadcast 256-bit data communication at the register level among CPEs, is a unique hardware characteristic of SW26010. Direct RLC is allowed only between CPEs within the same row or the same column, following an anonymous producer-consumer pattern with FIFO sending/receiving buffers (i.e., the send instruction is asynchronous, and the sender/receiver stalls if the sending/receiving buffer is full/empty). If RLC transfers are fully pipelined, the overall P2P and broadcast bandwidths can reach 2549 GB/s and 4461 GB/s, respectively [7]. In this way, we can reuse the data in other LDMs on the same row/column of the CPE cluster to reduce the bandwidth requirements between main memory and the LDMs.
IV Parallel Design of DNN Layers
A Deep Neural Network consists of different layers. We present our optimization methods for the most frequently used layers in DNN applications, according to the principles pointed out in the previous section.
IV-A Matrix-Multiplication Layer
The inner-product layers and other more complicated layers, such as Long Short-Term Memory (LSTM) layers, mainly involve General Matrix-to-Matrix Multiplication (GEMM) operations. If data locality is fully exploited and near-optimal memory bandwidth is achieved, GEMM operations can be implemented with a high flop-to-byte ratio. To implement GEMM on the CPE cluster, we use the register communication proposed in [4][8] to increase data locality in LDM. Assume we intend to perform the GEMM operation $C \mathrel{+}= A \times B$, where matrices $A$, $B$, and $C$ are of sizes $m \times k$, $k \times n$, and $m \times n$, respectively, and can all fit into the 64 KB LDM. Each matrix is evenly divided into $8 \times 8$ blocks, one per CPE. CPE$(i,j)$ is responsible for computing block $C(i,j)$ of size $m/8 \times n/8$, which requires an $m/8 \times k$ tile of $A$ and a $k \times n/8$ tile of $B$. Note that, in this case, $7/8$ of both tiles of $A$ and $B$ required by this CPE are resident in the remote LDMs of other CPEs. According to Principle 4, we can take advantage of the row and column register communication scheme to fetch remote data, as CPEs in the same row of the cluster share the tile of $A$, and CPEs in the same column of the cluster share the tile of $B$.
The GEMM operation can be finished in 8 steps as $C(i,j) \mathrel{+}= \sum_{t=0}^{7} A(i,t) \times B(t,j)$, where $(i,j)$ indicates the coordinate in the cluster of the CPE on which the data is resident. For each time step $t$, CPE$(t,j)$ loads the data of $B(t,j)$ from LDM and broadcasts it to the other CPEs in the same column by column register communication. Similarly, CPE$(i,t)$ loads the data of $A(i,t)$ from LDM and broadcasts it to the CPEs in the same row. Thus, CPE$(i,j)$ can receive the data of both CPE$(i,t)$ and CPE$(t,j)$, and the computation of $C(i,j) \mathrel{+}= A(i,t) \times B(t,j)$ can be done in each time step. Figure 3 illustrates the register communication operations when $t$ is 2. This is the optimal design with the highest flop-to-byte ratio, as we only need to fetch the matrices from memory to LDM once.
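The 8-step scheme can be simulated on a host machine to check that the broadcasts cover every partial product. The sketch below is a NumPy model, not Sunway code: it assumes the usual block-matrix convention in which CPE (i, j) sits in mesh row i and column j and holds block (i, j) of each matrix, and the function name `mesh_gemm` is illustrative.

```python
import numpy as np

GRID = 8  # the CPE cluster is an 8 x 8 mesh

def mesh_gemm(a, b):
    """Simulate C = A x B on an 8 x 8 mesh using one broadcast round per step."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and m % GRID == 0 and k % GRID == 0 and n % GRID == 0
    bm, bk, bn = m // GRID, k // GRID, n // GRID
    # Simulated CPE (i, j) holds one block of A, B and C in its "LDM".
    a_blk = [[a[i*bm:(i+1)*bm, j*bk:(j+1)*bk] for j in range(GRID)] for i in range(GRID)]
    b_blk = [[b[i*bk:(i+1)*bk, j*bn:(j+1)*bn] for j in range(GRID)] for i in range(GRID)]
    c_blk = [[np.zeros((bm, bn)) for _ in range(GRID)] for _ in range(GRID)]
    for t in range(GRID):
        for i in range(GRID):
            for j in range(GRID):
                a_recv = a_blk[i][t]  # broadcast along mesh row i by CPE (i, t)
                b_recv = b_blk[t][j]  # broadcast along mesh column j by CPE (t, j)
                c_blk[i][j] += a_recv @ b_recv
    return np.block(c_blk)

rng = np.random.default_rng(0)
a_demo = rng.normal(size=(16, 24))
b_demo = rng.normal(size=(24, 32))
c_demo = mesh_gemm(a_demo, b_demo)
```

Each block of A and B is read from its owner's LDM exactly once per step in which it is broadcast, so no operand is fetched from main memory more than once.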
Blocking techniques are applied to matrices that are too large to fit into the LDM. As the memory-LDM bandwidth is critical for GEMM performance, the contiguous data accessed by each CPE in a matrix block should be large enough according to Principle 3. As a result, the dimensions of the matrices should be large enough to obtain good memory bandwidth.
SW26010 provides no inherent support for single-precision floating-point operations, which is the default precision option used in DNN training. As there is no instruction supporting RLC for single-precision data in the instruction set of SW26010, we always perform RLC operations with double-precision data and convert elements inline between double precision and single precision with SIMD instructions.
IV-B Convolutional Layer
The convolutional layers are the most compute-intensive parts when training Convolutional Neural Networks (CNNs). Both time-domain methods with GEMM operations [9] and frequency-domain methods with FFT operations [10] have been proposed to optimize convolutional layers on GPU. Because GEMM operations can be well optimized on the CPE cluster with the register-level communication mentioned previously, we adopt time-domain transformation methods. To support different convolutional layer parameter configurations in real CNN applications, we propose a mixed strategy combining the explicit GEMM plan used in the original Caffe and the implicit GEMM plan proposed in [4].
IV-B1 Explicit GEMM transformation
To map convolution operations to GEMM and reuse the GEMM routine mentioned in Sec. IV-A, we adopt the explicit GEMM transformation proposed for the original Caffe. In this case, input tensors are first transformed into matrices by im2col (image-to-column) operations before the GEMM operations during forward propagation, while col2im (column-to-image) operations are performed after the GEMM operations during backward propagation. Assuming a convolutional layer has filters of size $K \times K$, the im2col operation transfers a 3D multi-channel image tensor of size $N_i \times R_i \times C_i$ to a 2D matrix of size $(K \times K \times N_i) \times (R_o \times C_o)$. $C_o$ and $R_o$ are the column and row sizes of the output image, where $C_o = (C_i - K)/S + 1$ and $R_o = (R_i - K)/S + 1$, and $S$ is the convolution stride. $N_i$ is the input channel number, $N_o$ is the filter channel number, and $K$ is the filter size. The batch-size dimension is also introduced for blocking, which brings more optimization space for GEMM blocking. As the filter tensor can be viewed as a matrix of size $N_o \times (K \times K \times N_i)$, the GEMM operation is performed on two matrices with a common dimension of size $K \times K \times N_i$. Im2col and col2im have irregular memory access patterns. The convolutional layers in backward propagation transfer the matrix back to a tensor with col2im, which performs the reverse memory movement. As indicated by Principle 4, the irregular memory accesses of im2col and col2im should be implemented with DMA on the CPE cluster. Figure 4 shows our im2col and col2im plan on one CPE. During the im2col process, each CPE reads one row of an input image into an LDM buffer with a DMA get operation. After adding the padding, each CPE writes the resulting lines of data back into memory. Block sizes are critical for memory bandwidth in the GEMM operation.
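The explicit transformation can be illustrated with a small NumPy sketch that unfolds one image with im2col, multiplies it by the flattened filters, and compares the result with a direct sliding-window convolution. It assumes no padding and a single image (no batch dimension); the function names are illustrative and do not correspond to Caffe or swCaffe routines.

```python
import numpy as np

def im2col(image, k, stride=1):
    """Unfold an (Ni, Ri, Ci) image into an (Ni*k*k, Ro*Co) matrix."""
    ni, ri, ci = image.shape
    ro = (ri - k) // stride + 1
    co = (ci - k) // stride + 1
    cols = np.empty((ni * k * k, ro * co), dtype=image.dtype)
    row = 0
    for c in range(ni):
        for kr in range(k):
            for kc in range(k):
                # All input pixels that meet filter element (kr, kc) of channel c.
                patch = image[c, kr:kr + stride * ro:stride, kc:kc + stride * co:stride]
                cols[row] = patch.reshape(-1)
                row += 1
    return cols, ro, co

def conv_explicit_gemm(image, filters, stride=1):
    """Convolution as one GEMM: (No, Ni*k*k) x (Ni*k*k, Ro*Co)."""
    no, ni, k, _ = filters.shape
    cols, ro, co = im2col(image, k, stride)
    out = filters.reshape(no, ni * k * k) @ cols
    return out.reshape(no, ro, co)

def conv_direct(image, filters, stride=1):
    """Reference sliding-window convolution used only for checking."""
    no, ni, k, _ = filters.shape
    _, ri, ci = image.shape
    ro = (ri - k) // stride + 1
    co = (ci - k) // stride + 1
    out = np.zeros((no, ro, co))
    for r in range(ro):
        for c in range(co):
            window = image[:, r * stride:r * stride + k, c * stride:c * stride + k]
            out[:, r, c] = np.tensordot(filters, window, axes=([1, 2, 3], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
image_demo = rng.normal(size=(3, 7, 7))
filters_demo = rng.normal(size=(4, 3, 3, 3))
gemm_out = conv_explicit_gemm(image_demo, filters_demo, stride=2)
direct_out = conv_direct(image_demo, filters_demo, stride=2)
```

The unfolded matrix repeats each input pixel up to k*k times, which is the extra memory traffic that makes im2col and col2im expensive for layers with large images.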
IV-B2 Implicit GEMM transformation
The time overheads of im2col and col2im are not negligible for some layers. The implicit GEMM transformation proposed in our previous work [4] is integrated into swCaffe to implement convolutional layers by blocking on the image width and the input and output channel dimensions to increase data reuse in LDM. However, when the input and output channel numbers are smaller than 64, the performance of the implicit method degrades substantially, because the amount of data in LDM with small channel numbers is not large enough to support 256-bit SIMD and register communication operations.
Real applications apply convolutional layers to input images after zero padding. Since the padding operation is already combined with the im2col/col2im operations in the explicit scheme, we also propose a padding optimization for convolutional layers with the implicit GEMM transformation, which uses a coordinate mapping technique to avoid explicit padding operations. Details of padding and other optimization techniques for convolutional layers can be found in our technical report released with the source code [5].
IV-C Tensor Transformation Layer
The data of the explicit and implicit GEMM transformations are arranged differently. The explicit GEMM transformation plan keeps input tensors, output tensors, and filters in the default data layout used by the other layers, while the implicit GEMM transformation plan uses a different layout for its input tensors, output tensors, and filters. Note that the convolutional layers that can be accelerated with the implicit transformation are gathered together. The filters are local variables of these layers, and their layout does not affect other layers. In swCaffe, we add a tensor transformation layer, which has a 4D tensor input and a 4D tensor output, with dimension transposition between the two data layouts.
The tensor transformation in trans_layer is mainly irregular memory movement and should also be accelerated on the CPE cluster. Strided DMA access is adopted to load a block of the tensor into LDM. SIMD shuffle instructions are used to transform the data after loading it from LDM into registers.
IV-D Pooling Layer
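Functionally, such a transformation layer is a 4D transposition. The sketch below shows the idea in NumPy under an assumed pair of layouts: (batch, channel, row, column) for the default layout and (row, column, channel, batch) for the implicit plan. These two layouts and the function names are assumptions made for illustration; the concrete layouts used by swCaffe should be taken from its source code.

```python
import numpy as np

def to_implicit_layout(x_bnrc):
    """Assumed default layout (B, N, R, C) -> assumed implicit layout (R, C, N, B)."""
    return np.ascontiguousarray(x_bnrc.transpose(2, 3, 1, 0))

def to_default_layout(x_rcnb):
    """Inverse transposition: (R, C, N, B) -> (B, N, R, C)."""
    return np.ascontiguousarray(x_rcnb.transpose(3, 2, 0, 1))

x_demo = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)
y_demo = to_implicit_layout(x_demo)
x_back = to_default_layout(y_demo)
```

Calling `np.ascontiguousarray` forces the data to be physically rearranged, which corresponds to the memory movement that the layer performs with strided DMA and SIMD shuffles; a bare `transpose` would only change the view.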
The pooling layer partitions the input image into a set of non-overlapping tiles and, for each tile, outputs the maximum or average value of the elements inside. Since pooling layers are dominated by massive memory copy operations, they should be implemented with DMA operations on the CPE cluster. We design different movement strategies according to the sizes of the input images. Assume the tile size is $K \times K$. According to Principle 3, we should make the contiguous data size of data blocks as large as possible. Most of the time, each CPE is in charge of the pooling operation for multiple rows of the input image. When $K$ rows of the image cannot fit in LDM, we load as many columns into LDM as possible. In this case, the data needed in LDM is not stored contiguously in memory, and strided DMA is used to access it.
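The arithmetic of non-overlapping pooling is simple, which is why the layer is bound by data movement. A minimal NumPy sketch for one 2D image follows; the function name `pool2d` and the restriction to image sizes divisible by the tile size are assumptions for brevity.

```python
import numpy as np

def pool2d(image, k, mode="max"):
    """Pool a 2D image over non-overlapping k x k tiles."""
    rows, cols = image.shape
    assert rows % k == 0 and cols % k == 0, "image must split evenly into tiles"
    # Axes 1 and 3 index the position inside each tile.
    tiles = image.reshape(rows // k, k, cols // k, k)
    if mode == "max":
        return tiles.max(axis=(1, 3))
    if mode == "avg":
        return tiles.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'avg'")

image_demo = np.arange(16.0).reshape(4, 4)
max_out = pool2d(image_demo, 2, "max")   # [[5, 7], [13, 15]]
avg_out = pool2d(image_demo, 2, "avg")   # [[2.5, 4.5], [10.5, 12.5]]
```

Each output element needs k consecutive rows of the input, so a CPE that loads whole rows gets long contiguous transfers, while a CPE that can only hold a column slice must fall back to strided access.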
V Scaling DNN framework on the TaihuLight
In this section, we describe our design to scale swCaffe on multiple processors.
V-A Optimization for Communication of Model Parameters
In our work, we adopt a data-parallel scheme with the synchronous Stochastic Gradient Descent (SSGD) algorithm to scale swCaffe, which is widely adopted in HPC clusters and supercomputer systems [11][12] considering the high quality of the network and the balanced performance per node. There are mainly two methods to implement model parameter synchronization in SSGD. One method uses parameter servers [13] as the intermediary, which store the parameters among several servers. The parameter server scheme is unable to sufficiently exploit the bandwidth potential of the fully-connected network infrastructure of Sunway TaihuLight: since each processor has only one network port, receiving gradients simultaneously from a large number of workers could become a bottleneck in the parameter server design, and the bandwidth between workers is not fully used. The other method is to perform allreduce operations on the gradients among all nodes and to update the parameters on each node independently [12]. We adopt the latter approach to take advantage of the MPI routines optimized for the supercomputer system, as the former approach is designed for synchronization over low-bandwidth network infrastructures, such as Ethernet. Our parallel synchronous SGD algorithm is described in Algorithm 1.
As shown in Fig. 5, we use multi-threading among the 4 CGs inside one processor to calculate the averages of the gradients. At the beginning of each iteration, we call pthread_create() to start 4 threads on the 4 CGs. Each thread is able to launch lightweight CPE threads to load work tasks onto its CPE cluster, in order to perform forward-backward propagation on 1/4 of the data in that mini-batch. Afterwards, each CG obtains its local parameter gradients, and CG 0 sums them together to obtain the average gradients of this mini-batch. To synchronize the sub-threads, we implement our own synchronization function, which is based on a handshake (initiation-confirmation) strategy through a semaphore stored in shared memory.
To synchronize the gradients across nodes, we implement a customized allreduce communication. The default MPI_Allreduce routine provided with the compiler, which is modified from Open MPI (https://www.open-mpi.org/), cannot be directly applied to swCaffe for three main reasons. First, the Sunway network is characterized by high latency, so MPI_Allreduce routines designed for low-latency network hardware are no longer suitable. As shown in Fig. 6, we compare the Sunway network with an Infiniband FDR network. While achieving bandwidth similar to Infiniband, the Sunway network has higher latency when the message size is larger than 2 KB. Second, the communication pattern in MPI_Allreduce is not aware of the topology of the hierarchical network described in Sec. II-B. If every node in one supernode performs point-to-point communication with a different node in another supernode, the interconnect across supernodes becomes oversubscribed. As shown in Fig. 6, the oversubscribed bandwidth between two supernodes is around 1/4 of the full bandwidth. Third, the sum operation after data gathering in MPI_Allreduce is performed on the MPEs, which is inefficient when the number of parameters is large.
We improve the allreduce operation considering the high latency and topological properties of the network. Before introducing our customized algorithm, we describe the cost model proposed in [14], which we use to evaluate our allreduce in terms of latency and bandwidth use. We assume that the time taken to send a message between any two nodes can be modeled as $\alpha + n\beta$, where $\alpha$ is the latency (or startup time) per message, independent of message size, $\beta$ is the transfer time per byte, and $n$ is the number of bytes transferred. More specifically, $\beta_1$ is the transfer time per byte inside one supernode and $\beta_2$ is the transfer time per byte across supernodes when the bandwidth is oversubscribed. In the case of reduction operations, we define $\gamma$ to be the computation cost per byte for performing the reduction operation locally on any node. We also define $p$ to be the total number of nodes in the allreduce operation and $q$ to be the number of nodes in one supernode.
Considering the high-latency characteristics of the Sunway network, the popular ring-based algorithms [15], which have an $O(p)$ latency term, are not our best candidates. We choose a binomial-tree-based algorithm used in MPICH [14], which has an $O(\log p)$ latency term, as the baseline to improve. An allreduce operation is implemented as a reduce-scatter phase followed by an all-gather phase. Instead of storing all results at the root node, the reduce-scatter phase adopts the Recursive Halving algorithm to scatter the reduction results among all nodes. In the first step, each node exchanges data with a node that is a distance $p/2$ away. Each node sends the data needed by all nodes in the other half, receives the data needed by all nodes in its own half, and performs the reduction operation on the received data. In the second step, each node exchanges data with a node that is a distance $p/4$ away. This procedure continues recursively, halving the data communicated at each step, for a total of $\lg p$ steps. The Recursive Doubling algorithm, analogous to the Recursive Halving algorithm, is adopted in the all-gather phase to collect the partial results from the other nodes on each node. In the first step, nodes that are a distance 1 apart exchange their data. In the second step, nodes that are a distance 2 apart exchange their own data as well as the data they received in the previous step, which has a size of $2n/p$ in total. In the third step, nodes that are a distance 4 apart exchange their own data as well as the data they received in the previous two steps, which has a size of $4n/p$ in total. In the last step, nodes exchange messages of size up to $n/2$ with the nodes that are a distance $p/2$ apart. A simple example of such an allreduce implementation is illustrated on the left side of Fig. 7.
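The two phases can be checked with a small single-process simulation. The sketch below models each node as a NumPy array and pairs node r with node r XOR d at distance d; it assumes the node count is a power of two and the vector length is divisible by it. The function name and the bookkeeping are illustrative, and no MPI calls are involved.

```python
import numpy as np

def allreduce_halving_doubling(vectors):
    """Simulate allreduce (sum) as recursive halving followed by recursive doubling."""
    p = len(vectors)
    assert p > 0 and p & (p - 1) == 0, "node count must be a power of two"
    n = len(vectors[0])
    assert n % p == 0
    chunk = n // p
    buf = [np.array(v, dtype=float) for v in vectors]
    lo, hi = [0] * p, [p] * p   # chunk range each node is currently responsible for
    sent = []                   # elements sent per node at each step
    # Reduce-scatter by recursive halving: distances p/2, p/4, ..., 1.
    d = p // 2
    while d >= 1:
        old = [b.copy() for b in buf]
        for r in range(p):
            partner = r ^ d
            mid = (lo[r] + hi[r]) // 2
            keep = (lo[r], mid) if r & d == 0 else (mid, hi[r])
            s = slice(keep[0] * chunk, keep[1] * chunk)
            # Receive the partner's copy of the kept half and reduce it locally.
            buf[r][s] = old[r][s] + old[partner][s]
            lo[r], hi[r] = keep
        sent.append((hi[0] - lo[0]) * chunk)
        d //= 2
    # All-gather by recursive doubling: distances 1, 2, ..., p/2.
    d = 1
    while d < p:
        old = [b.copy() for b in buf]
        old_lo, old_hi = lo[:], hi[:]
        for r in range(p):
            partner = r ^ d
            s = slice(old_lo[partner] * chunk, old_hi[partner] * chunk)
            buf[r][s] = old[partner][s]   # receive everything the partner holds
            lo[r] = min(old_lo[r], old_lo[partner])
            hi[r] = max(old_hi[r], old_hi[partner])
        sent.append((old_hi[0] - old_lo[0]) * chunk)
        d *= 2
    return buf, sent

inputs = [np.arange(16.0) + r for r in range(8)]
results, sent_per_step = allreduce_halving_doubling(inputs)
```

For 8 nodes and 16 elements, the per-step message sizes are 8, 4, 2 during reduce-scatter and 2, 4, 8 during all-gather, which makes the imbalance between early and late steps visible.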
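In the original implementation, nodes within the same supernode are assigned adjacent logical node numbers. In the first several steps of Recursive Halving and the last several steps of Recursive Doubling, each node has to communicate with a node far away in another supernode, which results in oversubscription between supernodes and achieves merely 1/4 of the full bidirectional bandwidth. The costs of the original allreduce are given in Equ. 2, Equ. 3, and Equ. 4. The last two equations are obtained by summing the costs of each time step, which form a geometric progression. If the total number of nodes $p$ is much larger than the supernode size $q$, the $\beta_2$ (cross-supernode) term will account for most of the communication time.
$$t_{\text{allreduce}} = t_{\text{reduce-scatter}} + t_{\text{all-gather}} \tag{2}$$
$$t_{\text{reduce-scatter}} = \lg p \, \alpha + \frac{p-q}{p} n \beta_2 + \frac{q-1}{p} n \beta_1 + \frac{p-1}{p} n \gamma \tag{3}$$
$$t_{\text{all-gather}} = \lg p \, \alpha + \frac{p-q}{p} n \beta_2 + \frac{q-1}{p} n \beta_1 \tag{4}$$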
We notice that the communication traffic in different steps is not balanced. Recursive Halving gradually reduces the traffic, while Recursive Doubling gradually increases it. Considering the topology of the Sunway network, a better allreduce implementation should place heavy communication traffic inside one supernode and light traffic across supernodes. We redesign the relationship between physical distance and logical distance used in the allreduce algorithm by incrementally assigning logical numbers to nodes of different supernodes in a round-robin way. For example, assuming we have 4 supernodes, nodes numbered 0, 4, 8, … belong to supernode 0, nodes numbered 1, 5, 9, … belong to supernode 1, and so on. As shown in Fig. 7, the new allreduce conducts cross-supernode communication in the last steps of the reduce-scatter phase and the first steps of the all-gather phase. For these steps, we only need to exchange a relatively small amount of data. The new costs are shown in Equ. 5 and Equ. 6. As we can see, the new implementation largely reduces the coefficient of $\beta_2$ in each phase from $\frac{p-q}{p}$ to $\frac{p-q}{pq}$, thus reducing the overhead caused by oversubscribed communication.
$$t_{\text{reduce-scatter}} = \lg p \, \alpha + \frac{p-q}{pq} n \beta_2 + \frac{q-1}{q} n \beta_1 + \frac{p-1}{p} n \gamma \tag{5}$$
$$t_{\text{all-gather}} = \lg p \, \alpha + \frac{p-q}{pq} n \beta_2 + \frac{q-1}{q} n \beta_1 \tag{6}$$
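The effect of the round-robin numbering can be quantified by counting, for each reduce-scatter step, whether the partner at that distance lives in the same supernode. The sketch below does this for one representative node, assuming the total node count and the supernode size are powers of two and that node r is paired with node r XOR d; the function name and parameters are illustrative.

```python
def reduce_scatter_traffic(p, q, n, round_robin):
    """Return (intra, cross) bytes sent by node 0 during recursive halving.

    p: total nodes, q: nodes per supernode, n: message size in bytes.
    """
    supernodes = p // q

    def supernode_of(rank):
        # Round-robin: consecutive ranks go to different supernodes.
        # Blocked (original): consecutive ranks share a supernode.
        return rank % supernodes if round_robin else rank // q

    intra = cross = 0.0
    size = n / 2.0          # the first step exchanges half of the data
    d = p // 2              # the first partner is p/2 away
    while d >= 1:
        if supernode_of(0) == supernode_of(0 ^ d):
            intra += size
        else:
            cross += size
        size /= 2.0
        d //= 2
    return intra, cross

blocked = reduce_scatter_traffic(16, 4, 16.0, round_robin=False)
remapped = reduce_scatter_traffic(16, 4, 16.0, round_robin=True)
```

With 16 nodes in supernodes of 4 and a 16-byte message, the blocked numbering sends 12 bytes across supernodes and 3 bytes inside, while the round-robin numbering sends 3 bytes across and 12 bytes inside. The all-gather phase mirrors these volumes because recursive doubling visits the same distances in reverse order.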
In addition, the sum operations after data gathering are implemented on the four CPE clusters of the processor. The parameters of different layers can vary greatly in size. In VGG-16, the first fully-connected layer is 102 MB, while the first convolutional layer is only 1.7 KB. Summing the gradients of layers with few parameters can be inefficient, because fine-grained data accesses cannot fully utilize the memory bandwidth. We therefore pack the gradients of all layers together and perform a single allreduce after backward propagation. Such a scheme can fully utilize both the network bandwidth for communication and the memory bandwidth for the sum operation.
V-B Parallel I/O Optimization
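Packing can be sketched as flattening every layer's gradient into one contiguous buffer, reducing that buffer once, and slicing it back into per-layer views. The NumPy code below is a host-side illustration; the function names are assumptions, and the sum of two arrays stands in for the allreduce across nodes.

```python
import numpy as np

def pack_gradients(grads):
    """Flatten a list of per-layer gradients into one contiguous buffer."""
    shapes = [g.shape for g in grads]
    flat = np.concatenate([g.ravel() for g in grads])
    return flat, shapes

def unpack_gradients(flat, shapes):
    """Slice a packed buffer back into per-layer gradients."""
    out, offset = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        out.append(flat[offset:offset + size].reshape(shape))
        offset += size
    return out

# Two hypothetical workers, each with a small and a large layer gradient.
worker_a = [np.ones((2, 3)), np.full((4, 5), 2.0)]
worker_b = [np.full((2, 3), 3.0), np.full((4, 5), 4.0)]
flat_a, shapes = pack_gradients(worker_a)
flat_b, _ = pack_gradients(worker_b)
averaged = unpack_gradients((flat_a + flat_b) / 2.0, shapes)
```

A single large buffer means one reduction call and long contiguous memory accesses, instead of one call per layer with very uneven message sizes.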
The computing nodes of Sunway TaihuLight share one file system. Each worker of the parallel DNN training task uses an I/O thread to prefetch one mini-batch of data via random sampling prior to each iteration. By default, the file system on Sunway TaihuLight adopts a single-split mode for data distribution, which means that one file is stored on only one disk array. In this case, when the file is read concurrently, the aggregate read bandwidth of the concurrent processes quickly reaches the upper limit of a single disk array as the number of processes increases. As a result, the bandwidth of each process drops and the overall reading time becomes longer.
We improve the aggregate bandwidth of the disk arrays by increasing the number of stripes to 32 and setting the stripe size to 256 MB. Data is distributed over 32 disk arrays in a round-robin manner with a block size of 256 MB. Assume that one process needs to read a mini-batch of 256 ImageNet images. The data size of this mini-batch is around 192 MB. Since each process always accesses 192 MB of consecutive data, a single process accesses at most two disk arrays. Accordingly, the number of processes served by each disk array is also reduced to at most $2P/32$, where $P$ is the number of processes.
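The claim that a 192 MB read touches at most two disk arrays follows from the stripe size being larger than the read size. The sketch below maps a read range to disk-array indices under round-robin striping; the function name and the use of whole-megabyte offsets are simplifying assumptions.

```python
STRIPE_COUNT = 32      # number of disk arrays
STRIPE_SIZE_MB = 256   # block size placed on one disk array

def disk_arrays_touched(offset_mb, length_mb,
                        stripe_count=STRIPE_COUNT, stripe_size_mb=STRIPE_SIZE_MB):
    """Return the sorted disk-array indices covered by a contiguous read."""
    first_block = offset_mb // stripe_size_mb
    last_block = (offset_mb + length_mb - 1) // stripe_size_mb
    # Blocks are assigned to disk arrays in round-robin order.
    return sorted({block % stripe_count for block in range(first_block, last_block + 1)})

aligned = disk_arrays_touched(0, 192)               # fits in one block
straddling = disk_arrays_touched(128, 192)          # crosses one block boundary
wrapping = disk_arrays_touched(31 * 256 + 100, 192) # wraps from the last array to the first
```

Because a read of 192 MB can cross at most one 256 MB boundary, no offset produces more than two arrays, and concurrent readers with different offsets are spread over all 32 arrays instead of queuing on one.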
VI Results
We implement swCaffe with the customized Sunway REACH (Open64-based) C compiler and the SWMPI 2.2 (MVAPICH 2.2-based) C++/MPI compiler on TaihuLight. We compare its performance with the original Caffe built with g++ 4.8.0, CUDA 8.0 Toolkit, and cuDNN v5.1, and deployed on a hybrid system with a 12-core Intel E5-2680 V3 CPU (memory bandwidth of 68 GB/s and peak performance of 1.28 TFlops) equipped with an NVIDIA K40m GPU card. We conduct our experiments on the public 1000-way ImageNet dataset (http://www.image-net.org/).
Table II: Time (s) and throughput (Gflops) of each VGG16 convolutional layer with batch size 128. "-": implicit strategy not available; NA: not applicable.

                        forward time (s)             weight_diff backward (s)     in_diff backward (s)
conv  Ni   No   Ci/Ri   implicit  explicit  Gflops   implicit  explicit  Gflops   implicit  explicit  Gflops
1_1   3    64   224     -         4.19      5.29     -         1.10      20.18    NA        NA        NA
1_2   64   64   224     4.30      7.79      110.83   -         5.22      90.49    -         14.97     31.63
2_1   64   128  112     1.63      2.45      146.68   -         1.33      176.70   -         3.61      65.65
2_2   128  128  112     2.34      3.14      202.52   2.26      2.25      209.26   2.39      6.11      198.41
3_1   128  256  56      1.06      0.73      323.10   0.92      0.68      351.07   0.95      1.69      248.92
3_2   256  256  56      1.79      1.14      414.62   1.56      1.29      369.23   1.82      3.05      260.47
3_3   256  256  56      1.79      1.14      415.97   1.56      1.27      376.02   1.82      3.03      260.46
4_1   256  512  28      0.84      0.69      344.42   0.70      0.71      336.32   0.85      0.95      277.64
4_2   512  512  28      1.68      1.33      347.36   1.27      1.33      372.75   1.75      1.89      270.54
4_3   512  512  28      1.68      1.33      348.50   1.27      1.67      372.75   1.75      1.87      270.52
5_1   512  512  14      0.40      0.62      293.58   0.31      0.65      376.94   0.43      0.80      274.26
5_2   512  512  14      0.40      0.63      293.58   0.31      0.78      376.94   0.43      0.84      274.26
5_3   512  512  14      0.40      0.63      293.59   0.31      0.65      377.03   0.43      0.84      274.27
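The Gflops columns of Table II can be cross-checked from the layer shapes. The sketch below assumes what is standard for VGG16 but is not restated in this section: 3x3 kernels, stride 1, and padding that keeps the output the same size as the input, with one multiply-add counted as two operations. The function name `conv_gflops` is ours.

```python
def conv_gflops(batch, n_in, n_out, out_size, seconds, kernel=3):
    """Throughput in Gflop/s for one pass over a convolutional layer.

    Assumes a square kernel, stride 1 and "same" padding, so that the
    output feature map is out_size x out_size (as in VGG16).
    """
    flops = 2 * batch * n_in * n_out * out_size * out_size * kernel * kernel
    return flops / seconds / 1e9


if __name__ == "__main__":
    # conv2_2 forward: Ni = No = 128, 112x112 images, faster time 2.34 s
    print(round(conv_gflops(128, 128, 128, 112, 2.34), 2))
    # conv1_1 forward: Ni = 3, No = 64, 224x224 images, 4.19 s
    print(round(conv_gflops(128, 3, 64, 224, 4.19), 2))
```

Under these assumptions the results agree with the tabulated values to within about 1%, which suggests that each Gflops entry is computed from the faster of the two strategies; the small residual comes from the rounded times.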
VI-A Results for optimizations on different layers
We analyze the performance of convolutional layers with both the explicit and implicit GEMM transformation strategies proposed in Sec. IV-B. Table II presents the measured time and throughput for each convolutional layer of the VGG16 [16] network with batch size 128. VGG16 has 13 convolutional layers and covers the most commonly used parameter configurations. For forward propagation in conv1_1 and backward propagation in conv1_1, conv1_2 and conv2_1, the implicit strategy is unable to handle the small channel sizes, and the explicit strategy is the only solution. For most parameter configurations, the implicit strategy outperforms the explicit strategy. However, the explicit strategy is slightly better for layers of large image sizes and large channel numbers, where GEMM operations can be performed with large block sizes on the matrices generated by im2col. During iterative DNN training, for layers that can be implemented with both methods, swCaffe can use the first two iterations to determine the best strategy for the remaining iterations.
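The run-time selection described above can be sketched as follows. This is a minimal illustration and not swCaffe's actual code: the class `StrategySelector`, its method names and the injectable `clock` are all hypothetical. Each of the first iterations times one candidate implementation, and every later iteration reuses the fastest one.

```python
import time


class StrategySelector:
    """Choose the fastest of several interchangeable layer implementations.

    Hypothetical helper: call number i (for i < number of candidates)
    times candidate i; later calls always run the fastest candidate.
    """

    def __init__(self, candidates, clock=time.perf_counter):
        self.candidates = dict(candidates)  # name -> callable
        self.names = list(self.candidates)
        self.clock = clock
        self.timings = {}
        self.best = None

    def run(self, *args):
        if len(self.timings) < len(self.names):
            # Still probing: time the next untested candidate.
            name = self.names[len(self.timings)]
            start = self.clock()
            result = self.candidates[name](*args)
            self.timings[name] = self.clock() - start
            if len(self.timings) == len(self.names):
                self.best = min(self.timings, key=self.timings.get)
            return result
        # Probing finished: reuse the fastest candidate.
        return self.candidates[self.best](*args)


if __name__ == "__main__":
    # Stand-in workloads; real candidates would be the two GEMM strategies.
    selector = StrategySelector({
        "implicit": lambda n: sum(range(n)),
        "explicit": lambda n: sum(list(range(n))),
    })
    for _ in range(4):
        selector.run(100_000)
    print("selected:", selector.best)
```

With two candidates, the probing phase costs exactly two iterations, which matches the "first two iterations" mentioned in the text. A layer for which only one strategy is available would simply be given a single candidate.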
Figure 8 and Figure 9 present the processing time of each DNN layer on SW26010 and on the K40m GPU, for forward and backward propagation, on AlexNet [17] and VGG16, respectively. We refine AlexNet, without affecting its accuracy, by replacing local response normalization (LRN) with batch normalization (BN). The performance differences between the two architectures mainly come from the following aspects. i) Although DNN training has long been considered a compute-intensive task on GPU, we notice that most of the DNN training time on SW26010 is spent in bandwidth-bound situations. As the bandwidth of GPU device memory can reach 288 GB/s, bandwidth-bound layers, such as pooling layers, are processed very quickly in device memory. However, these layers still take a significant amount of time on SW26010. ii) Although we achieve comparable performance for most compute-intensive layers, SW26010 has low efficiency compared with the GPU for the first two convolutional layers in both networks. Given that these layers have large image sizes, im2col and col2im operations account for most of the time in the first two layers. In addition, the input/output channel sizes are 3/64 and 64/64 for the first two convolutional layers, which is not enough for compute-bound blocked GEMM operations. The flop-to-byte ratio of GEMM operation with (size of )+=(size of ) (size of ) is . The best ratio is , if . The architectural flop-to-byte ratio calculated with the best measured bandwidth is . As a result, to make GEMM compute-bound, we have to make . However, small channel sizes limit the dimension sizes of the transformed matrices.
VI-B Results for different network structures
In Table III, we evaluate the performance of our framework on complete DNN training tasks with different network structures. We use img/sec as the indicator, which is the number of images processed in one second. AlexNet, VGG16, VGG19 [16], ResNet50 [18] and GoogleNet [19] are tested with batch sizes of 256, 64, 64, 32 and 128, respectively. Compared with the 12-core CPU, SW26010 with our framework is 2.79x to 7.84x faster on the five DNNs. Our framework on SW26010 outperforms the K40m GPU on AlexNet with a speedup of 1.19x. Data transfer from CPU host memory to GPU device memory through the PCIe bus accounts for over 40% of the time during training of AlexNet, as the calculation time is too short to hide the memory I/O overhead. In contrast, the CPEs in SW26010 can directly access memory with DMA, which eliminates the data reading overhead. Our framework on SW26010 achieves 45% and 49% of the overall performance of the NVIDIA K40m GPU on VGG16 and VGG19, with a theoretical memory bandwidth that is only 44% of that of the GPU. Implementations of ResNet50 and GoogleNet with swCaffe achieve 21% and 23% of the overall performance of GPU Caffe, because their convolutional layers adopt smaller channel settings than VGG16 and VGG19. Since the memory bandwidth achieved on convolutional layers with small channel numbers is limited, the two networks exhibit stronger memory-bound properties on SW26010.
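Table III: Training throughput (img/sec) on the CPU, the NVIDIA K40m GPU (NV K40m) and SW26010 (SW), with throughput ratios.

Network    CPU    NV K40m  SW     SW/NV  SW/CPU
AlexNet    12.01  79.25    94.17  1.19   7.84
VGG16      1.06   13.79    6.21   0.45   5.13
VGG19      1.07   11.2     5.52   0.49   5.15
ResNet50   1.99   25.45    5.56   0.21   2.79
GoogleNet  4.92   66.09    14.97  0.23   3.04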
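The memory-bound behaviour of layers with small channel numbers can be illustrated with a roofline-style estimate. The sketch below rests on several assumptions that are ours rather than the paper's: each GEMM operand is moved between memory and the processor exactly once, elements are 4 bytes, a convolution is lowered per image to a GEMM of shape No x (Ro*Co) x (Ni*K*K) as in the original Caffe im2col scheme, and the machine balance uses the 3.02 TFlops peak together with a theoretical bandwidth of 44% of 288 GB/s instead of the best measured bandwidth. All function names are hypothetical.

```python
def gemm_flop_per_byte(m, n, k, bytes_per_element=4):
    """Flop-to-byte ratio of C (m x n) += A (m x k) * B (k x n).

    Assumes each of A, B and C crosses the memory interface once.
    """
    flops = 2 * m * n * k
    traffic = bytes_per_element * (m * k + k * n + m * n)
    return flops / traffic


def ridge_point(peak_flops, bandwidth_bytes_per_s):
    """Minimum flop-to-byte ratio at which a kernel can be compute-bound."""
    return peak_flops / bandwidth_bytes_per_s


def im2col_gemm_shape(n_in, n_out, out_size, kernel=3):
    """(m, n, k) of the per-image GEMM produced by an im2col lowering."""
    return n_out, out_size * out_size, n_in * kernel * kernel


if __name__ == "__main__":
    # Assumed machine balance: 3.02 TFlops peak, 0.44 * 288 GB/s bandwidth.
    ridge = ridge_point(3.02e12, 0.44 * 288e9)
    for name, shape in [("conv1_1", im2col_gemm_shape(3, 64, 224)),
                        ("conv4_2", im2col_gemm_shape(512, 512, 28))]:
        ratio = gemm_flop_per_byte(*shape)
        bound = "compute-bound" if ratio > ridge else "memory-bound"
        print(f"{name}: {ratio:.1f} flop/byte vs ridge {ridge:.1f} -> {bound}")
```

Under these assumptions conv1_1 reaches roughly 9.5 flop/byte against a ridge point of about 24, so it cannot be compute-bound however well the GEMM is tuned, whereas a 512-channel layer sits far above the ridge. With the lower measured bandwidth the threshold would be higher still.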
VI-C Results for scalability
Recent large-batch training methods have increased the minibatch size in data-parallel SGD without losing accuracy over a fixed number of epochs. A large minibatch size exposes more parallelism for scaling DNN training over multiple nodes, as the computing task of each node can achieve a high compute-to-communication ratio. As shown in Figure 10, we scale AlexNet and ResNet50 to 1024 CPUs. Compared with the training speed on a single node, speedups of 715.45x, 561.58x and 409.50x are achieved on 1024 nodes for AlexNet trained with sub-minibatch sizes of 256, 128 and 64, respectively. Speedups of 928.15x and 828.32x are achieved on 1024 nodes for ResNet50 trained with sub-minibatch sizes of 32 and 64, respectively. Although the minibatch size limit of the current large-batch method [12] for AlexNet and ResNet is 32K, TaihuLight equipped with our framework is able to benefit from new training algorithms with larger batch sizes.
Fig. 11 shows the proportion of communication time during training on AlexNet and ResNet50. The proportion of communication time is 10.65% and 19.11% for ResNet50 trained with sub-minibatch sizes of 32 and 64 at the scale of 1024 nodes. The proportion of communication time is 60.01%, 45.15% and 30.13% for AlexNet trained with sub-minibatch sizes of 64, 128 and 256 at the scale of 1024 nodes. Since the model parameter size of ResNet50 is smaller than that of AlexNet (97.7 MB vs 232.6 MB) and ResNet50 requires more computation, its higher computation-to-communication ratio accounts for its better scalability.
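The reported speedups and communication proportions can be related by a simple model. The sketch below assumes weak scaling (each node keeps the same sub-minibatch, so its compute time per iteration is unchanged) and communication that is not overlapped with computation. Under these assumptions the speedup on N nodes is N x (1 - f), where f is the share of iteration time spent communicating. The function name and the dictionary are ours; the numbers are those quoted in the text.

```python
def estimated_speedup(nodes, comm_fraction):
    """Speedup over one node when comm_fraction of each iteration is
    non-overlapped communication and per-node compute time is constant."""
    return nodes * (1.0 - comm_fraction)


# (network, sub-minibatch size): (communication share, reported speedup)
REPORTED = {
    ("AlexNet", 256): (0.3013, 715.45),
    ("AlexNet", 128): (0.4515, 561.58),
    ("AlexNet", 64): (0.6001, 409.50),
    ("ResNet50", 32): (0.1065, 928.15),
    ("ResNet50", 64): (0.1911, 828.32),
}

if __name__ == "__main__":
    for (net, size), (share, measured) in REPORTED.items():
        model = estimated_speedup(1024, share)
        print(f"{net:8s} sub-minibatch {size:3d}: "
              f"model {model:7.2f}x, reported {measured:7.2f}x")
```

The model reproduces the reported speedups to within about 2%, which is consistent with communication being the dominant source of lost efficiency at 1024 nodes.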
VII Related Works
Existing methods for accelerating basic DNN layers mainly focus on the many-core architectures of NVIDIA GPUs and Intel Xeon Phi. cuDNN [9] is a widely used GPU-accelerated library of primitives for deep neural networks. Intel MKL-DNN [20] is a library of DNN performance primitives optimized for Intel architectures. Both provide a set of highly optimized building blocks intended to accelerate the compute-intensive parts of deep learning applications.
The work in [21] was the first to train DNN models on CPU-GPU hybrid HPC systems. Since then, a large number of works have focused on scaling DNN training on GPU supercomputers and HPC clusters. Inspur-Caffe [22] is an MPI-based Caffe fork that exploits a parameter-server approach with stale asynchronous gradient updates. FireCaffe [23] discusses the scaling of DNN models on a cluster of 128 GPUs connected with InfiniBand interconnects. It also adopts an allreduce-based parameter synchronization implemented with reduction trees. S-Caffe [24] provides modern multi-GPU clusters with a CUDA-Aware MPI runtime for reduce/broadcast operations and scales DNN training to 160 GPUs.
There are a variety of general DNN frameworks deployed on HPC systems. TensorFlow [25], developed by Google, is the most famous DNN framework that operates at large scale and in heterogeneous environments. It implements communication using the Google RPC library. Caffe2 [26] is developed by Facebook and built on Caffe. CNTK [27] is developed by Microsoft. Both Caffe2 and CNTK natively support MPI for inter-node communication. MXNet [28] supports multi-GPU training with a parameter server called PS-lite, which is implemented with the ZeroMQ library for communication. Intel-Caffe [20] can harness the power of Intel KNL coprocessors and supports multi-node training through Intel MLSL (Machine Learning Scaling Library), a library built on top of MPI that works across various interconnects, such as Intel Omni-Path, InfiniBand, and Ethernet.
VIII Conclusion
We share our experience of designing a parallel DNN framework called swCaffe on Sunway TaihuLight from the processor architecture and networking perspectives. Highly optimized routines for DNN layers are derived, fully taking into consideration different aspects of the hardware characteristics. We optimize the allreduce operation for parameter synchronization in the parallel training process in terms of both the communication topology and the computational approach. Compared to Caffe on the NVIDIA K40m GPU, our framework on SW26010 has competitive performance for DNNs with compute-intensive convolution operations, such as AlexNet and VGG. Experiments show that our allreduce routine is sufficient for parallel synchronous SGD training.
References
 [1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
 [2] Maryam M Najafabadi, Flavio Villanustre, Taghi M Khoshgoftaar, Naeem Seliya, Randall Wald, and Edin Muharemagic. Deep learning applications and challenges in big data analytics. Journal of Big Data, 2(1):1, 2015.
 [3] Haohuan Fu, Junfeng Liao, Jinzhe Yang, Lanning Wang, Zhenya Song, Xiaomeng Huang, Chao Yang, Wei Xue, Fangfang Liu, Fangli Qiao, et al. The sunway taihulight supercomputer: system and applications. Science China Information Sciences, pages 1–16, 2016.
 [4] Jiarui Fang, Haohuan Fu, Wenlai Zhao, Bingwei Chen, Weijie Zheng, and Guangwen Yang. swdnn: A library for accelerating deep learning applications on sunway taihulight. In Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International, pages 615–624. IEEE, 2017.
 [5] https://github.com/feifeibear/SWCaffe.
 [6] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014.
 [7] Zhigeng Xu, James Lin, and Satoshi Matsuoka. Benchmarking sw26010 manycore processor. In Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International, pages 743–752. IEEE, 2017.
 [8] Lijuan Jiang, Chao Yang, Yulong Ao, Wanwang Yin, Wenjing Ma, Qiao Sun, Fangfang Liu, Rongfen Lin, and Peng Zhang. Towards highly efficient dgemm on the emerging sw26010 manycore processor. In Parallel Processing (ICPP), 2017 46th International Conference on, pages 422–431. IEEE, 2017.
 [9] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.
 [10] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.
 [11] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
 [12] Yang You, Igor Gitman, and Boris Ginsburg. Scaling sgd batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017.
 [13] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pages 1223–1231, 2012.
 [14] Rajeev Thakur, Rolf Rabenseifner, and William Gropp. Optimization of collective communication operations in mpich. The International Journal of High Performance Computing Applications, 19(1):49–66, 2005.
 [15] Pitch Patarasuk and Xin Yuan. Bandwidth optimal allreduce algorithms for clusters of workstations. Journal of Parallel and Distributed Computing, 69(2):117–124, 2009.
 [16] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

 [19] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
 [20] https://github.com/intel/caffe.
 [21] Adam Coates, Brody Huval, Tao Wang, David J. Wu, Bryan Catanzaro, and Andrew Y. Ng. Deep learning with cots hpc systems. 2013.
 [22] https://github.com/CaffeMPI/CaffeMPI.github.io.
 [23] Forrest N Iandola, Matthew W Moskewicz, Khalid Ashraf, and Kurt Keutzer. Firecaffe: near-linear acceleration of deep neural network training on compute clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2592–2600, 2016.
 [24] Ammar Ahmad Awan, Khaled Hamidouche, Jahanzeb Maqbool Hashmi, and Dhabaleswar K Panda. S-caffe: Co-designing mpi runtimes and caffe for scalable deep learning on modern gpu clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 193–205. ACM, 2017.
 [25] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
 [26] https://github.com/caffe2/caffe2.
 [27] Frank Seide and Amit Agarwal. Cntk: Microsoft’s open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2135–2135. ACM, 2016.
 [28] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.