1 Introduction
In recent years, Deep Learning (DL) techniques have achieved promising results in various domains [1, 2]. Convolutional Neural Network (CNN) algorithm is an important branch of DL. Benefitting from largescale training datasets and the complex training network, CNN achieves high accuracy and is widely applied in various domains, such as image classification [3], speech recognition [4], and text processing [5]. However, the training process of CNN is very timeconsuming, in which large amounts of training samples and iterative operations are required to obtain highquality weight parameters [6, 7]. It is critical to accelerate the training process and improve the performance of CNN. Cloud computing, highperformance computing cluster, and supercomputing provides strong computing power for various applications [8, 9, 10]. Therefore, it is a critical issue that how to design an effective parallel CNN training model based on distributed computing clusters and address the challenges of data communication, synchronization, and workload balancing, while maintaining high performance and high accuracy.
Numerous enhancements were proposed to accelerate the CNN and DL algorithms by improving the execution environments, mainly focusing in two directions [11, 12, 13, 8, 14, 15, 16]: (1) using multicore CPUs or GPUs platform to provide highspeed processing capacities, and (2) using multiple distributed computers to increase the available computing power. The CPU/GPU based methods [12, 13, 8] can perform more arithmetic operations and are suitable for the training of modestly sized DL models. However, a known limitation of the these methods is that the training acceleration is small when the scale of training datasets or DL models exceeds the GPU memory capacity. To process largescale datasets and DL models, some distributed architecture based solutions were proposed, such as DistBelief[14]
, Caffe
[15], and Tensorflow
[16]. Each of these approaches achieves significant progress in different perspectives, with implicit own complications in engineering and implementation. Considerable improvements might be obtained by combining the advantages in both aspects and make use of the computing power of multiple machines in a distributed cluster and the highperformance CPUs or GPUs on each machine.There exists multiple challenges in this regard. Firstly, the entire CNN model contains multiple CNN subnetwork models that are trained in parallel on different machines, which requires synchronization and integration operations. It is required to minimize the synchronization waiting problem between subnetwork models, and guarantee the accuracy of the integrated model. Moreover, highquality CNN models often require largescale training dataset and a large number of iterations. Hence, an effective parallel mechanism should be carefully designed to minimize the data communication overhead between different iteration steps, tasks in different threads/CPUs/GPUs, and distributed computers. Furthermore, considering the heterogeneity of distributed computing clusters, computing nodes might be equipped with different CPU or GPU structures and have different training speed. How to partition the training dataset into these computers and how many parallel training tasks are started on each computer to maximize the computing power and workload balance of each computer. Finally, how to design a scalable parallel mechanism according to the characteristics of the CNN network, which can be easily deployed on elastic computing clusters and meet the application requirements in different areas.
In this paper, we aim to address the above challenges and fully utilize the parallel computing capacity of computing clusters and multicore CPU to accelerate the training process of largescale CNNs. We propose a Bilayered Parallel TrainingCNN (BPTCNN) architecture in distributed computing environments. The outerlayer parallelism is deployed in a distributed cluster to train data subsets in parallel with minimal data communication and maximal workload balance. The innerlayer parallelism is performed on each computer using multithreaded platforms. Experiments on largescale datasets indicate the advantages of the proposed BPTCNN in terms of performance, data communication, workload balance, and scalability. The contributions of this paper are summarized as follows.

In the outerlayer parallelism, an Incremental Data Partitioning and Allocation (IDPA) strategy is proposed to maximize the workload balance and minimize data communication among computers, where largescale training datasets are partitioned and allocated to computers in batches according to their computing power.

An Asynchronous Global Weight Updating (AGWU) strategy is proposed to integrate CNN subnetwork models from different computers and to address the synchronization waiting problem during the global weight update process.

In the innerlayer parallelism, two timeconsuming training steps of the CNN model are parallelized on each computer based on taskparallelism, including convolutional layer and local weight training process.

To achieve threadlevel load balancing and critical paths waiting time minimization, we introduce task decomposition and scheduling strategies for CNN training tasks with multithreaded parallelism.
The rest of the paper is organized as follows. Section 2 reviews the related work. Section 3 presents the BPTCNN architecture and the outerlayer parallelization process. Section 4 describes the innerlayer parallel training of BPTCNN. Experimental results and evaluations are discussed in Section 5. Finally, Section 6 concludes the paper with a discussion of future work and research directions.
2 Related Work
Previous works have proposed various hardware designs for CNNs and other deep learning algorithms acceleration [17, 8]. FPGAs have been widely explored as hardware accelerators for CNNs because of their reconfigurability and energy efficiency [18]. In [11], a parallel solution of CNNs was designed on manycore architecture, in which the model is parallelized on a new platform of Intel Xeon Phi Coprocessor with OpenMP. Caffe [15] provides multimedia scientists and practitioners with a clean and modifiable framework for stateoftheart deep learning algorithms. Chung et al. proposed a parallel Deep Neural Networks (DNNs) training approach for big data on the IBM Blue Gene/Q system, in which issues of regarding programming model and datadependent imbalances were addressed [18]. In [19], a massively parallel coprocessor was designed as a metaoperator for CNNs, which consists of parallel 2D convolution primitives and programmable units.
To efficiently handle largescale CNN and big data, outstanding distributed architecture based solutions were implemented in [20, 14]. Adam [21] is an efficient and scalable deep learning training system, optimizing and balancing workload computation and communication through entire system codesign. An energyefficient reconfigurable accelerator was presented in [22] for deep CNN. To minimize the energy cost of data movement for any CNN shape, a processing row stationary dataflow was introduced to reconfigure the computation mapping of a given shape. In [14], Dean et al
. introduced a distributed system (termed DistBelief) for training large neural networks on massive amounts of data. DistBelief uses two complementary types of parallelism: distributed parallel between multiple models and in each model, respectively. In addition, an asynchronous Stochastic Gradient Descent (SGD) procedure was employed to support a large number of model replicas.
Tensorflow [16]
is a popular framework for largescale machine learning on heterogeneous distributed systems. The computation model of Tensorflow is based on dataflow graphs with mutable state, where the graph nodes can be distributed and executed in parallel on different workers, multicore CPUs, and generalpurpose GPUs. In addition, Tensorflow uses a declarative programming paradigm, and developers can focus on the symbolic definition and computation logic instead of the implementation details. However, there are some shortcomings in Tensorflow: (a) TensorFlow attempts to occupy all available GPU memory in the initial phase, which makes the machines deploying the Tensorflow program infeasible to share with other applications, and (b) many highlevel operations and interfaces in TensorFlow are nested and chaotically packaged, making it difficult to customize programming. Hence, we study the parallelism idea of the DistBelief and Tensorflow approaches, and implement a bilayered parallel training architecture for largescale CNNs by combining the advantages of both distributed computing and CPU/GPU parallel computing.
Comparing with existing efforts, the proposed BPTCNN architecture in this paper fully utilizes the parallel capacity of both the distributed cluster and multicore CPU of individual machines. Benefitting from the proposed IDPA and AGWU strategies, we effectively improve the training performance of the CNN model and address the problems of data communication, synchronization, and workload balancing of distributed cluster. Moreover, according to task decomposition and scheduling strategies, BPTCNN achieves the optimization objectives of threadlevel load balancing and waiting time minimization of critical paths.
3 BPTCNN Architecture for CNNs
3.1 Convolutional Neural Networks
CNN model is one of the most representative network structures of DL technologies and has become one of the hot topics in various fields of science. The common architecture of a CNN network includes two components: a feature extractor and a fullyconnected classifier. In a convolutional layer, each batch of the input dataset is analyzed to obtain different abstract features. Given an input
with scale (), where , , and refer to the depth, height, and width of . Assuming that a filter with scale () is used in the convolutional layer to extract a feature map , then denotes the value of the th column of the th row of the current feature map, as calculated in Eq. (1):(1) 
where , , and are the depth, height, and width of the current filter, and
is an activation function, such as
, , or function [23].Pooling layer is utilized on each feature map to reduce the feature dimensions. There exist various pooling methods, i.e., max pooling and mean pooling. Fullyconnected layer is a classification layer of CNN, where all output features of convolutional or pooling layers are connected to all hidden neurons with weight parameters. An example of a CNN architecture is illustrated in Fig.
1.Massive training datasets and iterative training process guarantee the high precision of CNN. However, they become the performance bottleneck when the training network structure is complex and the computing power is insufficient. The characteristics and the corresponding limitations of CNNs are summarized as follows.

Massive weight parameters: There are a large number of weight parameters in the network. If there are different connection structures among layers, the network will become more complex, which requires more computation time cost.

Largescale training datasets and massive iterative operations for weight updating: High accuracy of CNNs requires highquality weights, which depends on largescale training datasets and iterative training.
3.2 Bilayered Parallel Training Architecture
To accelerate the training process of CNNs, we propose a bilayered parallel training architecture for largescale CNNs. We describe the distributed computing environment and training process of the BPTCNN architecture.
3.2.1 BPTCNN Architecture
BPTCNN architecture is composed of two main components: (a) an outerlayer parallel training for multiple CNN subnetworks on separate data subsets, and (b) an innerlayer parallel training for each subnetwork. The proposed BPTCNN architecture is illustrated in Fig. 2.
(1) Outerlayer parallel training. A dataparallelism strategy is adopted in the outerlayer parallel training process, where a largescale dataset is split into multiple subsets and allocated to different computing nodes to be trained in parallel. At the parameter server, the global weight parameters of the entire CNN network are updated depending on the local weights from each training branch. The updated global weight parameters are shared to each machine for the next iterative training.
(2) Innerlayer parallel training. The inner layer adopts a taskparallelism strategy to further accelerate the training process of each CNN subnetwork on each computer. Two timeconsuming computation tasks are parallelized, including convolutional layer and the local weight training process. Computation tasks on these processes are decomposed depending upon their logical and data dependence, and are executed with multithreaded parallelism.
3.2.2 Distributed Computing Cluster for BPTCNN
We construct a distributed computing cluster for the proposed BPTCNN architecture to efficiently handle massive training datasets and largescale CNN models. The distributed cluster mainly consists of a main server, several computing nodes with multcore CPU, and a parameter server, as shown in Fig. 3.
The main server is responsible for CNN training task management as well as data partition and allocation. It copies the CNN training network and allocates them to each computing node. Meanwhile, the training dataset is split into a series of subsets and allocated to the corresponding computing nodes. During the parallel training, the main server monitors the training time costs on computing nodes and migrates datasets for the optimization objective of synchronization delay minimization.
On each computing node, samples in the subset are calculated by the corresponding CNN subnetwork, while the network weight parameters are trained as a local weight set. The training process on each computer is executed in parallel. In addition, each computer is equipped with a multicore CPU platform. In an innerlayer parallel training, the training process of each CNN subnetwork is further parallelized using multithreaded parallelism.
The parameter server collects the trained local weight parameters from each computing node and updates the global weight parameters. Then, the updated global weight set is reallocated to each computing node for the next epoch of training.
3.3 Outerlayer Parallel Training of BPTCNN
In BPTCNN’s outerlayer parallel training architecture, we address critical issues of distributed and parallel computing, including data communication, synchronization, and workload balancing. Firstly, considering the heterogeneity of the distributed computing cluster, we propose an incremental data partitioning and allocation strategy to maximize cluster’s workload balancing and minimize data communication overhead. In addition, we propose an asynchronous global weight updating strategy to further minimize the synchronous wait in the global weight update process.
3.3.1 Incremental Data Partitioning and Allocation Strategy
Considering the heterogeneity of computing nodes and their different training speed, to maximize the workload balance of the distributed cluster and minimize the synchronization in global weight update process, we propose an Incremental Data Partitioning and Allocation (IDPA) strategy based on heterogeneous sensing. As there are no dependencies between training samples, they can be partitioned and allocated in batches instead of done at once, according to the computing power of the computing nodes. Assume that there are training samples, computing nodes in a distributed cluster, and training iterations are required for the CNN model. Let () be the number of batches of data partitioning, that is, the entire training dataset is incrementally partitioned and allocated in times, and each time new samples are processed.
Initially, we take the first samples as the training dataset in the first batch. Before the training iteration, we use the constant characteristics of the computing nodes to represent their heterogeneity, i.e., the CPU/GPU frequency is measured. Let be the CPU/GPU frequency of computing node , and the number of samples that will be partitioned and allocated to is calculated as:
(2) 
After receiving the training samples, each computing node begins the first iteration of training. At the same time, we monitor the execution time of each computing node to complete the iteration and evaluate its actual computing power. Being of the opinion that there might be more applications from different employers executing on the compute nodes, although we can predict the computing power based on the computing node’s CPU/GPU frequency, it is more accurate to evaluate its actual computing power by actual execution time. Therefore, after the first training iteration, we can partition the training dataset according to the actual computing power of the machines. Let be the execution time of computing node to train samples in the current iteration, then we can get the average execution time of for a sample as . We collect the execution time of the computing nodes in the current iteration and predict the execution time required by all computing nodes in the next iteration. Note that new samples will be partitioned and allocated in the th batch. Namely, there is a total of samples on the computing nodes. The average execution time of the computing nodes in the th training iteration is calculated as:
(3) 
where is the average execution time for training a sample by any compute node. To minimize the synchronization latency among computing nodes during the global weight update process, we expect all nodes to complete each iteration as close as possible. Assume that samples on after the th batch partitioning and allocation, we can calculate the value of as:
(4) 
Accordingly, we can obtain the number of samples that can accept in the th batch allocation according to its actual computing power, as calculated as:
(5) 
Repeat this process times until the entire training dataset is partitioned and allocated to the heterogeneous computing cluster. By considering the heterogeneity of computing nodes, each computing node receives a corresponding number of training samples based on its actual computing power. The total number of samples allocated to each computing node is denoted as , and . The detail steps of the IDPA strategy are described in Algorithm 1.
Benefitting from the IDPA strategy, the training dataset is well partitioned and allocated to the computing nodes, allowing them to complete each iteration in same duration, achieving minimal synchronization delay and maximum workload balancing. Moreover, no data migration is required among compute nodes during the training process, thereby no unnecessary data communication overhead is incurred.
Recall that iterations are required for the CNN model, that is, each sample has an average of times to train the weight parameter set of the CNN network model. After the iterations in the data partitioning process, each computing node continues to execute the remaining iterations on the samples. Since these samples are incrementally allocated to the computing nodes, the actual training times of samples in the iterations are instead of . Therefore, we should recalculate the remaining iterations of the training process, as defined below:
(6)  
The total number of training iterations of the CNN model is . To simplify the expression, we denote as in the remaining context.
3.3.2 Global Weights Updating Strategies
There are massive connections with different weight parameters among all layers in a CNN network. We define these weight parameters as a weight set. We need to collect the training results on each computing node to update the global weight set for the entire CNN network. In this section, we propose two global weight updating strategies for the CNN network. We respectively define the local weight of each CNN subnetwork and the global weight of the entire CNN network as follows.
Definition 1: Local weight set. The weight parameters among all training layers of a CNN training network are denoted as a weight set. At each computing node, the weight set of a CNN subnetwork is defined as the local weight set of the corresponding subnetwork. The local weight set is trained based on the related data subset. In a distributed computing cluster, there is a local weight set on each computing node, which is updated after training a sample.
Definition 2: Global weight set. The weight set of the entire CNN network is defined as the global weight set. We provide a parameter server for calculating the global weight set by combining parts or all of the local weight sets. The global weight set is aggregated by each local weight set and shared to all computing nodes for the next epoch of training.
(1) Synchronous global weight updating strategy.
We propose a Synchronous Global Weight Updating (SGWU) strategy for BPTCNN, where each computing node trains all the samples of the current subset and updates the local weight set for an iteration. The local weight sets trained by all computing nodes in the current iteration are gathered at the parameter server, where a new version of the global weight set is generated. The workflow of the SGWU strategy is illustrated in Fig. 4.
Considering that different local weight sets are trained by the corresponding subsets on different computers, having different contributions for the global weight set. We verify the accuracy of each CNN subnetwork after completing an epoch of local iteration training and use it as the contribution of the local weight set. After all computers finish an epoch of local iteration training, the latest local weight set trained on each computer is aggregated to the parameter server to update the global weight set as a new version . The global weight set for the th epoch of iteration training is defined in Eq. (7):
(7) 
where and are the local weight set and the corresponding accuracy of the CNN subnetwork on computer , which is obtained in the th epoch of local iteration training.
In a distributed computing cluster, especially for one equipped with heterogeneous computers, although we use the IDPA strategy to maximize cluster’s workload balancing, the SGWU strategy inevitably faces the synchronization problem during the global weight update process. Due to the different available computing capabilities, computers need different time costs to execute each training iteration. Let be the execution duration for the th training iteration on computer . The waiting time for synchronization of the entire computing cluster is defined in Eq. (8):
(8) 
where is the number of iteration training and is the number of computing nodes.
(2) Asynchronous global weight updating strategy.
To address the synchronization problem of SGWU, we propose an Asynchronous Global Weight Updating (AGWU) strategy. In AGWU, once a computing node completes a training iteration on the local samples, the updated local weight set is submitted to the parameter server to immediately generate a new version of the global weight set, without waiting for other computing nodes. Compared with SGWU, AGWU can effectively solve the synchronization waiting problem without increasing the communication overhead. The workflow of the AGWU strategy is shown in Fig. 5.
Considering the heterogeneity of computing nodes, according to the IDPA strategy, each computing node may contain different scales of training subset. In addition, due to the different training speeds, each computing node may also submit its local weight set to the parameter server at different time points, and get different versions of the global weight set. For example, for a computing node with samples, we assume that train the local weight set based on based on the version of the global weight set in the current iteration. During the current training iteration of , the global weight set has been updated from to by other computing nodes. In this case, the low speed computers train the local weight set based on the old version of the global weight set, while the high speed computers based on the newer version. Denote () as the increment between the submitted local weight set and its base version global weight set . Assume that there is another local weight set on and it is trained based on the version of the global weight set, where . It is easy to know that has less impact than in the process of updating . Hence, we can conclude that the local weight sets using the old version of the global weight set have less impact on the global weight updating than those using the new version of the global weight set. Therefore, we adopt a time attenuation factor to measure the impact of each local weight set to the current global weight update process. Denote as the time attenuation factor of the local weight set submitted from , as calculated in Eq. (9):
(9) 
where is the latest version of the global weight set, and is the version of the global weight set that used to train .
Since there is no dependence among the training subsets on different computing nodes, the global weight update process does not require the training results from all computing nodes at the same time. Once a local weight set is submitted, the current global weight set is immediately updated to a new version , without waiting for other computing nodes. The th version of the global weight set is updated as:
(10)  
where is the update component from , and is the accuracy of the CNN subnetwork on computer , which is evaluated by the output of the current local iteration training.
After obtaining the updated global weight set , the parameter server shares to for the next iteration training. Subsequently submitted local weight sets from other computing nodes will update the global weight set based on the latest version. The steps of the AGWU strategy of BPTCNN is described in Algorithm 2.
In comparison with the SGWU strategy, in the AGWU strategy, each computing node independently participates in the global weight update process, so there is no synchronization waiting problem in AGWU. Furthermore, from the perspective of the entire training process, the update of the global weight set depends on the training results of all compute nodes. According to Eq. (7) and Eq. (10), the global weight set is updated based on the local weight set and the corresponding accuracy of the trained mode in both of SGWU and AGWU strategies.
(3) Data communication of global weight updating.
In BPTCNN, data communication only incurs between each computing node and the parameter server for the global weight updating and sharing. In AGWU, to reduce the synchronization cost and data communication overhead from the perspective of computing nodes, after receiving a version of the global weight set, each computing node begin a training iteration. It does not participate in the global weight update and receive a new version before completing the current iteration. In both of SGWU and AGWU strategies, the global weight set is updated for every epoch of iteration training. Therefore, both strategies produce the same data communication overhead. Denote as the number of CNN iteration training, data communication in SGWU and in AGWU between the parameter server and all computing nodes is calculated in Eq. (12):
(11) 
where is a unit communication cost for transmitting the global weight set between the parameter server and a computing node. For each update of the global weight set, there exist 2 iterations of data communication: (1) submitting the local weight set from a computing node to the parameter server, and (2) sharing the updated global weight set from the latter to the former.
4 Innerlayer Parallel Training of BPTCNN
In the innerlayer parallel training of BPTCNN, we further parallelize the training process for each CNN subnetwork on each computing node. Two timeconsuming training steps are parallelized based on taskparallelism, including convolutional layer and the weight training process. In addition, we propose task decomposition and scheduling solutions to realize threadlevel load balancing and critical paths waiting time minimization.
4.1 Parallel Computing Models of CNN Training Process
4.1.1 Parallelization of Convolutional Layer
In the training process of a CNN network, convolutional layers take more than 85.18% of the total training duration, but only train 5.32  6.63% of the weight parameters [d14]. Fortunately, the matrixparallelbased method provides an effective way of performing convolutional operations in parallel. We introduce the parallel mechanism of the convolutional operations into the innerlayer parallel training architecture of BPTCNN. We use the data partitioning method of the input matrix in CNN and extract all convolution areas from the input matrix. Then, by sharing the filter matrix, all convolution areas are convoluted in parallel with the shared filter matrix.
Given an input matrix with the shape of (), where , , and are the depth, height, and width of . Providing a filter parameter matrix with the shape of (), a feature map is generated via convolutional multiplication on and . Based on the scales of and , the shape of is calculated as:
(12)  
where , , and are the depth, height, and width of , respectively. Based on the scales of , , and , we calculate the times of convolutional operations in the current convolutional layer, which will be executed in parallel. is calculated in Eq. (13):
(13) 
where
is the stride of the convolutional operation and
is the number of the zero padding, which means appending
laps elements around with the value of 0.To execute these operations in parallel, we need to identify the convolution areas of the input matrix for each task. A convolution area of includes the begin and end rows and columns. In each convolutional operation task, an elementbyelement multiplication is executed on and to generate the corresponding element of . For each element in , location indexes of the convolution area in is calculated in Eq. (14):
(14)  
After obtaining location indexes of each convolution area, we extract the contents of different convolution areas and perform the related convolutional operations in parallel, without waiting for the end of the previous convolutional operations. These parallel convolutional operations on different areas access the input and filter matrices repeatedly and simultaneously from the same memory without updating the contents. Without data dependence among these tasks, different tasks can access different convolution areas in simultaneously. An example of the parallel convolutional operation of each CNN subnetwork in BPTCNN is illustrated in Fig. 6 and the steps of this process are described in Algorithm 1.
As defined in Eq. (13), the maximum parallelism degree of a convolutional layer is equal to the number of elements of the output feature map, which is computed according to the scale of and . The total execution duration of a convolutional layer is calculated in Eq. (15):
(15) 
where is the number of elements in and is the execution duration of the th operation task.
4.1.2 Parallelization of Local Weight Training Process
To distinguish the weight set of the entire CNN network and that of each CNN subnetwork, we respectively define the global weight set and local weight sets in Section 3.3.2. In this section, training process of the local weight set of each CNN subnetwork is parallelized on each computer.
After obtaining the outputs of a CNN subnetwork, the error (loss function) of each layer is evaluated from the output layer to the first convolutional layer using the Back Propagation (BP) method. The Stochastic Gradient Descent (SGD) process
[24, 25] is involved in updating the weight parameters among all layers of the current CNN subnetwork. In the output layer, the square error of all neurons is taken as the objective function of weight training, as defined in Eq. (16):(16) 
where denotes the loss function of the input , and and are the label and the output of the neuron in the output layer, respectively. The error of is the inverse of the partial derivative of the error of the input of , as calculated in Eq. (17):
(17) 
where is the input of the neuron that connected with , that is, is the output of . is the weight of the connection between neurons and .
Let be the set of errors of neurons in the th layer . Based on , the error set of neurons in is calculated in Eq. (18):
(18) 
where is the weight set of and is the weighted input of , as defined as:
(19)  
where is the output matrix of , consisting of each element . An example of the calculation process of loss function between layers and is shown in Fig. 7.
We parallelize the process for the loss function calculation, where the errors of neurons in the same layer are computed in parallel. In the convolutional layer, each neuron in the output layer (a feature map) is connected to a part of neurons in the input layer (an input matrix). In such a case, the error calculation of neurons in the previous layer depends on the results of a part of neurons in the next layer . Hence, we parallelize this process depending on neurons in . An example of the loss function calculation parallelization is shown in Fig. 8.
After obtaining the error set of neurons in , we calculate the error of each neuron in . Let be the error component of neuron in for in , as defined as:
(20) 
where and are the height and width of the filter parameter matrix between and . Based on the error set of neurons, the weight parameters of are computed subsequently. The gradient of each weight is calculated in parallel, as defined in Eq. (21):
(21) 
The gradient of the bias weight is computed in Eq. (22):
(22) 
Based on the gradient values, each weight is updated in Eq. (23):
(23) 
where is the learning rate of the CNN network.
4.2 Implementation of Innerlayer Parallel Training
We implement the innerlayer parallel training of BPTCNN on computing nodes equipped with multicore CPUs. Based on the parallel models proposed in the previous section, computing tasks of these training phases are decomposed into several subtasks. The workflow of task decomposition for a CNN subnetwork is illustrated in Fig. 9.
(1) Task priority marking.
According to the logical and data dependence of the decomposed subtasks, a task Directed Acyclic Graph (DAG) is created. With the threadlevel load balancing and completion time minimization as the optimization goal, the priorities of tasks in the task DAG are marked. We set a maximum value for the entrance task of the task DAG graph. Then, the priorities of tasks in each level are set according to the tasks’ level. Specifically, upstream tasks’ priorities are higher than that of downstream tasks, while tasks at the same level have the same priority.
(2) Task scheduling and execution.
Based on the priorities of tasks, we allocate these tasks to threads on the multicore CPU platform using the priority task scheduling algorithm [26]. Based on the task priorities, tasks of the entire CNN training network are allocated to different threads on the different CPU cores. An example of the task scheduling of the CNN training network with multithreaded parallelism is illustrated in Fig. 10.
5 Experiments
5.1 Experimental Settings
All of the experiments are conducted on a distributed computing cluster built with 30 highperformance computing nodes, and each of them is equipped with Intel Xeon Nehalem EX CPU and 48 GB main memory, respectively. Each NehalemEX processor features up to 8 cores inside a single chip supporting 16 threads and 24MB of cache. Comparison experiments are conducted to evaluate the proposed BPTCNN by comparing with Tensorflow CNN [16], DisBelief [14], and DCCNN [23]
algorithms, in terms of accuracy and performance evaluation. Largescale public image datasets from ImageNet
[3] with 14,197,122 samples are used in the experiments.5.2 Accuracy Evaluation
We evaluate the accuracy of BPTCNN by comparing with Tensorflow, DisBeilef, and DCCNN. For each algorithm, fivefold experiments on the ImageNet dataset with 100 epoch iterations are conducted and the average values of accuracy and the Area Under the Curve (AUC) are compared. The experimental results of accuracy and AUC of the comparison algorithms are presented in Fig. 11.
As shown in Fig. 11 (a) and (b), BPTCNN achieves the similar accuracy with compared algorithms, as well as higher AUC values in most of the cases. The average value of accuracy of BPTCNN is equal to 0.744, while that of Tensorflow, DisBelief, and DCCNN is 0.721, 0.722, and 0.639, respectively. Because of the parallel training and global weight updating, BPTCNN narrows the impact of local overfitting and obtains more stable and robust global network weights. As the epoch of iteration training increases, both of accuracy and AUC of BPTCNN steadily increases. AUC of BPTCNN is greater than that of Tensorflow by 5.91%, on average, 9.56% higher than that of DisBelief, and 10.09% higher than that of DCCNN. Therefore, compared with Tensorflow, DisBelief, and DCCNN, BPTCNN does not reduce the accuracy of CNNs. Moreover, benefitting from the global weight updating strategy, BPTCNN achieves more robustness than compared algorithms.
5.3 Performance Evaluation
5.3.1 Execution Time of Comparison Algorithms
The execution time of these algorithms is compared using 100 training iterations in various configurations: different data sizes and computing cluster scales. The comparison of the average execution time of each algorithm in each case is shown in Fig. 12.
As can be seen in Fig. 12 (a) and (b), the proposed BPTCNN algorithm achieves higher performance than the compared algorithms in most of the cases. Benefitting from the dataparallelism strategy, when the data size increases, the volume of each partitioned subset on each computer is slightly increased, leading to a slight increase in the average workload of each computer. For example, when the number of training samples increases from 100,000 to 700,000, the execution time of BPTCNN rises from 62.77s to 307.35s, while that of Tensorflow increases from 54.38s to 454.23s, and that of DCCNN sharply increases from 91.21s to 929.74s. In addition, taking advantage of the IDPA strategy, the proposed BPTCNN algorithm owns scalability over the compared algorithms. When the scale of the computing cluster expended, the execution time of BPTCNN and Tensorflow is significantly reduced. Experimental results indicate that BPTCNN achieves high performance and scalability in distributed computing clusters.
5.3.2 Execution Time Comparison for Fixed Accuracy
Considering the different training architectures of various comparison algorithms, we discuss how these algorithm trade off performance and accuracy with resource consumption. We discuss the training iterations required for each algorithm to achieve different accuracy, and then measure the execution time each algorithm takes under different computing resources. The comparison results are shown in Table I and Fig. 13.
Accuracy  BPTCNN  Tensorflow  DisBelief  DCCNN 

0.650  7  7  9  12 
0.700  18  15  22  28 
0.750  42  64  85  147 
0.800  97  187  211   
From Table I, all algorithms use similar iterations to achieve an accuracy of 0.650. However, to achieve higher accuracy, BPTCNN requires fewer iterations than Tensorflow, DisBelief, and DCCNN. For example, BPTCNN requires 42 iterations to achieve an accuracy of 0.750, while Tensorflow uses 64 and DisBelief uses 85, and DCCNN requires up to 147. In addition, to achieve an accuracy of 0.750, we compare the actual execution times of each algorithm under different numbers of computing nodes and CPU cores, as shown in Fig. 13 (a) and (b). When the scale of the computing cluster and CPU cores expended, the execution time of BPTCNN and Tensorflow is significantly reduced. In contrast, the execution time of DisBelief and DCCNN algorithms is increased when the cluster scale reaches a certain amount (e.g., 2535), which is caused by the more data communication among the increasing machines. Experimental results indicate that BPTCNN achieves higher accuracy and performance than other algorithms using the same computing resource. Moreover, when the scale of computing nodes and CPU cores increases, the performance benefits of BPTCNN is more noticeable.
5.3.3 Execution Time of BPTCNN with Different Strategies
We further evaluate the performance of the proposed BPTCNN under different global weight update and data partitioning strategies. To evaluate the effectiveness of the IDPA strategy, we perform the same work using the Uniform Data Partitioning and Allocation (UDPA) strategy, where the training dataset is uniformly partitioned into partitions and allocated into the computers. Comparison experiments are conducted in terms of data size, computing cluster scale, CNN network scale, and thread size. The average execution time of BPTCNN with different strategies is presented in Fig. 14.
In Fig 14 (a), 7 different scales of CNN network are constructed in the experiments, as described in Table II. Here “layers(Conv)” and “filters(Conv)” denote the number of the convolutional layer and that of filters at each layer, respectively. “layers(FC)” and “neurons(FC)” denote the number of layers in the fullyconnected layers and number of neurons in each layer, respectively.
Scales  case1  case2  case3  case4  case5  case6  case7 

layers(Conv)  2  4  6  8  8  10  10 
filters(Conv)  4  4  8  8  10  10  12 
layers(FC)  3  3  5  5  7  7  7 
neurons(FC)  500  1000  1500  1500  2000  2000  2000 
By comparing strategies AGWU and SGWU, the execution time of BPTCNN using AGWU is obviously lower than SGWU in most cases. In AGWU, because of the asynchronous update of the global weight set, each computing node uses the minimum time to wait for the global weight update and trains almost continuously. In addition, by comparing data partitioning strategies IDPA and UDPA, benefitting from the incrementa data partitioning, the workload of computing nodes stays well balanced, which further shortens the waiting time among different machines. As shown in each case in Fig. 14, the execution time of BPTCNN with IDPA is significantly lower than that with UDPA strategy. Hence, BPTCNN using AGWU+IDPA strategies exhibits the most efficient performance against other cases. Moreover, with the increase of data size or CNN network scale, the execution time of BPTCNN using AGWU+IDPA maintains a slow rise. When the computing cluster scale and the number of threads on each machine increases, the benefits of AGWU+IDPA are more noticeable. Taking advantage of the IDPA and AGWU strategies, BPTCNN achieves significant strength in terms of performance.
5.4 Data Communication and Workload Balancing
We evaluate the proposed BPTCNN architecture in the view of data communication overhead and workload balancing by comparing with Tensorflow, DisBeilef, and DCCNN algorithms. 600,000 training samples are used in the experiments, and the number of computing nodes increases from 5 to 35 in each case. Experiment results of data communication and workload balancing are shown in Fig. 15.
It is clear from Fig. 15 (a) and (b) that, in most cases, BPTCNN owns significant workload balancing and lower data communication costs than other algorithms. Due to the use of the IDPA strategy in BPTCNN, there is only communication overhead between the computing nodes for transmitting local/global weight parameters, and no training sample migration is required. Hence, as the number of computing nodes increases from 5 to 35, the communication overhead of BPTCNN slowly increases from 2.35 MB to 11.44 MB. In contrast, due to dynamic resource scheduling, Tensorflow generates communication overhead from 2.73 MB for 5 computers to 45.23 MB between 35 computers. Moreover, to achieve workload balancing, DisBelief and DCCNN use data migration operations during training, which results in heavy communication overhead between computers.
We compare the workload balance of each algorithm under different scales of the computing cluster, as shown in Fig. 15 (b). Our BPTCNN architecture considers the heterogeneity of compute nodes and allocates corresponding workloads based on the actual computing power of each compute node. Hence, as the scale of the cluster increases, BPTCNN achieves a stable workload balance, keeping between 0.89 and 0.80. In contrast, without heterogeneityaware data allocation, the workload of other comparison algorithms is not as balanced as BPTCNN. The unbalanced workload further leads to long waiting time for synchronization and more execution time for the entire CNN network. Experimental results demonstrate that BPTCNN significantly improves the workload balance of the distributed computing cluster with acceptable communication overhead.
6 Conclusions
This paper presented a bilayered parallel training architecture to accelerate the training process of largescale CNNs. In the outerlayer parallel training, the performance of the entire CNN network is significantly improved based on dataparallelism optimization, where the issues of data communication, workload balance, and synchronization, are well addressed. In the innerlayer parallelism, the training process of each CNN subnetwork is further accelerated using taskparallelism optimization. Extensive experimental results on largescale datasets indicate that the proposed BPTCNN effectively improves the training performance of CNNs in distributed computing clusters with minimum data communication and synchronization waiting.
For future work, we will further concentrate on scalable CNN models and the parallelization of deep learning algorithms on highperformance computers. In addition, development of deep learning algorithms specific applications is also an interesting topic, such as scalable CNNs for images and LSTMs for time series.
Acknowledgment
This research is partially funded by the National Key R&D Program of China (Grant No. 2016YFB0200201), the Key Program of the National Natural Science Foundation of China (Grant No. 61432005), the National Outstanding Youth Science Program of National Natural Science Foundation of China (Grant No. 61625202), the International Postdoctoral Exchange Fellowship Program (Grant No. 2018024), and the China Postdoctoral Science Foundation funded project (Grant No. 2018T110829). This work is also supported in part by NSF through grants IIS1526499, IIS1763325, CNS1626432, and NSFC 61672313.
References
 [1] A. Coates, B. Huval, T. Wang, D. J. Wu, and A. Y. Ng, “Deep learning with cots hpc systems,” in ICML’13, 2013, pp. 1337–1345.
 [2] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis, “Largescale matrix factorization with distributed stochastic gradient descent,” in KDD’11, 2011, pp. 69–77.
 [3] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li, “Imagenet: A largescale hierarchical image database,” in IEEE CVPR’09, 2009, pp. 248–255.
 [4] G. Heigold, V. Vanhoucke, A. Senior, P. Nguyen, M. Ranzato, M. Devin, and J. Dean, “Multilingual acoustic models using distributed deep neural networks,” in ICASSP’13, 2013, pp. 8619–8623.
 [5] L. Cavigelli, D. Gschwend, C. Mayer, S. Willi, B. Muheim, and L. Benini, “Origami: A convolutional network accelerator,” in Proceedings of the 25th edition on Great Lakes Symposium on VLSI, 2015, pp. 199–204.
 [6] L. Song, Y. Wang, Y. Han, X. Zhao, B. Liu, and X. Li, “Cbrain: a deep learning accelerator that tames the diversity of cnns through adaptive datalevel parallelization,” in DAC’16, 2016, p. 123.
 [7] B. Recht, C. Re, S. Wright, and F. Niu, “Hogwild: A lockfree approach to parallelizing stochastic gradient descent,” in NIPS’11, 2011, pp. 693–701.
 [8] S. Fan, J. Fei, and L. Shen, “Accelerating deep learning with a parallel mechanism using cpu+mic,” International Journal of Parallel Programming, pp. 1–14, 2017.
 [9] L. Jin, Z. Wang, R. Gu, C. Yuan, and Y. Huang, “Training large scale deep neural networks on the intel xeon phi manycore coprocessor,” in IEEE IPDPS’14, 2014, pp. 1622–1630.
 [10] Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. Ganger, and E. P. Xing, “More effective distributed ml via a stale synchronous parallel parameter server,” in NIPS’13, 2013, pp. 1–9.
 [11] J. Liu, H. Wang, D. Wang, Y. Gao, and Z. Li, “Parallelizing convolutional neural networks on intel many integrated core architecture,” in International Conference on Architecture of Computing Systems, 2015, pp. 71–82.

[12]
A. A. Huqqani, E. Schikuta, S. Ye, and P. Chen, “Multicore and gpu parallelization of neural networks for face recognition,”
Procedia Computer Science, vol. 18, pp. 349–358, 2013.  [13] D. Strigl, K. Kofler, and S. Podlipnig, “Performance and scalability of gpubased convolutional neural networks,” in Euromicro Conference on Parallel, Distributed and Networkbased Processing, 2010, pp. 317–324.
 [14] J. Dean, G. Corrado, and R. M. et al., “Large scale distributed deep networks,” in NIPS’12, 2012, pp. 1223–1231.
 [15] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in ACMMM’14, 2014, pp. 675–678.
 [16] M. Abadi, A. Agarwal, and P. Barham, “Tensorflow: Largescale machine learning on heterogeneous distributed systems,” CoRR, vol. abs/1603.04467, no. 5, pp. 725–730, 2016.

[17]
M. Mohammadi, A. Krishna, N. S., and S. K. Nandy, “A hardware architecture for radial basis function neural network classifier,”
IEEE TPDS, vol. 29, no. 3, pp. 1829–1844, 2017.  [18] I.H. Chung, T. N. Sainath, B. Ramabhadran, and M. P. et al., “Parallel deep neural network training for big data on blue gene/q,” IEEE TPDS, vol. 28, no. 6, pp. 1703–1714, 2017.
 [19] M. Sankaradas, V. Jakkula, and S. Cadambi, “A massively parallel coprocessor for convolutional neural networks,” in ASAP’09, 2009, pp. 53–60.
 [20] J. Bilski and J. Smolag, “Parallel architectures for learning the rtrn and elman dynamic neural networks,” IEEE TPDS, vol. 26, no. 9, pp. 2561–2570, 2015.
 [21] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman, “Project adam: Building an efficient and scalable deep learning training system,” in USENIX OSDI’14, 2014, pp. 571–582.
 [22] Y. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energyefficient reconfigurable accelerator for deep convolutional neural networks,” IEEE Journal of Solidstate Circuits, vol. 52, no. 1, pp. 127–138, 2017.
 [23] S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi, “A dynamically configurable coprocessor for convolutional neural networks,” in ISCA’10, 2010, pp. 247–257.
 [24] Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng, “On optimization methods for deep learning,” in ICML’11, 2011, pp. 265–272.
 [25] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola, “Parallelized stochastic gradient descent,” in NIPS’10, 2010, pp. 1–9.
 [26] L. Zhang, K. Li, C. Li, and K. Li, “Biobjective workflow scheduling of the energy consumption and reliability in heterogeneous computing systems,” Information Sciences, vol. 379, pp. 241–256, 2017.
Comments
There are no comments yet.