In recent years, Deep Learning (DL) techniques have achieved promising results in various domains [1, 2]. The Convolutional Neural Network (CNN) is an important branch of DL. Benefitting from large-scale training datasets and complex training networks, CNNs achieve high accuracy and are widely applied in domains such as image classification, speech recognition, and text processing. However, the training process of a CNN is very time-consuming, because large numbers of training samples and iterative operations are required to obtain high-quality weight parameters [6, 7]. It is therefore critical to accelerate the training process and improve the performance of CNNs. Cloud computing, high-performance computing clusters, and supercomputing provide strong computing power for various applications [8, 9, 10]. A critical issue is thus how to design an effective parallel CNN training model based on distributed computing clusters that addresses the challenges of data communication, synchronization, and workload balancing while maintaining high performance and high accuracy.
Numerous enhancements have been proposed to accelerate CNN and DL algorithms by improving the execution environments, mainly focusing on two directions [11, 12, 13, 8, 14, 15, 16]: (1) using multi-core CPU or GPU platforms to provide high-speed processing capacity, and (2) using multiple distributed computers to increase the available computing power. The CPU/GPU-based methods [12, 13, 8] can perform more arithmetic operations and are suitable for training modestly sized DL models. However, a known limitation of these methods is that the training acceleration is small when the scale of the training datasets or DL models exceeds the GPU memory capacity. To process large-scale datasets and DL models, some distributed-architecture-based solutions were proposed, such as DistBelief and TensorFlow. Each of these approaches achieves significant progress from a different perspective, with its own implicit complications in engineering and implementation. Considerable improvements might be obtained by combining the advantages of both directions, making use of the computing power of multiple machines in a distributed cluster and of the high-performance CPUs or GPUs on each machine.
Multiple challenges exist in this regard. Firstly, the entire CNN model contains multiple CNN subnetwork models that are trained in parallel on different machines, which requires synchronization and integration operations. The synchronization waiting among subnetwork models must be minimized while the accuracy of the integrated model is guaranteed. Moreover, high-quality CNN models often require large-scale training datasets and a large number of iterations. Hence, an effective parallel mechanism should be carefully designed to minimize the data communication overhead between different iteration steps, between tasks in different threads/CPUs/GPUs, and between distributed computers. Furthermore, considering the heterogeneity of distributed computing clusters, computing nodes might be equipped with different CPU or GPU structures and have different training speeds. It must therefore be decided how to partition the training dataset among these computers and how many parallel training tasks to start on each computer, so as to maximize the utilization of computing power and the workload balance of each computer. Finally, a scalable parallel mechanism must be designed according to the characteristics of the CNN network, so that it can be easily deployed on elastic computing clusters and meet the application requirements of different areas.
In this paper, we aim to address the above challenges and fully utilize the parallel computing capacity of computing clusters and multi-core CPU to accelerate the training process of large-scale CNNs. We propose a Bi-layered Parallel Training-CNN (BPT-CNN) architecture in distributed computing environments. The outer-layer parallelism is deployed in a distributed cluster to train data subsets in parallel with minimal data communication and maximal workload balance. The inner-layer parallelism is performed on each computer using multi-threaded platforms. Experiments on large-scale datasets indicate the advantages of the proposed BPT-CNN in terms of performance, data communication, workload balance, and scalability. The contributions of this paper are summarized as follows.
In the outer-layer parallelism, an Incremental Data Partitioning and Allocation (IDPA) strategy is proposed to maximize the workload balance and minimize data communication among computers, where large-scale training datasets are partitioned and allocated to computers in batches according to their computing power.
An Asynchronous Global Weight Updating (AGWU) strategy is proposed to integrate CNN subnetwork models from different computers and to address the synchronization waiting problem during the global weight update process.
In the inner-layer parallelism, two time-consuming training steps of the CNN model are parallelized on each computer based on task-parallelism: the convolutional layer and the local weight training process.
To achieve thread-level load balancing and minimize the waiting time of critical paths, we introduce task decomposition and scheduling strategies for CNN training tasks with multi-threaded parallelism.
The rest of the paper is organized as follows. Section 2 reviews the related work. Section 3 presents the BPT-CNN architecture and the outer-layer parallelization process. Section 4 describes the inner-layer parallel training of BPT-CNN. Experimental results and evaluations are discussed in Section 5. Finally, Section 6 concludes the paper with a discussion of future work and research directions.
2 Related Work
Previous works have proposed various hardware designs for accelerating CNNs and other deep learning algorithms [17, 8]. FPGAs have been widely explored as hardware accelerators for CNNs because of their reconfigurability and energy efficiency. A parallel solution of CNNs was designed on a many-core architecture, in which the model is parallelized on the Intel Xeon Phi coprocessor platform with OpenMP. Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms. Chung et al. proposed a parallel Deep Neural Network (DNN) training approach for big data on the IBM Blue Gene/Q system, in which issues regarding the programming model and data-dependent imbalances were addressed. A massively parallel coprocessor was also designed as a meta-operator for CNNs, consisting of parallel 2D convolution primitives and programmable units.
To efficiently handle large-scale CNNs and big data, distributed-architecture-based solutions were implemented in [20, 14]. Adam is an efficient and scalable deep learning training system that optimizes and balances workload computation and communication through whole-system co-design. An energy-efficient reconfigurable accelerator was also presented for deep CNNs. To minimize the energy cost of data movement for any CNN shape, a processing row-stationary dataflow was introduced to reconfigure the computation mapping of a given shape. Dean et al. introduced a distributed system (termed DistBelief) for training large neural networks on massive amounts of data. DistBelief uses two complementary types of parallelism: parallelism across multiple models and parallelism within each model. In addition, an asynchronous Stochastic Gradient Descent (SGD) procedure was employed to support a large number of model replicas.
TensorFlow is a popular framework for large-scale machine learning on heterogeneous distributed systems. The computation model of TensorFlow is based on dataflow graphs with mutable state, where the graph nodes can be distributed and executed in parallel on different workers, multi-core CPUs, and general-purpose GPUs. In addition, TensorFlow uses a declarative programming paradigm, so developers can focus on the symbolic definition and computation logic instead of the implementation details. However, TensorFlow has some shortcomings: (a) it attempts to occupy all available GPU memory in the initial phase, which makes it difficult to share the machines running a TensorFlow program with other applications, and (b) many high-level operations and interfaces in TensorFlow are nested and chaotically packaged, making customized programming difficult. Hence, we study the parallelism ideas of the DistBelief and TensorFlow approaches, and implement a bi-layered parallel training architecture for large-scale CNNs by combining the advantages of both distributed computing and CPU/GPU parallel computing.
Compared with existing efforts, the proposed BPT-CNN architecture fully utilizes the parallel capacity of both the distributed cluster and the multi-core CPUs of individual machines. Benefitting from the proposed IDPA and AGWU strategies, we effectively improve the training performance of the CNN model and address the problems of data communication, synchronization, and workload balancing in the distributed cluster. Moreover, through the task decomposition and scheduling strategies, BPT-CNN achieves the optimization objectives of thread-level load balancing and minimal waiting time on critical paths.
3 BPT-CNN Architecture for CNNs
3.1 Convolutional Neural Networks
The CNN model is one of the most representative network structures of DL technologies and has become a hot topic in various fields of science. The common architecture of a CNN includes two components: a feature extractor and a fully-connected classifier. In a convolutional layer, each batch of the input dataset is analyzed to obtain different abstract features. Consider an input $X$ with scale ($D_x \times H_x \times W_x$), where $D_x$, $H_x$, and $W_x$ refer to the depth, height, and width of $X$. Assuming that a filter with scale ($D_f \times H_f \times W_f$) is used in the convolutional layer to extract a feature map $A$, then $a_{i,j}$ denotes the value in the $j$-th column of the $i$-th row of the current feature map, as calculated in Eq. (1):
$$a_{i,j} = f\left(\sum_{d=0}^{D_f-1}\sum_{m=0}^{H_f-1}\sum_{n=0}^{W_f-1} w_{d,m,n}\, x_{d,i+m,j+n} + w_b\right), \tag{1}$$
where $D_f$, $H_f$, and $W_f$ are the depth, height, and width of the current filter, $w_{d,m,n}$ and $w_b$ are the filter weights and the bias weight, and $f$ is an activation function, such as the sigmoid, tanh, or ReLU function.
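As a minimal sketch of the feature-map computation described above, the following pure-Python function evaluates one output value per position for a single filter. It assumes a stride of 1, no zero padding, and nested lists indexed as `x[d][row][col]`; the function name `conv_feature_map`, the `bias` parameter, and the default `tanh` activation are illustrative choices, not part of BPT-CNN.

```python
import math


def conv_feature_map(x, w, bias=0.0, activation=math.tanh):
    """Stride-1, unpadded convolution of a D x H x W input with one filter.

    x[d][row][col] is the input, w[d][m][n] the filter (same depth as x).
    Returns the 2-D feature map as a list of rows.
    """
    depth, h_f, w_f = len(w), len(w[0]), len(w[0][0])
    h_out = len(x[0]) - h_f + 1
    w_out = len(x[0][0]) - w_f + 1
    feature_map = []
    for i in range(h_out):
        row = []
        for j in range(w_out):
            # Weighted sum over the filter window anchored at (i, j).
            s = bias
            for d in range(depth):
                for m in range(h_f):
                    for n in range(w_f):
                        s += w[d][m][n] * x[d][i + m][j + n]
            row.append(activation(s))
        feature_map.append(row)
    return feature_map
```

Passing `activation=lambda v: v` returns the raw weighted sums, which is convenient for checking the indexing by hand.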
A pooling layer is applied to each feature map to reduce the feature dimensions. There exist various pooling methods, e.g., max pooling and mean pooling. The fully-connected layer is the classification layer of a CNN, where all output features of the convolutional or pooling layers are connected to all hidden neurons with weight parameters. An example of a CNN architecture is illustrated in Fig. 1.
Massive training datasets and iterative training process guarantee the high precision of CNN. However, they become the performance bottleneck when the training network structure is complex and the computing power is insufficient. The characteristics and the corresponding limitations of CNNs are summarized as follows.
Massive weight parameters: There are a large number of weight parameters in the network. If there are different connection structures among layers, the network will become more complex, which requires more computation time cost.
Large-scale training datasets and massive iterative operations for weight updating: High accuracy of CNNs requires high-quality weights, which depends on large-scale training datasets and iterative training.
3.2 Bi-layered Parallel Training Architecture
To accelerate the training process of CNNs, we propose a bi-layered parallel training architecture for large-scale CNNs. We describe the distributed computing environment and training process of the BPT-CNN architecture.
3.2.1 BPT-CNN Architecture
BPT-CNN architecture is composed of two main components: (a) an outer-layer parallel training for multiple CNN subnetworks on separate data subsets, and (b) an inner-layer parallel training for each subnetwork. The proposed BPT-CNN architecture is illustrated in Fig. 2.
(1) Outer-layer parallel training. A data-parallelism strategy is adopted in the outer-layer parallel training process, where a large-scale dataset is split into multiple subsets and allocated to different computing nodes to be trained in parallel. At the parameter server, the global weight parameters of the entire CNN network are updated depending on the local weights from each training branch. The updated global weight parameters are shared to each machine for the next iterative training.
(2) Inner-layer parallel training. The inner layer adopts a task-parallelism strategy to further accelerate the training process of each CNN subnetwork on each computer. Two time-consuming computation tasks are parallelized, including convolutional layer and the local weight training process. Computation tasks on these processes are decomposed depending upon their logical and data dependence, and are executed with multi-threaded parallelism.
3.2.2 Distributed Computing Cluster for BPT-CNN
We construct a distributed computing cluster for the proposed BPT-CNN architecture to efficiently handle massive training datasets and large-scale CNN models. The distributed cluster mainly consists of a main server, several computing nodes with multi-core CPUs, and a parameter server, as shown in Fig. 3.
The main server is responsible for CNN training task management as well as data partitioning and allocation. It replicates the CNN training network and allocates a copy to each computing node. Meanwhile, the training dataset is split into a series of subsets that are allocated to the corresponding computing nodes. During the parallel training, the main server monitors the training time costs on the computing nodes and adjusts the data allocation with the optimization objective of minimizing the synchronization delay.
On each computing node, samples in the subset are calculated by the corresponding CNN subnetwork, while the network weight parameters are trained as a local weight set. The training process on each computer is executed in parallel. In addition, each computer is equipped with a multi-core CPU platform. In an inner-layer parallel training, the training process of each CNN subnetwork is further parallelized using multi-threaded parallelism.
The parameter server collects the trained local weight parameters from each computing node and updates the global weight parameters. Then, the updated global weight set is re-allocated to each computing node for the next epoch of training.
3.3 Outer-layer Parallel Training of BPT-CNN
In BPT-CNN’s outer-layer parallel training architecture, we address critical issues of distributed and parallel computing, including data communication, synchronization, and workload balancing. Firstly, considering the heterogeneity of the distributed computing cluster, we propose an incremental data partitioning and allocation strategy to maximize cluster’s workload balancing and minimize data communication overhead. In addition, we propose an asynchronous global weight updating strategy to further minimize the synchronous wait in the global weight update process.
3.3.1 Incremental Data Partitioning and Allocation Strategy
Considering the heterogeneity of computing nodes and their different training speeds, we propose an Incremental Data Partitioning and Allocation (IDPA) strategy based on heterogeneity awareness, which maximizes the workload balance of the distributed cluster and minimizes the synchronization waiting in the global weight update process. As there are no dependencies between training samples, they can be partitioned and allocated in batches rather than all at once, according to the computing power of the computing nodes. Assume that there are $N$ training samples and $m$ computing nodes in a distributed cluster, and that $K$ training iterations are required for the CNN model. Let $A$ ($A < K$) be the number of batches of data partitioning, that is, the entire training dataset is incrementally partitioned and allocated in $A$ rounds, and each time $N/A$ new samples are processed.
Initially, we take the first $N/A$ samples as the training dataset of the first batch. Before the first training iteration, we use a constant characteristic of the computing nodes to represent their heterogeneity, namely the measured CPU/GPU frequency. Let $\lambda_j$ be the CPU/GPU frequency of computing node $c_j$; the number of samples $n_j^{(1)}$ that will be partitioned and allocated to $c_j$ is calculated as:
$$n_j^{(1)} = \frac{N}{A} \times \frac{\lambda_j}{\sum_{j'=1}^{m} \lambda_{j'}}. \tag{2}$$
After receiving the training samples, each computing node begins the first training iteration. At the same time, we monitor the execution time each computing node needs to complete the iteration and evaluate its actual computing power. Because other applications from different users might be running on the computing nodes, evaluating the computing power from the actual execution time is more accurate than predicting it from the CPU/GPU frequency. Therefore, after the first training iteration, we partition the training dataset according to the actual computing power of the machines. Let $T_j^{(a-1)}$ be the execution time of computing node $c_j$ to train its $n_j^{(a-1)}$ samples in the current iteration; then the average execution time of $c_j$ for one sample is $t_j = T_j^{(a-1)} / n_j^{(a-1)}$. We collect the execution times of the computing nodes in the current iteration and predict the execution time required by all computing nodes in the next iteration. Note that $N/A$ new samples will be partitioned and allocated in the $a$-th batch, so a total of $aN/A$ samples reside on the $m$ computing nodes. The average execution time of the computing nodes in the $a$-th training iteration is calculated as:
$$\bar{T}^{(a)} = \frac{a N}{A\, m}\,\bar{t}, \tag{3}$$
where $\bar{t}$ is the average execution time for training a sample by any computing node. To minimize the synchronization latency among computing nodes during the global weight update process, all nodes should complete each iteration at as close to the same time as possible. Assume that $n_j^{(a)}$ samples reside on $c_j$ after the $a$-th batch partitioning and allocation; the value of $n_j^{(a)}$ is calculated as:
$$n_j^{(a)} = \frac{\bar{T}^{(a)}}{t_j}. \tag{4}$$
Accordingly, the number of samples $\Delta n_j^{(a)}$ that $c_j$ can accept in the $a$-th batch allocation according to its actual computing power is calculated as:
$$\Delta n_j^{(a)} = n_j^{(a)} - n_j^{(a-1)}. \tag{5}$$
This process is repeated $A$ times until the entire training dataset is partitioned and allocated to the heterogeneous computing cluster. By considering the heterogeneity of computing nodes, each computing node receives a number of training samples that corresponds to its actual computing power. The total number of samples allocated to each computing node $c_j$ is denoted as $n_j$, with $\sum_{j=1}^{m} n_j = N$. The detailed steps of the IDPA strategy are described in Algorithm 1.
Benefitting from the IDPA strategy, the training dataset is well partitioned and allocated to the computing nodes, allowing them to complete each iteration in nearly the same duration and thus achieving minimal synchronization delay and maximum workload balancing. Moreover, no data migration is required among computing nodes during the training process, so no unnecessary data communication overhead is incurred.
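The incremental allocation can be sketched as follows. This is an illustration under stated assumptions rather than the paper's Algorithm 1: the function names are hypothetical, sample counts are kept as real numbers (rounding is left out), and the common target duration is chosen so that the new per-node totals add up exactly to the number of samples in the cluster, which amounts to averaging the per-sample times harmonically. Negative increments, which would require data migration, are not handled.

```python
def initial_allocation(batch_size, frequencies):
    """First batch: split the samples in proportion to CPU/GPU frequency."""
    total = sum(frequencies)
    return [batch_size * f / total for f in frequencies]


def next_allocation(held, iter_times, batch_size):
    """Later batch: choose per-node totals so all nodes finish together.

    held[j]       -- samples currently on node j
    iter_times[j] -- measured duration of node j's last iteration
    Returns (new_totals, increments) as lists of floats.
    """
    # Measured per-sample training time of each node.
    per_sample = [t / n for t, n in zip(iter_times, held)]
    new_total = sum(held) + batch_size
    # Target duration T with sum_j (T / t_j) == new_total (assumption).
    target = new_total / sum(1.0 / t for t in per_sample)
    totals = [target / t for t in per_sample]
    increments = [tot - h for tot, h in zip(totals, held)]
    return totals, increments
```

For example, with two nodes of frequency 2.0 and 3.0 and batches of 100 samples, the first batch is split 40/60. If the measured iteration times are then 0.8 s and 0.6 s, the second batch shifts the balance further toward the faster node so that both are predicted to finish the next iteration at the same time.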
Recall that $K$ iterations are required for the CNN model, that is, each sample is used on average $K$ times to train the weight parameter set of the CNN network model. After the $A$ iterations of the data partitioning process, each computing node $c_j$ continues to execute the remaining iterations on its $n_j$ samples. Since these samples are allocated to the computing nodes incrementally, a sample that arrives in the $a$-th batch is trained in only $A - a + 1$ of the $A$ iterations, so the average number of training times per sample in these iterations is $\frac{A+1}{2}$ instead of $A$. Therefore, we recalculate the number of remaining iterations $K'$ of the training process, as defined below:
$$K' = K - \frac{A+1}{2}. \tag{6}$$
The total number of training iterations of the CNN model is $A + K'$. To simplify the expression, we denote $A + K'$ as $K$ in the remaining context.
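The iteration bookkeeping can be checked with a few lines of Python. The sketch assumes equally sized batches, so that a sample arriving in batch a (1-based) takes part in the iterations a through A of the partitioning phase; the function names are hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def average_trainings_during_partitioning(num_batches):
    """Mean number of partitioning-phase iterations a sample takes part in."""
    # Batch a joins at iteration a, so it is trained num_batches - a + 1 times.
    counts = [num_batches - a + 1 for a in range(1, num_batches + 1)]
    return Fraction(sum(counts), num_batches)


def remaining_iterations(required, num_batches):
    """Iterations still needed so each sample is trained `required` times on average."""
    return required - average_trainings_during_partitioning(num_batches)
```

With 5 batches and 100 required training passes per sample, the partitioning phase contributes an average of 3 passes, so 97 further iterations remain and the run takes 102 iterations in total.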
3.3.2 Global Weights Updating Strategies
There are massive connections with different weight parameters among all layers in a CNN network. We define these weight parameters as a weight set. We need to collect the training results on each computing node to update the global weight set for the entire CNN network. In this section, we propose two global weight updating strategies for the CNN network. We respectively define the local weight of each CNN subnetwork and the global weight of the entire CNN network as follows.
Definition 1: Local weight set. The weight parameters among all training layers of a CNN training network are denoted as a weight set. At each computing node, the weight set of a CNN subnetwork is defined as the local weight set of the corresponding subnetwork. The local weight set is trained based on the related data subset. In a distributed computing cluster, there is a local weight set on each computing node, which is updated after training a sample.
Definition 2: Global weight set. The weight set of the entire CNN network is defined as the global weight set. We provide a parameter server for calculating the global weight set by combining parts or all of the local weight sets. The global weight set is aggregated by each local weight set and shared to all computing nodes for the next epoch of training.
(1) Synchronous global weight updating strategy.
We propose a Synchronous Global Weight Updating (SGWU) strategy for BPT-CNN, where each computing node trains all the samples of the current subset and updates the local weight set for an iteration. The local weight sets trained by all computing nodes in the current iteration are gathered at the parameter server, where a new version of the global weight set is generated. The workflow of the SGWU strategy is illustrated in Fig. 4.
Different local weight sets are trained on the corresponding subsets on different computers and therefore contribute differently to the global weight set. We verify the accuracy of each CNN subnetwork after completing an epoch of local iteration training and use it as the contribution of the local weight set. After all computers finish an epoch of local iteration training, the latest local weight set trained on each computer is aggregated at the parameter server to update the global weight set to a new version $W^{(i)}$. The global weight set for the $i$-th epoch of iteration training is defined in Eq. (7):
$$W^{(i)} = \frac{\sum_{j=1}^{m} Q_j^{(i-1)}\, w_j^{(i-1)}}{\sum_{j=1}^{m} Q_j^{(i-1)}}, \tag{7}$$
where $w_j^{(i-1)}$ and $Q_j^{(i-1)}$ are the local weight set and the corresponding accuracy of the CNN subnetwork on computer $c_j$, obtained in the $(i-1)$-th epoch of local iteration training.
In a distributed computing cluster, especially one equipped with heterogeneous computers, the SGWU strategy inevitably faces the synchronization problem during the global weight update process, even though the IDPA strategy is used to maximize the cluster's workload balance. Because of their different available computing capabilities, computers need different amounts of time to execute each training iteration. Let $T_j^{(i)}$ be the execution duration of the $i$-th training iteration on computer $c_j$. The waiting time for synchronization of the entire computing cluster is defined in Eq. (8):
$$T_{wait} = \sum_{i=1}^{K}\sum_{j=1}^{m}\left(\max_{1 \le j' \le m} T_{j'}^{(i)} - T_j^{(i)}\right), \tag{8}$$
where $K$ is the number of training iterations and $m$ is the number of computing nodes.
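A small sketch may help make the synchronous scheme concrete. It treats each weight set as a flat list of floats, aggregates local weight sets with their validation accuracies as contributions, and sums the idle time each node spends waiting for the slowest node per iteration. Normalizing by the sum of accuracies is an assumption of this sketch, and both function names are hypothetical.

```python
def sgwu_aggregate(local_weights, accuracies):
    """Accuracy-weighted average of the local weight sets.

    local_weights[j] is node j's flat weight list, accuracies[j] its
    validation accuracy. Dividing by sum(accuracies) is an assumption.
    """
    total_acc = sum(accuracies)
    size = len(local_weights[0])
    return [
        sum(q * w[p] for w, q in zip(local_weights, accuracies)) / total_acc
        for p in range(size)
    ]


def sync_waiting_time(durations):
    """Total idle time at the barriers; durations[i][j] is node j in iteration i."""
    return sum(max(row) - t for row in durations for t in row)
```

For two nodes with equal accuracy, `sgwu_aggregate` reduces to the plain mean of the two weight sets; if one iteration takes 1 s on one node and 3 s on the other, the faster node contributes 2 s of waiting time.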
(2) Asynchronous global weight updating strategy.
To address the synchronization problem of SGWU, we propose an Asynchronous Global Weight Updating (AGWU) strategy. In AGWU, once a computing node completes a training iteration on the local samples, the updated local weight set is submitted to the parameter server to immediately generate a new version of the global weight set, without waiting for other computing nodes. Compared with SGWU, AGWU can effectively solve the synchronization waiting problem without increasing the communication overhead. The workflow of the AGWU strategy is shown in Fig. 5.
Considering the heterogeneity of computing nodes, under the IDPA strategy each computing node may hold a training subset of a different scale. In addition, because of their different training speeds, computing nodes may submit their local weight sets to the parameter server at different time points and receive different versions of the global weight set. For example, for a computing node $c_j$ with $n_j$ samples, assume that $c_j$ trains the local weight set $w_j^{(k)}$ based on version $W^{(k)}$ of the global weight set in the current iteration. During the current training iteration of $c_j$, the global weight set has been updated from $W^{(k)}$ to $W^{(i-1)}$ by other computing nodes. In this case, the low-speed computers train their local weight sets based on an old version of the global weight set, while the high-speed computers train theirs based on a newer version. Denote $\Delta w_j^{(k)}$ ($\Delta w_j^{(k)} = w_j^{(k)} - W^{(k)}$) as the increment between the submitted local weight set $w_j^{(k)}$ and its base-version global weight set $W^{(k)}$. Assume that there is another local weight set $w_{j'}^{(h)}$ on $c_{j'}$ that is trained based on version $W^{(h)}$ of the global weight set, where $k < h \le i-1$. Because $W^{(k)}$ is older than $W^{(h)}$, $\Delta w_j^{(k)}$ should have less impact than $\Delta w_{j'}^{(h)}$ in the process of updating the global weight set. Hence, local weight sets trained from an old version of the global weight set have less impact on the global weight update than those trained from a new version. Therefore, we adopt a time attenuation factor to measure the impact of each local weight set on the current global weight update process. Denote $\gamma_j^{(k)}$ as the time attenuation factor of the local weight set $w_j^{(k)}$ submitted from $c_j$, as calculated in Eq. (9):
where $i-1$ is the latest version of the global weight set, and $k$ is the version of the global weight set that was used to train $w_j^{(k)}$.
Since there is no dependence among the training subsets on different computing nodes, the global weight update process does not require the training results from all computing nodes at the same time. Once a local weight set $w_j^{(k)}$ is submitted, the current global weight set $W^{(i-1)}$ is immediately updated to a new version $W^{(i)}$, without waiting for other computing nodes. The $i$-th version of the global weight set is updated as:
$$W^{(i)} = W^{(i-1)} + \gamma_j^{(k)}\, Q_j\, \Delta w_j^{(k)}, \tag{10}$$
where $\gamma_j^{(k)}\, Q_j\, \Delta w_j^{(k)}$ is the update component from $c_j$, and $Q_j$ is the accuracy of the CNN subnetwork on computer $c_j$, which is evaluated from the output of the current local iteration training.
After obtaining the updated global weight set, the parameter server shares it with the submitting computing node for the next training iteration. Local weight sets subsequently submitted by other computing nodes update the global weight set based on the latest version. The steps of the AGWU strategy of BPT-CNN are described in Algorithm 2.
Compared with the SGWU strategy, each computing node in the AGWU strategy participates in the global weight update process independently, so there is no synchronization waiting problem in AGWU. Furthermore, from the perspective of the entire training process, the update of the global weight set depends on the training results of all computing nodes. According to Eq. (7) and Eq. (10), the global weight set is updated based on the local weight sets and the corresponding accuracy of the trained model in both the SGWU and AGWU strategies.
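The asynchronous scheme can be illustrated with a toy, single-process parameter server. Several parts of this sketch are assumptions: the exponential form of `attenuation` merely stands in for a time attenuation factor that shrinks as the base version gets older, the update adds the attenuated, accuracy-scaled increment between the submitted local weights and their base version, and the class and method names are hypothetical. Locking, networking, and pruning of the version history are omitted.

```python
import math


def attenuation(latest_version, base_version):
    """Assumed decay: 1.0 for a fresh base version, smaller for stale ones."""
    return math.exp(-(latest_version - base_version))


class AsyncParameterServer:
    """Toy parameter server that applies each submission immediately."""

    def __init__(self, weights):
        self.weights = list(weights)
        self.version = 0
        # Past global versions, kept so increments can be computed.
        self.history = {0: list(weights)}

    def pull(self):
        """Return the current version number and a copy of the global weights."""
        return self.version, list(self.weights)

    def push(self, local_weights, base_version, accuracy, decay=attenuation):
        """Merge one node's local weights without waiting for other nodes."""
        base = self.history[base_version]
        gamma = decay(self.version, base_version)
        self.weights = [
            g + gamma * accuracy * (w - b)
            for g, w, b in zip(self.weights, local_weights, base)
        ]
        self.version += 1
        self.history[self.version] = list(self.weights)
        return self.version
```

A node that trained from the latest version contributes its full accuracy-scaled increment, while a node that trained from an older version contributes a damped one, which mirrors the argument in the text that stale local weight sets should have less impact.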
(3) Data communication of global weight updating.
In BPT-CNN, data communication occurs only between each computing node and the parameter server, for updating and sharing the global weight set. In AGWU, to reduce the synchronization cost and data communication overhead from the perspective of computing nodes, each computing node begins a training iteration after receiving a version of the global weight set. It neither participates in the global weight update nor receives a new version before completing the current iteration. In both the SGWU and AGWU strategies, the global weight set is updated for every epoch of iteration training. Therefore, both strategies produce the same data communication overhead. Denote $K$ as the number of CNN training iterations and $m$ as the number of computing nodes; the data communication $C_{SGWU}$ in SGWU and $C_{AGWU}$ in AGWU between the parameter server and all computing nodes is calculated in Eq. (12):
$$C_{SGWU} = C_{AGWU} = 2 \times c_w \times m \times K, \tag{12}$$
where $c_w$ is the unit communication cost for transmitting the global weight set between the parameter server and a computing node. For each update of the global weight set, there are two rounds of data communication: (1) submitting the local weight set from a computing node to the parameter server, and (2) sharing the updated global weight set from the latter to the former.
4 Inner-layer Parallel Training of BPT-CNN
In the inner-layer parallel training of BPT-CNN, we further parallelize the training process of each CNN subnetwork on each computing node. Two time-consuming training steps are parallelized based on task-parallelism: the convolutional layer and the weight training process. In addition, we propose task decomposition and scheduling solutions to achieve thread-level load balancing and to minimize the waiting time of critical paths.
4.1 Parallel Computing Models of CNN Training Process
4.1.1 Parallelization of Convolutional Layer
In the training process of a CNN network, convolutional layers take more than 85.18% of the total training duration, but only train 5.32 - 6.63% of the weight parameters [d14]. Fortunately, the matrix-parallel-based method provides an effective way of performing convolutional operations in parallel. We introduce the parallel mechanism of the convolutional operations into the inner-layer parallel training architecture of BPT-CNN. We use the data partitioning method of the input matrix in CNN and extract all convolution areas from the input matrix. Then, by sharing the filter matrix, all convolution areas are convoluted in parallel with the shared filter matrix.
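Given an input matrix $X$ with the shape of ($D_x \times H_x \times W_x$), where $D_x$, $H_x$, and $W_x$ are the depth, height, and width of $X$, and a filter parameter matrix $F$ with the shape of ($D_f \times H_f \times W_f$), a feature map $A$ is generated via convolutional multiplication on $X$ and $F$. Based on the scales of $X$ and $F$, the height and width of $A$ are calculated as:
$$H_a = \frac{H_x - H_f + 2P}{S} + 1, \qquad W_a = \frac{W_x - W_f + 2P}{S} + 1,$$
where $D_a$, $H_a$, and $W_a$ are the depth, height, and width of $A$, respectively. Based on the scales of $X$, $F$, and $A$, we calculate the number $K_c$ of convolutional operations in the current convolutional layer, which will be executed in parallel. $K_c$ is calculated in Eq. (13):
$$K_c = D_a \times \left(\frac{H_x - H_f + 2P}{S} + 1\right) \times \left(\frac{W_x - W_f + 2P}{S} + 1\right), \tag{13}$$
where $S$ is the stride of the convolutional operation and $P$ is the number of zero-padding laps, which means appending $P$ laps of elements with the value 0 around $X$.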
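To execute these operations in parallel, we need to identify the convolution area of the input matrix for each task. A convolution area $X_{i,j}$ of $X$ is delimited by its begin and end rows and columns. In each convolutional operation task, an element-by-element multiplication is executed on $X_{i,j}$ and $F$ to generate the corresponding element $a_{i,j}$ of $A$. For each element $a_{i,j}$ in $A$, with indexes counted from 0, the location indexes of the convolution area $X_{i,j}$ in the zero-padded $X$ are calculated in Eq. (14):
$$row_{begin} = i \times S, \quad row_{end} = i \times S + H_f - 1, \quad col_{begin} = j \times S, \quad col_{end} = j \times S + W_f - 1. \tag{14}$$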
After obtaining the location indexes of each convolution area, we extract the contents of the different convolution areas and perform the related convolutional operations in parallel, without waiting for previous convolutional operations to finish. These parallel convolutional operations repeatedly and simultaneously read the input and filter matrices from the same memory without updating their contents. Because there is no data dependence among these tasks, different tasks can access different convolution areas simultaneously. An example of the parallel convolutional operation of each CNN subnetwork in BPT-CNN is illustrated in Fig. 6 and the steps of this process are described in Algorithm 1.
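The decomposition into independent convolution-area tasks can be sketched with Python's standard `concurrent.futures` thread pool. The sketch is restricted to a single-channel input and one filter, uses 0-based indexes into the zero-padded input, and assumes the stride divides the padded extent evenly; all function names are hypothetical. In standard CPython builds, threads running pure Python code are serialized by the global interpreter lock, so the example shows the task structure rather than an actual speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def output_shape(h_x, w_x, h_f, w_f, stride=1, pad=0):
    """Height and width of the feature map."""
    return ((h_x - h_f + 2 * pad) // stride + 1,
            (w_x - w_f + 2 * pad) // stride + 1)


def zero_pad(x, pad):
    """Surround the 2-D list x with `pad` laps of zeros."""
    width = len(x[0]) + 2 * pad
    border = [[0.0] * width for _ in range(pad)]
    body = [[0.0] * pad + list(r) + [0.0] * pad for r in x]
    return border + body + [[0.0] * width for _ in range(pad)]


def conv_area(i, j, h_f, w_f, stride):
    """Inclusive (row_begin, row_end, col_begin, col_end) in the padded input."""
    return (i * stride, i * stride + h_f - 1,
            j * stride, j * stride + w_f - 1)


def parallel_conv2d(x, filt, stride=1, pad=0, workers=4):
    """Compute every feature-map element as an independent task."""
    h_f, w_f = len(filt), len(filt[0])
    padded = zero_pad(x, pad)
    h_a, w_a = output_shape(len(x), len(x[0]), h_f, w_f, stride, pad)

    def task(index):
        # Each task only reads the shared input and filter.
        i, j = divmod(index, w_a)
        r0, _, c0, _ = conv_area(i, j, h_f, w_f, stride)
        return sum(filt[m][n] * padded[r0 + m][c0 + n]
                   for m in range(h_f) for n in range(w_f))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        flat = list(pool.map(task, range(h_a * w_a)))
    return [flat[i * w_a:(i + 1) * w_a] for i in range(h_a)]
```

Because `pool.map` returns results in submission order, the flat result list can be reshaped into the feature map without any further synchronization between tasks.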
As defined in Eq. (13), the maximum parallelism degree of a convolutional layer is equal to the number of elements of the output feature map, which is computed according to the scales of the input matrix $X$ and the filter matrix $F$. The total execution duration of a convolutional layer is calculated in Eq. (15):
where $K_c$ is the number of elements in the feature map $A$ and $t_k$ is the execution duration of the $k$-th operation task.
4.1.2 Parallelization of Local Weight Training Process
To distinguish the weight set of the entire CNN network from that of each CNN subnetwork, we defined the global weight set and the local weight sets in Section 3.3.2. In this section, the training process of the local weight set of each CNN subnetwork is parallelized on each computer.
After obtaining the outputs of a CNN subnetwork, the error (loss function) of each layer is evaluated from the output layer to the first convolutional layer using the Back Propagation (BP) method. The Stochastic Gradient Descent (SGD) process[24, 25] is involved in updating the weight parameters among all layers of the current CNN subnetwork. In the output layer, the square error of all neurons is taken as the objective function of weight training, as defined in Eq. (16):
where denotes the loss function of the input , and and are the label and the output of the neuron in the output layer, respectively. The error of is the negative of the partial derivative of the loss with respect to the input of , as calculated in Eq. (17):
where is the input of the neuron that is connected with , that is, is the output of . is the weight of the connection between neurons and .
Let be the set of errors of neurons in the -th layer . Based on , the error set of neurons in is calculated in Eq. (18):
where is the weight set of and is the weighted input of , defined as:
where is the output matrix of , consisting of each element . An example of the calculation process of loss function between layers and is shown in Fig. 7.
We parallelize the process for the loss function calculation, where the errors of neurons in the same layer are computed in parallel. In the convolutional layer, each neuron in the output layer (a feature map) is connected to a part of neurons in the input layer (an input matrix). In such a case, the error calculation of neurons in the previous layer depends on the results of a part of neurons in the next layer . Hence, we parallelize this process depending on neurons in . An example of the loss function calculation parallelization is shown in Fig. 8.
After obtaining the error set of neurons in , we calculate the error of each neuron in . Let be the error component of neuron in for in , defined as:
where and are the height and width of the filter parameter matrix between and . Based on the error set of neurons, the weight parameters of are computed subsequently. The gradient of each weight is calculated in parallel, as defined in Eq. (21):
The gradient of the bias weight is computed in Eq. (22):
Based on the gradient values, each weight is updated in Eq. (23):
where is the learning rate of the CNN network.
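As a hedged illustration of the update rule described above, the sketch below performs one SGD step for a single linear layer under the square-error objective. The notation is an assumption made for this example only (`w`, `b`, `x`, `t`, `lr`, and the function name `sgd_step` are hypothetical) and is not the paper's Eqs. (16)-(23). The error term follows the sign convention stated in the text, that is, the negative of the derivative of the loss with respect to the neuron input.

```python
import numpy as np


def sgd_step(w, b, x, t, lr=0.1):
    """One SGD update of a linear layer y = w @ x + b with E = 0.5 * sum((t - y)**2)."""
    y = w @ x + b
    loss = 0.5 * np.sum((t - y) ** 2)
    delta = t - y                  # error term: -dE/d(net) for a linear output neuron
    grad_w = -np.outer(delta, x)   # dE/dw[j, i] = -delta[j] * x[i]; entries are independent
    grad_b = -delta                # dE/db[j] = -delta[j]
    # Gradient descent: move each weight against its gradient, scaled by the learning rate.
    return w - lr * grad_w, b - lr * grad_b, loss


w = np.zeros((1, 2))
b = np.zeros(1)
x = np.array([1.0, 2.0])
t = np.array([1.0])

w, b, loss_before = sgd_step(w, b, x, t)
_, _, loss_after = sgd_step(w, b, x, t)
print(loss_before, loss_after)
```

Because every entry of `grad_w` depends only on one error term and one input, the entries can be computed in any order or concurrently, which is the property the parallel gradient calculation relies on.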
4.2 Implementation of Inner-layer Parallel Training
We implement the inner-layer parallel training of BPT-CNN on computing nodes equipped with multi-core CPUs. Based on the parallel models proposed in the previous section, computing tasks of these training phases are decomposed into several subtasks. The workflow of task decomposition for a CNN subnetwork is illustrated in Fig. 9.
(1) Task priority marking.
According to the logical and data dependence of the decomposed subtasks, a task Directed Acyclic Graph (DAG) is created. With thread-level load balancing and completion time minimization as the optimization goals, the priorities of tasks in the task DAG are marked. We set a maximum priority value for the entrance task of the task DAG. Then, the priorities of the remaining tasks are set according to their level in the DAG. Specifically, upstream tasks have higher priorities than downstream tasks, while tasks at the same level have the same priority.
(2) Task scheduling and execution.
Based on the task priorities, we allocate the tasks of the entire CNN training network to threads on different CPU cores of the multi-core CPU platform using the priority task scheduling algorithm. An example of the task scheduling of the CNN training network with multi-threaded parallelism is illustrated in Fig. 10.
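The two steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the scheduling algorithm used in the paper: the function names `mark_priorities` and `schedule`, the maximum priority of 100, the task names, and the task durations are all hypothetical, and each task is simply placed on the thread that becomes free first.

```python
from collections import deque


def mark_priorities(dag, max_priority=100):
    """dag maps each task to its downstream tasks; deeper levels get lower priority."""
    indegree = {t: 0 for t in dag}
    for succs in dag.values():
        for s in succs:
            indegree[s] = indegree.get(s, 0) + 1
    level = {t: 0 for t, d in indegree.items() if d == 0}   # entrance tasks
    queue = deque(level)
    while queue:                                            # topological traversal
        t = queue.popleft()
        for s in dag.get(t, []):
            level[s] = max(level.get(s, 0), level[t] + 1)
            indegree[s] -= 1
            if indegree[s] == 0:
                queue.append(s)
    return {t: max_priority - lvl for t, lvl in level.items()}


def schedule(dag, duration, priority, n_threads):
    """Greedy list scheduling: run the highest-priority ready task on the earliest-free thread."""
    preds = {t: set() for t in priority}
    for t, succs in dag.items():
        for s in succs:
            preds[s].add(t)
    finish, plan = {}, []
    thread_free = [0.0] * n_threads
    remaining = set(priority)
    while remaining:
        ready = [t for t in remaining if all(p in finish for p in preds[t])]
        task = min(ready, key=lambda t: (-priority[t], t))   # ties broken by name
        k = min(range(n_threads), key=lambda i: thread_free[i])
        start = max([thread_free[k]] + [finish[p] for p in preds[task]])
        finish[task] = start + duration[task]
        thread_free[k] = finish[task]
        plan.append((task, k, start, finish[task]))
        remaining.remove(task)
    return plan


dag = {"load": ["conv_a", "conv_b"], "conv_a": ["pool"], "conv_b": ["pool"], "pool": []}
duration = {"load": 1.0, "conv_a": 2.0, "conv_b": 2.0, "pool": 1.0}
priority = mark_priorities(dag)
plan = schedule(dag, duration, priority, n_threads=2)
for task, thread, start, end in plan:
    print(f"{task}: thread {thread}, {start} -> {end}")
```

In this toy DAG the two convolution tasks share a level and a priority, so they run side by side on the two threads once the entrance task has finished.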
5.1 Experimental Settings
All of the experiments are conducted on a distributed computing cluster built with 30 high-performance computing nodes, each equipped with an Intel Xeon Nehalem-EX CPU and 48 GB of main memory. Each Nehalem-EX processor features up to 8 cores in a single chip, supporting 16 threads and 24 MB of cache. Comparison experiments are conducted to evaluate the proposed BPT-CNN against the Tensorflow CNN, DistBelief, and DC-CNN algorithms in terms of accuracy and performance. Large-scale public image datasets from ImageNet with 14,197,122 samples are used in the experiments.
5.2 Accuracy Evaluation
We evaluate the accuracy of BPT-CNN by comparing it with Tensorflow, DistBelief, and DC-CNN. For each algorithm, five-fold experiments on the ImageNet dataset with 100 epoch iterations are conducted, and the average values of accuracy and the Area Under the Curve (AUC) are compared. The experimental results of accuracy and AUC of the comparison algorithms are presented in Fig. 11.
As shown in Fig. 11 (a) and (b), BPT-CNN achieves accuracy similar to that of the compared algorithms, as well as higher AUC values in most cases. The average accuracy of BPT-CNN is 0.744, while that of Tensorflow, DistBelief, and DC-CNN is 0.721, 0.722, and 0.639, respectively. Because of the parallel training and global weight updating, BPT-CNN reduces the impact of local overfitting and obtains more stable and robust global network weights. As the number of training epochs increases, both the accuracy and the AUC of BPT-CNN steadily increase. On average, the AUC of BPT-CNN is 5.91% higher than that of Tensorflow, 9.56% higher than that of DistBelief, and 10.09% higher than that of DC-CNN. Therefore, compared with Tensorflow, DistBelief, and DC-CNN, BPT-CNN does not reduce the accuracy of CNNs. Moreover, benefitting from the global weight updating strategy, BPT-CNN is more robust than the compared algorithms.
5.3 Performance Evaluation
5.3.1 Execution Time of Comparison Algorithms
The execution time of these algorithms is compared using 100 training iterations in various configurations: different data sizes and computing cluster scales. The comparison of the average execution time of each algorithm in each case is shown in Fig. 12.
As can be seen in Fig. 12 (a) and (b), the proposed BPT-CNN algorithm achieves higher performance than the compared algorithms in most cases. Benefitting from the data-parallelism strategy, when the data size increases, the volume of the partitioned subset on each computer grows only slightly, leading to a slight increase in the average workload of each computer. For example, when the number of training samples increases from 100,000 to 700,000, the execution time of BPT-CNN rises from 62.77s to 307.35s, while that of Tensorflow increases from 54.38s to 454.23s, and that of DC-CNN sharply increases from 91.21s to 929.74s. In addition, taking advantage of the IDPA strategy, the proposed BPT-CNN algorithm scales better than the compared algorithms. When the computing cluster is expanded, the execution time of BPT-CNN and Tensorflow is significantly reduced. Experimental results indicate that BPT-CNN achieves high performance and scalability in distributed computing clusters.
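A short worked calculation makes the comparison concrete. The Python snippet below uses only the execution times quoted above and relates their growth to the 7x growth of the data size; the variable names are illustrative.

```python
# Execution times in seconds for 100,000 and 700,000 training samples (from the text above).
times = {
    "BPT-CNN": (62.77, 307.35),
    "Tensorflow": (54.38, 454.23),
    "DC-CNN": (91.21, 929.74),
}
data_growth = 700_000 / 100_000   # the dataset grows 7x

# Ratio of the execution time at the large data size to that at the small one.
growth = {name: t_large / t_small for name, (t_small, t_large) in times.items()}

for name, factor in growth.items():
    trend = "slower" if factor < data_growth else "faster"
    print(f"{name}: time grows {factor:.2f}x, {trend} than the data size")
```

The execution time of BPT-CNN grows by roughly 4.9x, which is below the 7x growth of the data, whereas the times of Tensorflow and DC-CNN grow by roughly 8.4x and 10.2x, respectively.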
5.3.2 Execution Time Comparison for Fixed Accuracy
Considering the different training architectures of the comparison algorithms, we discuss how these algorithms trade off performance and accuracy against resource consumption. We first determine the number of training iterations each algorithm requires to reach a given accuracy, and then measure the execution time each algorithm takes under different computing resources. The comparison results are shown in Table I and Fig. 13.
From Table I, all algorithms need a similar number of iterations to achieve an accuracy of 0.650. However, to achieve higher accuracy, BPT-CNN requires fewer iterations than Tensorflow, DistBelief, and DC-CNN. For example, BPT-CNN requires 42 iterations to achieve an accuracy of 0.750, while Tensorflow requires 64, DistBelief requires 85, and DC-CNN requires up to 147. In addition, for an accuracy of 0.750, we compare the actual execution time of each algorithm under different numbers of computing nodes and CPU cores, as shown in Fig. 13 (a) and (b). When the number of computing nodes and CPU cores is expanded, the execution time of BPT-CNN and Tensorflow is significantly reduced. In contrast, the execution time of the DistBelief and DC-CNN algorithms increases once the cluster scale reaches a certain amount (e.g., 25-35), which is caused by the additional data communication among the growing number of machines. Experimental results indicate that BPT-CNN achieves higher accuracy and performance than the other algorithms using the same computing resources. Moreover, when the number of computing nodes and CPU cores increases, the performance benefits of BPT-CNN are more noticeable.
5.3.3 Execution Time of BPT-CNN with Different Strategies
We further evaluate the performance of the proposed BPT-CNN under different global weight update and data partitioning strategies. To evaluate the effectiveness of the IDPA strategy, we perform the same work using the Uniform Data Partitioning and Allocation (UDPA) strategy, where the training dataset is uniformly partitioned and the partitions are allocated to the computers. Comparison experiments are conducted in terms of data size, computing cluster scale, CNN network scale, and thread size. The average execution time of BPT-CNN with different strategies is presented in Fig. 14.
In Fig. 14 (a), 7 different scales of CNN network are constructed in the experiments, as described in Table II. Here “layers(Conv)” and “filters(Conv)” denote the number of convolutional layers and the number of filters at each layer, respectively. “layers(FC)” and “neurons(FC)” denote the number of fully-connected layers and the number of neurons in each layer, respectively.
Comparing the AGWU and SGWU strategies, the execution time of BPT-CNN using AGWU is clearly lower than that using SGWU in most cases. In AGWU, because the global weight set is updated asynchronously, each computing node spends minimal time waiting for the global weight update and trains almost continuously. In addition, comparing the data partitioning strategies IDPA and UDPA, the incremental data partitioning keeps the workload of the computing nodes well balanced, which further shortens the waiting time among different machines. As shown in each case in Fig. 14, the execution time of BPT-CNN with IDPA is significantly lower than that with the UDPA strategy. Hence, BPT-CNN using the AGWU+IDPA strategies is the most efficient of the compared configurations. Moreover, as the data size or the CNN network scale increases, the execution time of BPT-CNN using AGWU+IDPA rises only slowly. When the computing cluster scale and the number of threads on each machine increase, the benefits of AGWU+IDPA are more noticeable. Taking advantage of the IDPA and AGWU strategies, BPT-CNN achieves a significant performance advantage.
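The waiting-time argument for asynchronous updating can be illustrated with a toy timing model. The Python sketch below is an assumption-laden simplification, not the paper's AGWU or SGWU procedure: it ignores communication and server-side update costs, the per-iteration times are made up, and the function names are hypothetical. In the synchronous case every round lasts as long as its slowest node; in the asynchronous case each node proceeds through its own iterations without waiting.

```python
def sync_time(t):
    """Synchronous updating: each round ends when the slowest node finishes."""
    return sum(max(round_times) for round_times in zip(*t))


def async_time(t):
    """Asynchronous updating: each node runs its iterations back to back."""
    return max(sum(node_times) for node_times in t)


# t[i][k] is the (hypothetical) time node i needs for its k-th local iteration.
t = [
    [1.0, 3.0, 1.0],   # node 0 is slow in the second iteration
    [3.0, 1.0, 1.0],   # node 1 is slow in the first iteration
]

print("synchronous: ", sync_time(t))
print("asynchronous:", async_time(t))
```

For any set of per-iteration times, the sum of per-round maxima is at least the maximum of per-node sums, so the synchronous total in this model can never be smaller than the asynchronous one.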
5.4 Data Communication and Workload Balancing
We evaluate the proposed BPT-CNN architecture in terms of data communication overhead and workload balancing by comparing it with the Tensorflow, DistBelief, and DC-CNN algorithms. 600,000 training samples are used in the experiments, and the number of computing nodes increases from 5 to 35 in each case. Experimental results of data communication and workload balancing are shown in Fig. 15.
It is clear from Fig. 15 (a) and (b) that, in most cases, BPT-CNN achieves better workload balance and lower data communication costs than the other algorithms. Due to the use of the IDPA strategy in BPT-CNN, the only communication overhead between the computing nodes comes from transmitting local/global weight parameters, and no training sample migration is required. Hence, as the number of computing nodes increases from 5 to 35, the communication overhead of BPT-CNN slowly increases from 2.35 MB to 11.44 MB. In contrast, due to dynamic resource scheduling, the communication overhead of Tensorflow grows from 2.73 MB with 5 computers to 45.23 MB with 35 computers. Moreover, to achieve workload balancing, DistBelief and DC-CNN use data migration operations during training, which results in heavy communication overhead between computers.
We compare the workload balance of each algorithm under different scales of the computing cluster, as shown in Fig. 15 (b). Our BPT-CNN architecture considers the heterogeneity of computing nodes and allocates workloads based on the actual computing power of each node. Hence, as the scale of the cluster increases, BPT-CNN maintains a stable workload balance, remaining between 0.80 and 0.89. In contrast, without heterogeneity-aware data allocation, the workload of the other comparison algorithms is not as balanced as that of BPT-CNN. The unbalanced workload further leads to long waiting time for synchronization and more execution time for the entire CNN network. Experimental results demonstrate that BPT-CNN significantly improves the workload balance of the distributed computing cluster with acceptable communication overhead.
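To make the effect of heterogeneity-aware allocation concrete, the Python sketch below contrasts a uniform split with a split proportional to node speed. It is only an illustration: it is not the IDPA strategy itself, the node speeds are invented, and the `balance` ratio (shortest processing time divided by longest) is a hypothetical stand-in for the workload balance measure plotted in Fig. 15 (b).

```python
def uniform_partition(n_samples, n_nodes):
    """Give every node the same number of samples (up to rounding)."""
    base, extra = divmod(n_samples, n_nodes)
    return [base + (1 if i < extra else 0) for i in range(n_nodes)]


def proportional_partition(n_samples, speeds):
    """Give each node a share of the samples proportional to its speed."""
    total = sum(speeds)
    sizes = [int(n_samples * s / total) for s in speeds]
    # Hand any samples lost to rounding to the fastest nodes first.
    fastest_first = sorted(range(len(speeds)), key=lambda i: -speeds[i])
    for i in fastest_first[: n_samples - sum(sizes)]:
        sizes[i] += 1
    return sizes


def balance(sizes, speeds):
    """Hypothetical balance ratio: 1.0 means all nodes finish at the same time."""
    times = [n / s for n, s in zip(sizes, speeds)]
    return min(times) / max(times)


speeds = [1.0, 2.0, 3.0]   # made-up relative computing power of three nodes
n_samples = 600

uniform = uniform_partition(n_samples, len(speeds))
proportional = proportional_partition(n_samples, speeds)

print(uniform, balance(uniform, speeds))
print(proportional, balance(proportional, speeds))
```

With the uniform split the slowest node takes three times as long as the fastest one, while the proportional split lets all three nodes finish together, which removes the synchronization wait described above.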
This paper presented a bi-layered parallel training architecture to accelerate the training process of large-scale CNNs. In the outer-layer parallel training, the performance of the entire CNN network is significantly improved through data-parallelism optimization, where the issues of data communication, workload balance, and synchronization are well addressed. In the inner-layer parallelism, the training process of each CNN subnetwork is further accelerated using task-parallelism optimization. Extensive experimental results on large-scale datasets indicate that the proposed BPT-CNN effectively improves the training performance of CNNs in distributed computing clusters with minimal data communication and synchronization waiting.
For future work, we will further concentrate on scalable CNN models and the parallelization of deep learning algorithms on high-performance computers. In addition, the development of deep learning algorithms for specific applications is also an interesting topic, such as scalable CNNs for images and LSTMs for time series.
This research is partially funded by the National Key R&D Program of China (Grant No. 2016YFB0200201), the Key Program of the National Natural Science Foundation of China (Grant No. 61432005), the National Outstanding Youth Science Program of National Natural Science Foundation of China (Grant No. 61625202), the International Postdoctoral Exchange Fellowship Program (Grant No. 2018024), and the China Postdoctoral Science Foundation funded project (Grant No. 2018T110829). This work is also supported in part by NSF through grants IIS-1526499, IIS-1763325, CNS-1626432, and NSFC 61672313.
-  A. Coates, B. Huval, T. Wang, D. J. Wu, and A. Y. Ng, “Deep learning with cots hpc systems,” in ICML’13, 2013, pp. 1337–1345.
-  R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis, “Large-scale matrix factorization with distributed stochastic gradient descent,” in KDD’11, 2011, pp. 69–77.
-  J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li, “Imagenet: A large-scale hierarchical image database,” in IEEE CVPR’09, 2009, pp. 248–255.
-  G. Heigold, V. Vanhoucke, A. Senior, P. Nguyen, M. Ranzato, M. Devin, and J. Dean, “Multilingual acoustic models using distributed deep neural networks,” in ICASSP’13, 2013, pp. 8619–8623.
-  L. Cavigelli, D. Gschwend, C. Mayer, S. Willi, B. Muheim, and L. Benini, “Origami: A convolutional network accelerator,” in Proceedings of the 25th edition on Great Lakes Symposium on VLSI, 2015, pp. 199–204.
-  L. Song, Y. Wang, Y. Han, X. Zhao, B. Liu, and X. Li, “C-brain: a deep learning accelerator that tames the diversity of cnns through adaptive data-level parallelization,” in DAC’16, 2016, p. 123.
-  B. Recht, C. Re, S. Wright, and F. Niu, “Hogwild: A lock-free approach to parallelizing stochastic gradient descent,” in NIPS’11, 2011, pp. 693–701.
-  S. Fan, J. Fei, and L. Shen, “Accelerating deep learning with a parallel mechanism using cpu+mic,” International Journal of Parallel Programming, pp. 1–14, 2017.
-  L. Jin, Z. Wang, R. Gu, C. Yuan, and Y. Huang, “Training large scale deep neural networks on the intel xeon phi many-core coprocessor,” in IEEE IPDPS’14, 2014, pp. 1622–1630.
-  Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. Ganger, and E. P. Xing, “More effective distributed ml via a stale synchronous parallel parameter server,” in NIPS’13, 2013, pp. 1–9.
-  J. Liu, H. Wang, D. Wang, Y. Gao, and Z. Li, “Parallelizing convolutional neural networks on intel many integrated core architecture,” in International Conference on Architecture of Computing Systems, 2015, pp. 71–82.
-  A. A. Huqqani, E. Schikuta, S. Ye, and P. Chen, “Multicore and gpu parallelization of neural networks for face recognition,” Procedia Computer Science, vol. 18, pp. 349–358, 2013.
-  D. Strigl, K. Kofler, and S. Podlipnig, “Performance and scalability of gpu-based convolutional neural networks,” in Euromicro Conference on Parallel, Distributed and Network-based Processing, 2010, pp. 317–324.
-  J. Dean, G. Corrado, and R. M. et al., “Large scale distributed deep networks,” in NIPS’12, 2012, pp. 1223–1231.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in ACMMM’14, 2014, pp. 675–678.
-  M. Abadi, A. Agarwal, and P. Barham, “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” CoRR, vol. abs/1603.04467, no. 5, pp. 725–730, 2016.
-  M. Mohammadi, A. Krishna, N. S., and S. K. Nandy, “A hardware architecture for radial basis function neural network classifier,” IEEE TPDS, vol. 29, no. 3, pp. 1829–1844, 2017.
-  I.-H. Chung, T. N. Sainath, B. Ramabhadran, and M. P. et al., “Parallel deep neural network training for big data on blue gene/q,” IEEE TPDS, vol. 28, no. 6, pp. 1703–1714, 2017.
-  M. Sankaradas, V. Jakkula, and S. Cadambi, “A massively parallel coprocessor for convolutional neural networks,” in ASAP’09, 2009, pp. 53–60.
-  J. Bilski and J. Smolag, “Parallel architectures for learning the rtrn and elman dynamic neural networks,” IEEE TPDS, vol. 26, no. 9, pp. 2561–2570, 2015.
-  T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman, “Project adam: Building an efficient and scalable deep learning training system,” in USENIX OSDI’14, 2014, pp. 571–582.
-  Y. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE Journal of Solid-state Circuits, vol. 52, no. 1, pp. 127–138, 2017.
-  S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi, “A dynamically configurable coprocessor for convolutional neural networks,” in ISCA’10, 2010, pp. 247–257.
-  Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng, “On optimization methods for deep learning,” in ICML’11, 2011, pp. 265–272.
-  M. Zinkevich, M. Weimer, L. Li, and A. J. Smola, “Parallelized stochastic gradient descent,” in NIPS’10, 2010, pp. 1–9.
-  L. Zhang, K. Li, C. Li, and K. Li, “Bi-objective workflow scheduling of the energy consumption and reliability in heterogeneous computing systems,” Information Sciences, vol. 379, pp. 241–256, 2017.