Homomorphic Parameter Compression for Distributed Deep Learning Training

11/28/2017 ∙ by Jaehee Jang, et al. ∙ Seoul National University

Distributed training of deep neural networks has received significant research interest, and its major approaches include implementations on multiple GPUs and clusters. Parallelization can dramatically improve the efficiency of training deep and complicated models with large-scale data. A fundamental barrier against the speedup of DNN training, however, is the trade-off between computation and communication time. In other words, increasing the number of worker nodes decreases the time consumed in computation while simultaneously increasing the communication overhead under constrained network bandwidth, especially in commodity hardware environments. To alleviate this trade-off, we suggest the idea of homomorphic parameter compression, which compresses parameters at minimal expense and trains the DNN directly on the compressed representation. Although a concrete method is yet to be devised, we demonstrate that there is a high probability that homomorphism can reduce the communication overhead, owing to negligible compression and decompression times. We also provide the theoretical speedup of homomorphic compression.


1 Introduction

Deep learning (DL) derives structured information from raw data using deep neural networks (DNNs). DL finds hierarchical representations of data through several non-linear layers of a DNN. When the problem to be solved with DL is challenging, we need to grasp complicated representations from the data. With the use of DNNs to solve an increasing number of high-abstraction problems in various fields, the size of training models and the computational load required to train them have continued to grow. Under current software and hardware constraints, DNN training demands a massive amount of processing time keuper2016distributed , which naturally led to the rise of distributed deep learning bekkerman2011scaling ; dean2012large ; recht2011hogwild ; kim2016deepspark ; tensorflow2015-whitepaper . Distributed DL divides the workload (training data or model) across different machines and aims for faster learning while maintaining the original performance of the model.

DNNs iteratively optimize weight parameters based on gradients computed from feedforward/backpropagation, which is highly sequential. Hence the implementation of distributed DNN training requires specific design principles and strategies, as have been suggested for years xing2016strategies . To briefly illustrate how distributed DL is implemented in general, let us take the synchronous SGD update scenario as an example (Fig. 1). Synchronous SGD trains by iterating a set of processes to update global parameters, described by a dotted box in Fig. 1. One set of global parameter updates consists of the following steps, sketched in code below. First, all the worker nodes train until the designated number of iterations (Local Parameter/Gradient Computation in Fig. 1). Then, each worker node pushes its local parameters to the parameter server (Local Parameter Transfer in Fig. 1). Lastly, the parameter server decides the global parameters by aggregating all the pushed local parameters (Global Parameter Update in Fig. 1) and broadcasts them back to the worker nodes (Global Parameter Broadcast in Fig. 1).
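
As a minimal Python sketch of one set of global parameter updates under this scheme (the function train_local_iterations and all sizes here are illustrative placeholders, not from the paper):

# Minimal sketch of one set of global parameter updates in synchronous SGD
# (parameter-server style). train_local_iterations() is a placeholder for the
# framework-specific feedforward/backpropagation loop; it is not from the paper.
import numpy as np

def train_local_iterations(local_params, num_iters):
    # Placeholder: each worker would run `num_iters` minibatch updates here.
    return local_params + 0.01 * np.random.randn(*local_params.shape)

def global_update(global_params, num_workers, num_iters):
    # 1) Local parameter/gradient computation on every worker node.
    local = [train_local_iterations(global_params.copy(), num_iters)
             for _ in range(num_workers)]
    # 2) Local parameter transfer: workers push their parameters to the server.
    # 3) Global parameter update: the server aggregates (here, averages) them.
    new_global = np.mean(local, axis=0)
    # 4) Global parameter broadcast: the server sends them back to the workers.
    return new_global

params = np.zeros(10)
params = global_update(params, num_workers=4, num_iters=200)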

The worker nodes participating in the training frequently exchange their training status with other nodes so that the model can reflect all the divided workloads. However, DNN models are large, and so is the communication load; the transfers therefore cause bottlenecks under constrained communication bandwidth. Especially in commodity hardware environments, the weight-transfer time can overwhelm even the computing time. Along with the transmission time, the time technically required for communication, that is, the time to prepare and sustain communication, is also included in the communication overhead. This communication overhead is one of the main factors that increase the parallel training time. In order to alleviate it, attempts have been made to reduce the model size before communication iandola2016squeezenet ; spring2016scalable ; elgohary2016compressed .

The goal of our study is to demonstrate homomorphic parameter compression, a novel concept of compressed deep learning. As the term homomorphic suggests, it is a compression method that reduces the size of parameters while allowing the key operations of DL to be executed without decompression. Since parameters are transferred numerous times in distributed training settings, this method can remarkably reduce the time consumed in communication, which is the main rate-limiting step of distributed training. Furthermore, homomorphism prevents the additional overhead generated by repetitive compression and decompression. The main contributions of this paper include the following: 1) To our knowledge, this is the first attempt to demonstrate homomorphic parameter compression. 2) We theoretically characterize the possible factors in parameter compression, e.g., the compression ratio, and provide thorough simulative analyses. 3) We provide the theoretical reduction in training time of homomorphically compressed distributed training as a function of the number of participating worker nodes for different values of the compression ratio.

Figure 1: Simplified process demonstration of distributed DL using synchronous SGD

2 Literature Survey on Compressed Deep Learning

Numerous studies have suggested compression in deep learning han2015deep ; courbariaux2015binaryconnect ; courbariaux2016binarized ; seide20141 . Existing compression methods aim at fitting very-large-scale models into a mobile device or single FPGA chip, at alleviating the high communication overhead due to distributed training, and at improving computational performance as well as storage and power efficiency.

Post-training compression for inference.   A series of studies reduced the storage and energy required to run inference on large DL models and deploy them on embedded systems or mobile devices. Deep compression han2015deep used pruning, trained quantization, and Huffman encoding on weights and demonstrated a high compression ratio that fits models in on-chip memory. CNNpack NIPS2016_6390 demonstrated convolutional neural network (CNN) compression in the discrete cosine transform (DCT) frequency domain. These methods can effectively reduce the size of networks while retaining pre-trained information. On the other hand, since they are designed for compression after training is completed, the time consumed in compression is a minor issue.

In-training compression for efficient deep learning.   Employing compression in training enables efficient computation and communication under limited resources. Especially when transferring parameters in distributed DL, the constrained network bandwidth may consume a large amount of time in communication and slow down the entire training process. The following approaches proposed compression for both training and inference in order to improve computational performance as well as energy and storage efficiency. We classify the approaches into two types: repetitive (de)compression and one-time compression. The training process of each type is illustrated in Fig. 2.

Figure 2: Simplified process demonstration of in-training compression methods, based on synchronous SGD: (a) repetitive (de)compression, and (b) one-time compression. Note that we consider parameter transfer time as the time in which a worker node is not training but waiting for a new gradient update. (Best viewed in color)

Repetitive (de)compression   Some methods encode weights (or gradients) at every iteration; we call such methods repetitive (de)compression methods. Weight binarization methods courbariaux2015binaryconnect ; courbariaux2016binarized binarize weights in order to train on low-power devices or specialized hardware. The binarized weights are used only during the forward and back propagations but not during the parameter update. The authors of seide20141 ; alistarh2016qsgd used gradient binarization in distributed DL training in order to reduce the communication overhead. They encoded gradients during the global parameter update, and worker nodes have to decode the gradients to update their local parameters. FreshNet Chen:2016:CCN:2939672.2939839 combined hashing Chen:2015:CNN:3045118.3045361 with the DCT to compress CNN models and train in the frequency domain, demonstrating good compression performance and robustness in model accuracy. However, repeated compression and decompression carry a high risk of additional compression overhead, as shown in Fig. 2(a). Although good compression performance was demonstrated, more careful considerations are needed to utilize the aforementioned methods in distributed training, as is done in QSGD alistarh2016qsgd with double buffering. A quantitative analysis of compression overhead is presented in Section 4.

One-time compression   If DL models are compressed only once, the compression time does not significantly affect the overall training time. Compressed linear algebra (CLA) elgohary2016compressed exploits lightweight database compression techniques to compress matrices and perform computations on the compressed representation. Although its compression ratio and operation performance are close to those of uncompressed operations, it is difficult to apply directly to distributed DL training because DL training requires more nonlinear operations. In particular, operations that are frequently used in DL training, such as normalization and pooling, are not yet supported on the compressed representation (as of the version confirmed in September 2017, available at https://github.com/apache/systemml).

3 Algorithmic Design

                          Size    Compression ratio    Compression time    Decompression time
AlexNet Caffemodel        MB                           s                   s
ILSVRC2012 Train Data     GB                           s                   s
CIFAR-100 Train Data      MB                           s                   s
Table 1: Examples of Gzip compression

Communication overhead is one of the major drawbacks of distributed DL, and compressing the parameters can reduce the communication workload. On the other hand, if the parameters are compressed and decompressed at every update, there is a high risk of additional compression overhead. As indicated in Table 1, compressing and decompressing parameters can take considerable time. Therefore, we need a compression approach that reduces the size of the parameters without significantly increasing the computing time.
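
For instance, the compression cost alone can be measured with a short script like the following (the array size is an arbitrary stand-in for a parameter blob; the figures in Table 1 were not produced by this script):

# Rough timing of Gzip compression/decompression of a parameter-sized blob.
# The array size is an arbitrary stand-in, not one of the models in Table 1.
import gzip, time
import numpy as np

params = np.random.randn(25_000_000).astype(np.float32)   # ~100 MB of weights (assumed)
raw = params.tobytes()

t0 = time.time(); compressed = gzip.compress(raw, compresslevel=6); t1 = time.time()
t2 = time.time(); restored = gzip.decompress(compressed); t3 = time.time()
assert restored == raw

print("ratio", len(compressed) / len(raw))
print("compress_s", round(t1 - t0, 2), "decompress_s", round(t3 - t2, 2))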

We propose homomorphic parameter compression, inspired by homomorphic encryption rivest1978data . A homomorphism maps one algebraic system onto another so that operations performed in the encoded system are equivalent to those of the original system. Our goal is to propose a compression method that can be trained without decompression. Following the early formulation of homomorphic encryption rivest1978data , we define homomorphic compression as follows. Suppose we have a system $U = (W; f_1, \dots, f_n)$ that consists of a set of parameters $W$ and the operations $f_i$'s concerned with training. The possible $f_i$'s may vary depending on the model structure; as Fig. 3 shows, linear operations take the majority of the operations, and nonlinear operations such as pooling and ReLU are included as well. We propose finding an encoding function $\phi: U \rightarrow U'$, where $U' = (W'; f_1', \dots, f_n')$ is the compressed system, satisfying the following requirements:

  1. An encoded version $\phi(w)$ of a weight $w$ should be smaller than the original weight $w$.

  2. $\phi$ should be easy to compute, i.e., conversion by $\phi$ should not take too much time. We point out in Fig. 5 how compression overhead can slow down the total training time. Even though we can continue training without decompression, if the compression time is long enough to affect the total training time, it is difficult to expect a temporal gain from homomorphic compression.

  3. The encoded operations $f_i'$ should be efficiently computable. When training a DNN, various operations (the $f_i$'s) are required, such as matrix multiplications and activation functions, as shown in Fig. 2 of the Supplement. If we encode the functions $f_i$ into equivalent operations $f_i'$, these operations must also be computationally efficient. A toy illustration of the required homomorphism property follows below.
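
As a toy illustration only (not the compression method this paper envisions), the following sketch uses a random linear projection as $\phi$ and parameter averaging as the training operation; because both are linear, averaging commutes with the encoding, which is exactly the property $\phi(f(w_1, w_2)) = f'(\phi(w_1), \phi(w_2))$ demanded above:

# Toy check of the homomorphism property phi(f(w1, w2)) = f'(phi(w1), phi(w2)).
# phi is a random linear sketch and f is parameter averaging, the aggregation a
# synchronous-SGD parameter server performs. This is NOT the proposed method;
# a plain random projection is not invertible, so it only illustrates that a
# linear encoder lets aggregation run in the compressed domain.
import numpy as np

rng = np.random.default_rng(0)
d, k = 10_000, 1_000                                       # original / compressed sizes (assumed)
P = rng.standard_normal((k, d)) / np.sqrt(k)               # encoding matrix

def phi(w):
    return P @ w                                           # compressed representation

w1, w2 = rng.standard_normal(d), rng.standard_normal(d)    # two workers' parameters
avg_then_encode = phi((w1 + w2) / 2)                       # phi(f(w1, w2))
encode_then_avg = (phi(w1) + phi(w2)) / 2                  # f'(phi(w1), phi(w2)), here f' = f
print(np.allclose(avg_then_encode, encode_then_avg))       # True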

Figure 3: GPU kernel analysis of AlexNet training (Caffe)

4 Experimental Study

4.1 Experimental Settings

Notations & formulated assumptions   We assume a distributed training environment with $M$ worker nodes. It parallelizes single-node training with minibatch size $B$ on a target dataset of size $D$. The single node consumes time $C$ to compute one minibatch, and the total size of the weight parameters is $W$.

Optimization Scheme   When conducting distributed training, we can define various optimization schemes according to how and when local parameters (the parameters each worker node trains) are synchronized with the global parameters (the parameters all worker nodes share). Communication overhead is inevitable regardless of the strategy we choose. In order to emphasize the effect of communication overhead for different numbers of worker nodes, we assume that the DNN training is optimized by synchronous stochastic gradient descent (synchronous SGD).

Synchronous SGD trains by iterating a set of processes to update the global parameters, as described in Fig. 1. In this paper, we define the time required for the local parameter/gradient computation step, up to the designated number of iterations, as the computation time $T_{comp}$, and the time required for the remaining steps as the parameter transfer time $T_{trans}$; together they add up to the time required for one set of global parameter updates, $T_{update} = T_{comp} + T_{trans}$.

Minibatch size, $B_M$   We assume data-parallel training, which divides the training dataset across the $M$ worker nodes. As the dataset is divided, the ratio of the original minibatch size to the per-node training data size becomes larger. If the batch size is too large relative to the data, training may be delayed bengio2012practical . Hence we assume that the minibatch size is reduced by the same factor as the training set. Therefore, $B_M = B / M$.

Minibatch computation time, $C_M$   As the minibatch size decreases by a factor of $M$, the time consumed for one iteration is expected to decrease by the same ratio. Therefore, $C_M = C / M$.

Computation time per update, $T_{comp}$   Since the per-node data and the minibatch size both decrease by a factor of $M$, the number of training iterations required is the same as in single-node training in order to train the same number of epochs. Hence, with $I$ denoting the designated number of local iterations per global update, $T_{comp} = I \cdot C_M = I \cdot C / M$.

Parameter transfer time, $T_{trans}$   In synchronous SGD, the local parameters of every worker node are collected when updating the global parameters. If the number of participating worker nodes increases, the amount of parameter data to be exchanged increases linearly. That is, if we define the size of the weight parameters as $W$ and train with $M$ worker nodes, the amount of parameter data required to communicate is $M \cdot W$. Letting the transmission rate of the cluster be $R$, the parameter transfer time is $T_{trans} = M \cdot W / R$.
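
As a minimal sketch of this cost model (all constants below are hypothetical stand-ins, not measurements from our cluster), the per-update time and the resulting speedup can be computed directly:

# Minimal sketch of the cost model above with hypothetical constants: C is the
# single-node minibatch time, W the parameter size in bytes, R the link rate,
# and I the local iterations per global update. It reproduces the qualitative
# trade-off: computation shrinks with M while parameter transfer grows with M.
C = 0.5          # seconds per minibatch on a single node (assumed)
W = 250e6        # bytes of weight parameters (assumed, roughly AlexNet-sized)
R = 125e6        # bytes per second, i.e. about 1 Gbps Ethernet (assumed)
I = 200          # local iterations per global parameter update (assumed)

def update_time(M, r=1.0):
    """Time of one global update with M workers; r is the compression ratio
    (r = 1.0 means the transferred parameters are not compressed)."""
    t_comp = I * C / M            # T_comp = I * C / M
    t_trans = M * (r * W) / R     # T_trans = M * r * W / R
    return t_comp + t_trans

single_node = I * C               # the same amount of work on a single node
for M in (1, 4, 16, 25):
    print(M,
          round(single_node / update_time(M), 2),           # vanilla speedup
          round(single_node / update_time(M, r=0.2), 2))    # speedup at ratio 0.2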

From the assumptions stated above, we simulated distributed training for CNN and RNN models. We trained AlexNet krizhevsky2012imagenet on the ImageNet dataset deng2009imagenet to simulate distributed CNN training. For simulated distributed RNN training, we trained the STREET model Smith2016 on the French Street Name Signs (FSNS) dataset (TensorFlow implementation at https://github.com/tensorflow/models/tree/master/street). We used Caffe jia2014caffe for the CNN model and TensorFlow tensorflow2015-whitepaper for the RNN model as computing engines. Synchronous SGD iterates the same process of global parameter updates, in which the worker nodes train up to the designated number of iterations and then communicate parameters to apply the global training trend. Hence, Figs. 4, 5, 6, and Fig. 7(b), (c) represent only one set of global updates out of the total training in terms of time.

The hardware we used in the simulation is a commodity cluster. We used a homogeneous cluster consisting of 25 identical machines connected via Gigabit Ethernet. Each worker node has an Intel Core i7-4790 processor with 16GB of main memory and an NVIDIA GTX970 GPU.

4.2 Simulated Analysis of Communication Overhead Effect and Naïve Parameter Compression

Figure 4: Simulated analysis of $T_{comp}$ vs. $T_{trans}$ on vanilla synchronous SGD training. We conducted experiments for one set of global parameter updates. (a) 200-iter AlexNet training (Caffe). (b) 20-iter STREET training (TensorFlow).

Fig. 4 shows the simulated global parameter update of vanilla synchronous SGD. As the number of nodes increases, the computation time decreases but the parameter transfer time increases. Beyond a certain point, parameter transfer takes more time than minibatch computation, which demonstrates the serious inefficiency in resource utilization caused by communication overhead in distributed training. If we keep increasing the number of nodes, the parameter transfer time eventually exceeds even the single-node minibatch computation time, so adding nodes becomes a hindrance rather than contributing to speedup.

Figure 5: Simulated analysis of $T_{comp}$ vs. $T_{trans}$ on vanilla synchronous SGD training with a common compression method. We conducted experiments for one set of global parameter updates, assuming Gzip compression/decompression at every parameter update. (a) 200-iter AlexNet training (Caffe). (b) 20-iter STREET training (TensorFlow).

If we can compress the weight parameters by a ratio $r$, the amount of parameter data to transfer is reduced to $r \cdot M \cdot W$, which also reduces the time required for parameter transfer. However, the time consumed in compressing the parameters is a problem. Suppose that, after a worker node finishes its batch, it compresses the trained parameters using Gzip. The parameter server then aggregates the compressed parameters, and since Gzip-compressed data are not computable, it has to decompress the parameters in order to aggregate and average them. After the global parameters are set, the parameter server compresses them again and sends them back to the worker nodes. When a worker node receives the compressed parameters, it has to decompress them once more to keep training.

If compressed training is conducted as explained above, there is a high possibility of another kind of overhead, which we call compression overhead. We simulated this scenario under the same settings as Fig. 4, and the results are shown in Fig. 5. With Gzip compression, the parameter transfer time decreases because of the reduced size of the parameters. On the other hand, the compression and decompression time can significantly exceed the original training time, as shown in Fig. 5.

4.3 The Expected Gain of Homomorphic Compression

From the simulation results above, we can learn two things: first, compressing the parameters can reduce the communication workload, and second, we need a compression approach that takes compression overhead into account. Therefore, we propose homomorphic parameter compression.

Fig. 6(a) shows the theoretical speedup of homomorphically compressed distributed training based on the simulative analysis conducted in this section. It is represented as a function of the number of participating worker nodes for different compression ratios. The orange dotted curve in Fig. 6(a) shows the ideal distributed training case, where there is no communication at all so we can achieve linear speedup. The yellow dashed curve is the speedup of vanilla SGD. The ideal speedup in homomorphic compression occurs when there is no overhead due to the increased operation time. Hence, the green solid and red double solid curves are the theoretical upper bounds in speedup when the compression ratio is 0.2 and 0.5 respectively.

Figure 6: (a) Theoretical speedup as a function of the number of worker nodes for different compression ratios when training ImageNet with AlexNet. (b)&(c) Simulated analysis of the proposed method for 200-iter AlexNet training on 16 worker nodes, under two different settings of the compression ratio and operation overhead. We illustrate the expected $T_{comp}$ and $T_{trans}$ for one set of global parameter updates. Compression and decompression are not shown in these graphs because they are performed only once throughout the training. (Best viewed in color)

In actual homomorphic compression, the encoded operations are likely to require more time than the original operations; the simulated results based on this assumption are shown in Fig. 7. Note that the compression time in Fig. 7 is not incurred at every parameter update but only once, in the early stage of training. Even though computing on the compressed representation may take more time than the original computations, if we can control the operation overhead and the compression ratio, fast and large-scale DL training is attainable, as shown in Fig. 7.

Figure 7: Simulated analysis of the proposed method. (a) Comparison of the total AlexNet training time among the proposed method, synchronous SGD with Gzip compression, and vanilla synchronous SGD. (b)&(c) $T_{comp}$ vs. $T_{trans}$ on synchronous SGD training with the proposed method; we illustrate the expected result of one set of global parameter updates, and the compression and decompression shown in the graphs are performed only once throughout the training. (b) 200-iter AlexNet training (Caffe) and (c) 20-iter STREET training (TensorFlow), each for a fixed compression ratio and operation overhead.

5 Discussion and Future Work

We analyzed the effect of homomorphic compression on distributed training in Section 4. In this section, we discuss the parameters that must be considered when designing a homomorphic compression method. Below, we present an in-depth analysis of the computational efficiency required for the encoded operations $f_i'$, based on the assumptions made in Section 3 and Section 4:

Let $r$ be the compression ratio (where $0 < r < 1$), and let $t_{f_i}$ and $t_{f_i'}$ be the time consumed in performing $f_i$ and $f_i'$, respectively. We define the operation overhead as $h = t_{f_i'} / t_{f_i}$. Then the compressed minibatch computation time and the compressed parameter transfer time are expressed as $T_{comp}' = h \cdot I \cdot C / M$ and $T_{trans}' = r \cdot M \cdot W / R$, respectively.

By setting the upper bound of the total training time as $\alpha \cdot T_{update}$ (where $0 < \alpha \le 1$), the relationship between $h$ and $r$ can be obtained as expressed by Eq. 5. Therefore, we can achieve the desired speedup if we can fit $h$ and $r$ under Eq. 5.
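
Under the cost model above, and assuming the bound is applied to one set of global parameter updates, this constraint takes roughly the following form (a reconstruction from the definitions in this section, shown for illustration rather than as the exact Eq. 5):

$$ T_{update}' \;=\; h \cdot \frac{I \cdot C}{M} + \frac{r \cdot M \cdot W}{R} \;\le\; \alpha \left( \frac{I \cdot C}{M} + \frac{M \cdot W}{R} \right), \qquad 0 < \alpha \le 1. $$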

Fig. 6(b) shows the expected tendency of the training time with respect to $h$ and $r$. The most ideal training time for $M$ worker nodes learning a model whose single-node minibatch computation time is $C$ is the pure computation time $T_{comp} = I \cdot C / M$ (orange dotted line in Fig. 6(b)). However, even when we train in parallel, the nodes must exchange their learning status. Therefore, the realistic training time of synchronous SGD, which includes the communication time, follows the red line in Fig. 6(b).

In addition to the upper bound on the computation time, $\phi$ and the encoded operations $f_i'$ need to be designed in consideration of frequently used operations. Fig. 3 shows the GPU profiling result of AlexNet training with Caffe, and it suggests that operations such as gemm take most of the computation time. It is expected that, if we can significantly reduce the time required for computing gemm in the compressed representation, the effect of the operation overhead will be much weaker. A small numerical illustration of this point follows. Our future work is to propose a detailed homomorphic compression method.
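
As a small numerical illustration (with a hypothetical kernel-time breakdown, not the measured profile of Fig. 3), the effective operation overhead can be viewed as a time-weighted average of per-operation overheads:

# Hypothetical kernel-time breakdown and per-operation overheads (assumed values,
# NOT measurements). It illustrates why reducing the cost of gemm in the
# compressed domain dominates the effective operation overhead h.
time_fraction = {"gemm": 0.80, "relu": 0.05, "pooling": 0.05, "other": 0.10}
op_overhead   = {"gemm": 0.60, "relu": 1.50, "pooling": 1.50, "other": 1.20}   # t_{f'} / t_f

h_effective = sum(time_fraction[op] * op_overhead[op] for op in time_fraction)
print(round(h_effective, 3))   # below 1.0 here because the dominant gemm got cheaper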

References