Applications of deep learning to solve problems such as classification, regression and prediction have seen tremendous growth in this era because of its success in solving these problems. The availability of large-scale datasets and high-performance computing devices have also enlarged the scope of deep learning algorithms. Although deep learning is helping us find automated solutions to many real-world problems, the time required for training a large deep learning model has always been a bottleneck. It is seen that some deep learning models even take several days to finish. The time required for training a deep learning model depends mainly on two things: volume of the dataset and size of the model.
Nowadays, because of the availability of various sensor devices, data volume is increasing rapidly. It is a common and also required practice to feed a deep learning model with a huge volume of dataset. This immense size of training data results in long training time which is a bottleneck of deep learning models. Besides the size of the dataset, various giant technology companies need to train deep learning models with a lot of layers and parameters. For example, a popular deep learning model, ResNet-50[resnet] has around 25.6 million parameters. This numerous number of parameters is another reason for slow deep learning training.
As data size increases, deep learning models need to process a large set of data in memory. Today, because of the improvement of computing devices, feeding large data into memory is not a problem. Google and some other companies have computing devices that are so powerful that even the largest deep learning models cannot make full use of these computing devices. Considering that these super-powerful computing devices are available, is it possible to reduce the training time of the largest deep learning models significantly without affecting the accuracy?
It is observed that an increase in the batch size results in a decrease in total training time. Increasing the batch size up to a certain threshold is also good for the accuracy of deep learning. Another important fact to note is that increasing the batch size indefinitely reduces the model accuracy significantly which prevents deep learning practitioners to keep the batch size within reasonable limit although they might have powerful computing devices to feed the data into memory. Besides increasing the batch size, another possible solution to reduce the training time is parallelizing deep learning. Training process of a deep learning model can be parallelized in two ways: parallelizing data and parallelizing models. Machine learning systems such as TensorFlow allow us to parallelize deep learning training. Besides, we also have distributed systems such as Apache Spark[apache-spark]. Some other machine learning libraries are also available on top of Apache Spark which include but are not limited to SystemML[systemml], MLlib[mllib], etc. All these distributed systems are helping us parallelize deep learning training to reduce training time. Is it possible to further reduce the training time as well as increase the batch size while maintaining the test accuracy of the model?
As we discussed earlier, training of a deep learning model can be parallelized through either data parallelism or model parallelism. We cannot only rely on data parallelism because the model size (number of parameters and layers) is too big sometimes which makes model parallelism a mandatory option. Again, model parallelism can be done in two ways: 1) parallelize within each layer and 2) parallelize across different layers. Each of these approaches has problems. The problem with parallelizing within each layer is that the wide model is not efficient. A deep model can be better than a wide model. Also, the problem with parallelizing across different layers is that parallel efficiency is low which is 1/P. So, the solution is combining model parallelism with data parallelism.
Scaling a machine learning model is a difficult task. Most often, it results in the generalization problem. The generalization problem indicates we might end up in high training accuracy while test accuracy is very low. A promising solution can be auto-tuning the learning rate while scaling the batch size. Another idea is using different learning rates for different layers. This idea of layer-wise adapting the learning rate for increased batch size was first introduced by LARS[lars] for deep learning in systems such as TensorFlow. It showed that batch size can be increased significantly even on the CPU by the use of a layer-wise adaptive learning rate technique.
A standard value of learning rate is effective to train a model perfectly, but it always results in a long training time. Again, a larger value learning rate speeds up the training process, but it might result in under-fitting the model. Learning rate decay is a proven solution to mitigate this problem. The idea of learning rate decay is to start with an initial learning rate and to change the learning rate after every epoch at a constant rate. The problem is updating the learning rate at a constant rate after every epoch does not help much which inspires the invention of algorithms such as LARS[lars] and LAMB[lamb]. Inspired by the success of LARS and LAMB, our proposal is to apply these algorithms to speed up machine learning on top distributed machine learning systems such as SystemML. For this phase of our project, we focus only on LARS optimizer instead of trying both LARS and LAMB optimizers.
We selected SystemML in this project for two reasons. Firstly, this distributed machine system is not yet as optimized as TensorFlow or PyTorch for deep learning. The LARS implementation is not available for this framework. Our second reason for selecting SystemML is that it can be used on top of Apache Spark. Although distributed deep learning is supported by TensorFlow and PyTorch, Apache Spark seems to be very effective for inter-cluster communication. This is because Apache Spark is mainly developed for distributed and large scale systems. Spark RDDs are very fast and effective for remote memory access. The summary of our contribution is outlined below:
Applying LARS optimizer to a deep learning model implemented using SystemML.
Comparing test accuracy of LARS optimizer with that of Stochastic Gradient Descent.
Evaluating test and training accuracy by increasing batch size.
Reducing the generalization error using LARS optimizer.
Finding the challenges in applying LARS optimization technique to SystemML.
2 Related Work
The use of machine learning and deep learning-based algorithms has been increasing for a long time for various applications. Various platforms like Tensorflow, PyTorch, and Keras provide extensive functionality in order to support data parallelism and model parallelism. While data parallelism has its own disadvantage of not getting fit if the model is too large even if it can support accelerating the training process. The model parallelism has been proposed in various applications in order to solve this issue. Recently model parallelism has been successfully implemented for relational databases[declarative-rdbms]. This computation on an RDBMS has been implemented by applying machine learning-based algorithms. They proposed that different parts of the model can be stored in a set of tables and SQL queries can be used for computation purposes. RDBMS has been studied widely for distributed computing for a long time. The paper has also discussed challenges in involving RDBMS with ML/DL algorithms. They have introduced multi-dimensional array-like indices for database tables, the improved query optimizer. It uses SimSQL which scales to large model size and can outperform TensorFlow. It shows that model parallelism for RDBMS based ML system can be scaled well and perform better than even Tensorflow.
The paper LARS[lars]
focuses on speeding up deep neural network training. Their approach focuses on increasing batch size using Layer-wise Adaptive Rate Scaling(LARS) for efficient use of massive resources. They train Alexnet and Resnet-50 using the Imagenet-1k dataset while maintaining its state of the art accuracy. However, they successfully increase the batch size to larger than 16k, thereby, reducing the training time of 100 epoch Alexnet from hours to 11 minutes and the 90-epoch ResNet-50 from hours to 20 minutes. Their system, for 90 epoch Resnet 50 returns an accuracy of 75.4% with a batch size of 32k but reduces to 73.2% when it is 64k. Their system clearly shows the extent to which large scale computers are capable of accelerating the training of a DNN by using massive resources with standard Imagenet-1k.
they use a novel layerwise adaptive large batch optimization technique called LAMB. While LARS works well for Resnet, it performs poorly for attention models like BERT[bert]
. However, empirical results showed superior results with LAMB across systems like BERT with very little hyperparameter tuning. BERT could be trained with huge batch sizes of 32868 without degradation in performance. Hence, the training time of BERT reduced from 3 days to 76 minutes.
In this section, we define the architecture of our convolutional neural network at first. After that, we discuss LARS[lars] optimization technique in details.
3.1 Model Architecture
Our CNN model consists of 2 convolution layers followed by 3 fully connected layers including the classification layer. The first convolution layer has 6 filters of size 5 × 5 with zero padding. The second convolution layer has 16 filters of size 5×5 with zero padding. There are two MAX Pooling layers with 2x2 filter. Each pooling layer follows one convolution layer. The first and second fully connected layers have dimensions 120 and 84 respectively, while the third fully connected layer is of size 10. We apply ReLU activation function for all layers except the last layer, which has softmax activation. We apply cross-entropy loss function in the last layer. We don't apply any dropout to any layer. The architectural structure of our convolutional neural network is shown in Figure1.
3.2 LARS Optimizer
Standard Stochastic Gradient Descent (SGD)[sgd] uses same Learning Rate (LR) in each layer. The problems are two-fold. Firstly, if the learning rate is small, the convergence of training might take too much time. Sometimes, it might not even converge. Secondly, if the learning rate is high, the training process might end up with under-fitting of the model. In order to prevent under-fitting, the initial phase of training should be highly sensitive to the weight initialization and initial learning rate[lars]. LARS[lars] claims that the L2-norm of weights and gradients varies significantly between weights and biases, and between different layers. Training might become unstable if learning rate is large compared to this ratio in some layers. In order to solve this problem, there is a popular approach called learning rate warm-up. This approach starts with a small learning rate which can be used for any layer and slowly increases the learning rate.
Instead of maintaining a single global learning rate, LARS maintains separate local learning rate for each layer. Equation 3.2 is used to calculate the weight using the local and global learning rate.
In Equation 3.2, is the global learning rate and is the layer-wise local learning rate. Local learning rate for each layer is defined through a trust co-efficient which is much smaller than 1. Trust co-efficient defines how much a layer can be trusted to change its weight during update. Local learning rate can be calculated using Equation 3.2.
In Equation 3.2, is the trust coefficient. In case of Stochastic Gradient Descent, a term known as weight decay[weight-decay] can be used. Equation for calculating local learning rate can also be extended to balance the learning rate with weight decay according to Equation 3.2.
In Equation 3.2,
is the weight decay term. Local learning rate helps to partially eliminate vanishing and exploding gradient problems[lars].
4 Experimental Evaluation
In this section, we define evaluation strategy and report experimental results. At first, we show the configuration of our machine. After that, we state evaluation metric, evaluation dataset and parameter settings. Finally, we graphically report experimental results.
4.1 System Configuration
We run our experiments in a Intel(R) Core(TM) i7-4790 CPU. The CPU frequency is 3.6 GHz and has 8 cores. L1d, L1i, L2, and L3 caches are 32K, 32K, 256K and 8192K respectively. Operating system version is Ubuntu 18.04.3 LTS x86_64. We use 4 parallel batches to investigate the performance in a parallel and distributed setting.
4.2 Evaluation Metric and Parameter Setting
Dataset Used:MNIST Dataset[mnist-data]
We used three different evaluation metrics to evaluate the performance of our system which include Test Accuracy, Train Accuracy, and Generalization Error.
Generalization Error indicates the difference between training accuracy and test accuracy. Generalization error is high when a model gains very high training accuracy but low test accuracy. Normally, increasing batch size in stochastic gradient descent results in large generalization error. We select this evaluation metric because the goal of our system is to reduce the generalization error.
Parameter Setting: The values of various parameters used for our CNN model training is reported in table 1.
4.3 Experimental Results
This subsection reports all experimental results in a graphical way. Figure 2 compares the test accuracy achieved by both SGD[sgd] and LARS[lars] optimizers. From the figure, it is clear that both approaches perform extremely good for small batch sizes. Both optimizers maintain a test accuracy of around 99% until batch size reaches 1024. After that, test accuracy starts decreasing gradually although they maintain an accuracy of more than 90% until batch size reaches 8000. The difference between performances of these two optimizers is observable after that. The test accuracy of SGD optimizer starts dropping significantly once batch size reaches 16000 while LARS optimizer maintains much higher test accuracy compared to SGD. Test accuracy for SGD drops below 40% after 28000 batch size while LARS optimizer maintains more than 55% test accuracy for similar batch size. Test accuracy for both optimizers drops significantly once batch size reaches 32000 although LARS optimizer performs better yet.
Figure 3 compares the training accuracy achieved by both SGD and LARS. Similar to test accuracy in Figure 2, train accuracy is also higher for both approaches in case of small batch sizes. It is noticeable that train accuracy for SGD is better than its test accuracy. Still, it cannot outperform LARS for large batch sizes. Similar to test accuracy, train accuracy for SGD also decreases significantly once batch size reaches 16000. LARS outperforms SGD for both train accuracy and test accuracy if batch size is larger than or equal to 16000.
Figure 3 compares the generalization error for both SGD and LARS optimizers. As we stated earlier, generalization error is very important in our evaluation because it indicates whether a model is consistent in train accuracy and test accuracy. More generalization error indicates more inconsistency. That's why lower generalization error is always desired. Figure 4 indicates that generalization error for SGD is much higher than LARS optimizer. Although generalization error for SGD is within a desired range for batch sizes smaller than 8000, it increases significantly after that. It can be stated that generalization error for LARS is much smaller than SGD for batch sizes larger than 8000.
The encouragement towards training deep learning models with growing amount of data has been discussed thoroughly. Processing huge amount of data leads to so many real-time issues. The inability to use large batch size to utilize full system resources in highly configured environments is one of those issues which we focus here. We compare test accuracy of SGD and LARS optimizers for various batch sizes in a distributed deep learning setting using SystemML. For the ease of our evaluation, we choose a simple convolutional neural network model and MNIST[mnist-data] dataset for our experimental evaluation. Based on our evaluation, we can conclude that LARS optimizer can be used to train with large batch sizes with distributed machine learning systems such as SystemML.
6 Future Work and Challenges
We plan to extend this task by performing similar experiments with a large deep learning model such ResNet[resnet]. So far, we used MNIST[mnist-data] dataset for our experiments. We are also interested in evaluating our approach with a larger dataset such as ImageNet dataset[imagenet-data]. Since LAMB[lamb] optimizer proved to perform excellent for even large models such as BERT[bert], our another goal is to evaluate the performance of LAMB optimizer with distributed machine learning framework, SystemML.
We faced many challenges when finishing this task. Most of the challenges got raised due to many bugs of our adopted machine learning framework, SystemML. The latest release of this framework has several issues and raised errors once we tried to utilize some of its features. Despite all of these issues with SystemML, we hope to utilize this framework for our future experiments with complex model and large dataset.