Deep learning methods aim to learn a multilayer neural network that can extract the feature hierarchies of the input data by maximizing the likelihood of its training data. Their promise largely lies in the potential to use vast amounts of unlabeled data to learn complex, highly nonlinear models with millions of parameters. In recent competitions, deep learning methods have shown advantages over nearest-neighbor based (shallow) methods such as kernel methods [2, 3] and ensemble methods [4, 5, 6].
In this paper, we consider a well-known machine learning model, deep belief networks (DBNs), that can learn hierarchical representations of their inputs. DBNs have been applied to a number of machine learning applications, including speech recognition, visual object recognition [8, 9], and text processing, among others. In particular, DBNs are especially well suited to problems with high-dimensional inputs, over which they can infer rich models with many hidden layers. For example, when applied to images, a DBN can easily have tens of millions of free parameters, and ideally, we would want to use millions of unlabeled training examples to richly cover the input space.
It has been demonstrated that increasing the scale of deep learning, with respect to the number of training examples, the number of model parameters, or both, can drastically improve ultimate classification accuracy. Unfortunately, with most current algorithms, even training a moderate-sized DBN can take weeks using a conventional implementation on a single CPU. This is primarily due to the daunting computational requirements of DBN training: a large number of parameters need to be trained on the available examples.
To address the DBN scalability problem, this paper proposes an approach to scale up deep belief networks (DBNs) by adapting the idea of random dropout. Random dropout, proposed by Hinton et al., was originally used to prevent complex co-adaptations on the training data on a single processor. On each training case, each hidden unit is randomly omitted from the network with a probability of 0.5, so a hidden unit cannot rely on other hidden units being present. By doing so, many separate DBNs are trained and then applied independently to the test data to reduce the prediction bias of a single DBN.
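As an illustration of the dropout step described above, the following minimal sketch draws a random mask for one training case and zeroes the omitted hidden units. The function names and the toy activation values are hypothetical and chosen only for this example; they are not taken from the implementation described in this paper.

```python
import random

def dropout_mask(num_units, drop_prob, rng):
    """Return a 0/1 mask; each unit is omitted independently with probability drop_prob."""
    return [0 if rng.random() < drop_prob else 1 for _ in range(num_units)]

def apply_dropout(activations, mask):
    """Zero the activations of the omitted units and keep the others unchanged."""
    return [a * m for a, m in zip(activations, mask)]

rng = random.Random(0)                      # fixed seed so the sketch is repeatable
hidden = [0.2, 0.9, 0.5, 0.7, 0.1, 0.4]     # toy hidden-unit activations (assumed values)
mask = dropout_mask(len(hidden), 0.5, rng)  # a fresh mask is drawn for every training case
thinned = apply_dropout(hidden, mask)
```

Because a new mask is drawn for each training case, every case effectively trains a different thinned network that shares weights with the others.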
Our approach extends the random dropout idea to the distributed and parallel setting. Rather than omitting a hidden unit with a probability of 0.5, our approach randomly drops out a portion of hidden units on each processor on each training case. To combine DBNs in each processor, our approach offers four different ways (Section 3.2):
-  performing model averaging with all trained DBNs;
-  using a majority vote over the prediction result of each trained DBN for each test case;
-  (each processor) synchronously updating its parameters after fetching the needed parameters from other processors;
-  (each processor) asynchronously fetching the computed parameters from other processors and pushing its computed parameters to other processors.
As validated in our preliminary evaluation, by using random dropout, our approach outperforms the state-of-the-art DBN algorithms on the same data set, and has the potential to exhibit nearly linear speedup with its parallel implementation.
This paper makes the following contributions:
2 Related Work
Recently, many approaches have been developed to scale up machine learning algorithms within a machine (via multithreading) and across machines (via message passing) [14, 15, 16, 17]. Much of the existing work focuses on linear, convex models, and takes distributed gradient computation as the first step. Some other approaches relax synchronization requirements, exploring delayed gradient updates for convex problems, or exploring lock-less asynchronous stochastic gradient descent on shared-memory architectures (i.e., single machines).
Another way to scale up machine learning algorithms is to provide better abstractions and well-encapsulated computation tools. MapReduce and GraphLab are two notable examples. However, MapReduce, originally designed for parallel data processing, has a number of limitations for training deep belief networks. On the other hand, GraphLab was designed for general unstructured graph computations and does not exploit the computational efficiency available in a typical structured graph such as a deep belief network. Thus, it is still unknown whether the abstraction of GraphLab can be used for training large-scale DBNs.
In the deep learning community, some work has been done to train relatively small models on a single machine. In general, training a many-layer model is computationally intensive. Thus, full model parallelism as well as smart distributed optimization techniques are required. Recent years saw a surge of interest in scaling up the training and inference algorithms used for DBNs [19, 12] and in improving applicable optimization procedures. Existing approaches primarily fall into the following two categories.
Approaches in the first category use graphics processors (GPUs) [12, 8, 21] to achieve significant speedup for training moderate-sized DBNs. The use of GPUs has significantly reduced the computation time of matrix operations, which dominate the computation cost of deep learning algorithms. However, a known limitation of such GPU-based approaches is that the speedup will be small when the model does not fit in GPU memory (typically less than a few gigabytes). Thus, to effectively leverage a GPU, researchers often reduce the model size and the number of parameters to alleviate the impact of insufficient GPU memory. While data and parameter reduction work well for small problems (e.g., acoustic modeling for speech recognition), they are less attractive for realistic problems with a large number of examples and dimensions (e.g., high-resolution images).
Approaches in the second category use model parallelism to achieve scalability. For example, DistBelief is a notable framework that enables model parallelism across machines (via message passing), with the details of parallelism, synchronization and communication managed by the framework. Model parallelism under the DistBelief framework suffers a very large communication overhead due to the dense connections between layers of neurons. Data parallelism is also supported in DistBelief by using multiple replicas of a model to optimize a single objective. However, as pointed out by Hinton et al., a large neural network, such as the one trained by DistBelief, can still perform poorly on held-out test data if the relationship between the input and the correct output is complicated and the network has enough hidden units to model it accurately. In such cases, there will typically be many different settings of the weights that can model the training set almost perfectly. Each of these weight vectors will make different predictions on held-out test data, and almost all of them will do worse on the test data than on the training data because the feature detectors have been tuned to work well together on the training data but not on the test data.
Our approach is inspired by the approaches mentioned above, and aims to address their limitations. With the goal of scaling up deep learning techniques to train very large DBNs, our approach combines the intrinsic parallelism of ensemble learning algorithms with the random dropout approach to improve the generalization results of neural networks. Using random dropout, our approach trains a separate DBN (much smaller than the original DBN) on an individual (graphics) processor in a large cluster, and then combines their results using four proposed methods. Compared to existing approaches, our random dropout-based approach has several noticeable benefits. First, it becomes possible to train a huge number of different networks in a reasonable time, since the number of parameters to be trained on a single machine is much smaller than in the original DBN. Second, our approach permits better use of modern GPU memory due to the reduced model size. Third, data transfer between processors incurs less communication overhead.
3 Proposed Approach
To train large DBNs, we propose an approach that supports distributed computation in neural networks. At a high level, our approach consists of two steps: model parallelism (Section 3.1) and model combination (Section 3.2). In the first step, model parallelism, our approach automatically parallelizes computation in each machine using all available resources, and manages communication, synchronization and data transfer between machines. In the second step, our approach supports four different ways to combine results from each machine to form a unified DBN.
3.1 Model Parallelism
Figure 1 illustrates the model parallelism step on two machines.
It is worth noting that the computational needs of training a DBN on one machine depend on its connectivity structure, and random dropout can significantly reduce the complexity of a DBN as well as the number of parameters: dropping out 50% of the neurons at each layer can reduce the number of parameters (connection weights) by 75%. In general, given a dropout probability , our approach permits model parallelism among at most machines, with each machine updating a disjoint portion of the weight matrix. This is fundamentally different from existing model parallelism approaches. For example, in the DistBelief framework, a user needs to define the computation that takes place at each machine in each layer of the model, whereas our approach distributes the computation of training each DBN fully automatically to each machine. In addition, in a framework like DistBelief, the complexity of a DBN is not reduced; rather, a DBN is partitioned (as a graph) across the available machines, and each machine must communicate frequently (with the parameter server) to update weights. Therefore, large models with high computational demands might benefit from access to more CPUs and memory at the beginning, but will eventually be limited by the point at which communication costs dominate. By contrast, in our approach, DBNs produced by random dropout tend to be more amenable to extensive distribution than fully connected structures, given their less complex structures and lower communication requirements. This also helps alleviate the bottleneck in which many machines wait for the single slowest machine to finish a given phase of computation.
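The 75% figure follows from the fact that a weight survives only if both of the units it connects are kept, so with dropout probability 0.5 on two adjacent layers a fraction 0.5 × 0.5 = 0.25 of their connecting weights remains. The sketch below works this out for the 784–500–500–2000–10 network used in Section 4.1. It assumes that exactly the expected number of hidden units is kept, that input and output units are never dropped, and that biases are ignored; the helper names are hypothetical.

```python
def count_weights(layer_sizes):
    """Number of connection weights in a fully connected layered network (biases ignored)."""
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

def thin(layer_sizes, drop_prob):
    """Expected layer sizes when hidden units are dropped; input and output layers are kept."""
    hidden = [round(n * (1.0 - drop_prob)) for n in layer_sizes[1:-1]]
    return [layer_sizes[0]] + hidden + [layer_sizes[-1]]

full = [784, 500, 500, 2000, 10]
thinned = thin(full, 0.5)                      # hidden layers shrink to 250, 250, 1000

# Weights between two hidden layers: both ends are thinned, so a quarter remains.
hidden_fraction = (thinned[1] * thinned[2]) / (full[1] * full[2])

# Whole network: the first and last weight matrices only lose half of their weights.
total_fraction = count_weights(thinned) / count_weights(full)
```

Under these assumptions the hidden-to-hidden matrices shrink by 75%, while the whole network shrinks somewhat less because the input and output layers are not thinned.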
3.2 Model Combination
Our approach provides four ways to combine the trained DBN from each machine:
-  Averaging the weights of the trained DBN in each machine.
-  Majority voting on the prediction result for test data using the trained DBN in each machine (our implementation breaks possible ties by arbitrarily ranking the prediction results, but this did not occur in our experiments).
-  Synchronously updating parameter weights during DBN training in each machine.
-  Asynchronously updating parameter weights during DBN training in each machine.
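The first two combination methods can be sketched as follows. This is a minimal illustration rather than the implementation evaluated in this paper: it assumes that the weight matrices from all machines have the same shape, and it breaks voting ties by choosing the smallest label, which is one possible arbitrary ranking. All function names are hypothetical.

```python
from collections import Counter

def average_weights(weight_sets):
    """Element-wise mean of equally shaped weight matrices, each given as a list of rows."""
    n = len(weight_sets)
    rows, cols = len(weight_sets[0]), len(weight_sets[0][0])
    return [[sum(ws[i][j] for ws in weight_sets) / n for j in range(cols)]
            for i in range(rows)]

def majority_vote(predictions):
    """Combine per-model label lists; ties go to the smallest label (an assumed ranking)."""
    voted = []
    for labels in zip(*predictions):          # labels from all models for one test case
        counts = Counter(labels)
        best = max(counts.values())
        voted.append(min(label for label, c in counts.items() if c == best))
    return voted

# Toy usage with two 1x2 weight matrices and three models voting on three test cases.
averaged = average_weights([[[1.0, 2.0]], [[3.0, 4.0]]])
combined = majority_vote([[1, 2, 3], [1, 2, 4], [0, 2, 4]])
```

Weight averaging yields a single network to apply at test time, whereas majority voting keeps every trained DBN and requires one forward pass per model for each test case.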
The first two ways for model combination are straightforward, and are omitted in this paper for space reasons. We next describe how to update parameter weights (a)synchronously.
Figure 2 sketches the asynchronous parameter weight updating algorithm in each machine for each training case. This lock-free asynchronous algorithm prevents each machine from waiting for others to finish before proceeding to the next training case, at the cost of sacrificing the consistency of parameters: it is possible that two machines update the same parameters simultaneously without an explicit ordering. As a consequence, this asynchronous algorithm may introduce additional stochasticity in training.
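The lock-free behavior described above can be illustrated with a small thread-based sketch. It is not the algorithm of Figure 2 itself: threads stand in for machines, a shared Python list stands in for the exchanged parameters, and the quadratic toy objective, the target values, and all names are assumptions made for this example.

```python
import threading

def async_worker(shared_weights, indices, gradient_fn, num_steps, learning_rate):
    """Update the weights this worker touches without taking any lock."""
    for _ in range(num_steps):
        for i in indices:
            w = shared_weights[i]                                      # fetch: may be stale
            shared_weights[i] = w - learning_rate * gradient_fn(i, w)  # push: may overwrite

def train_async(initial_weights, index_sets, gradient_fn, num_steps=100, learning_rate=0.1):
    """Run one worker thread per index set and return the shared weights afterwards."""
    shared_weights = list(initial_weights)
    workers = [threading.Thread(target=async_worker,
                                args=(shared_weights, indices, gradient_fn,
                                      num_steps, learning_rate))
               for indices in index_sets]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return shared_weights

TARGETS = [1.0, -2.0, 0.5, 3.0]   # assumed optimum of the toy objective

def toy_gradient(i, w):
    """Gradient of the toy objective 0.5 * (w - TARGETS[i]) ** 2."""
    return w - TARGETS[i]

# Two workers whose index sets overlap on weights 1 and 2, so updates may interleave.
final_weights = train_async([0.0] * 4, [[0, 1, 2], [1, 2, 3]], toy_gradient)
```

Because no ordering is enforced on the overlapping weights, their final values can differ slightly from run to run, which is the additional stochasticity mentioned above.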
Figure 3 sketches the synchronous parameter weight updating algorithm in each machine. In this algorithm, a central parameter server stores the weights of all parameters. At the end of each mini-batch (a set of training data), each machine sends a request to the parameter server to fetch the needed parameter weights. If other machines are updating the requested parameters, this machine needs to wait until all machines finish their updates. Compared to the asynchronous algorithm, this algorithm eliminates possible data races and improves the consistency of shared parameters, but introduces higher overhead.
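A minimal sketch of this synchronous scheme is given below. It is an illustration under stated assumptions rather than the algorithm of Figure 3: threads stand in for machines, the `ParameterServer` class and its methods are hypothetical, waiting for other machines is modeled with a barrier, and the toy quadratic objective is chosen only to make the example runnable.

```python
import threading

class ParameterServer:
    """Central store for all weights; a lock serializes fetches and pushes."""
    def __init__(self, weights):
        self._weights = list(weights)
        self._lock = threading.Lock()

    def fetch(self, indices):
        with self._lock:
            return {i: self._weights[i] for i in indices}

    def push(self, deltas):
        with self._lock:
            for i, d in deltas.items():
                self._weights[i] += d

    def snapshot(self):
        with self._lock:
            return list(self._weights)

def sync_worker(server, barrier, indices, gradient_fn, num_batches, learning_rate):
    for _ in range(num_batches):
        local = server.fetch(indices)           # fetch the needed weights
        deltas = {i: -learning_rate * gradient_fn(i, w) for i, w in local.items()}
        barrier.wait()                          # every worker has read the same values
        server.push(deltas)
        barrier.wait()                          # every update is applied before the next fetch

def train_sync(initial_weights, index_sets, gradient_fn, num_batches=100, learning_rate=0.1):
    server = ParameterServer(initial_weights)
    barrier = threading.Barrier(len(index_sets))
    workers = [threading.Thread(target=sync_worker,
                                args=(server, barrier, indices, gradient_fn,
                                      num_batches, learning_rate))
               for indices in index_sets]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return server.snapshot()

SYNC_TARGETS = [1.0, -2.0, 0.5]   # assumed optimum of the toy objective

def sync_toy_gradient(i, w):
    """Gradient of the toy objective 0.5 * (w - SYNC_TARGETS[i]) ** 2."""
    return w - SYNC_TARGETS[i]

# Two workers that share weight 1; the barrier keeps their views of it consistent.
synced = train_sync([0.0, 0.0, 0.0], [[0, 1], [1, 2]], sync_toy_gradient)
```

The two barrier waits per mini-batch are where the higher overhead comes from: every worker idles until the slowest one has finished reading and then until it has finished writing.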
We implemented our approach in a prototype using Matlab and Python. Our prototype uses the Theano library to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Specifically, our implementation uses Theano to achieve significant speedup in data-intensive calculations through its transparent use of a GPU. When combining the trained DBNs, our implementation uses inter-process communication (IPC) to exchange data among multiple threads in one or more processes. Our implementation is publicly available at: http://deeplearning.googlecode.com.
4.1 Details of dropout training on MNIST dataset
The architecture of a deep network (the number of layers and the number of hidden units in each layer) varies across benchmark tasks. Our first step is to develop a prototype and evaluate its effectiveness on the MNIST handwritten digits dataset, which consists of 28×28 digit images: 60,000 for training and 10,000 for testing. The objective is to classify the digit images into their correct digit class. We experimented with a neural network of size 784–500–500–2000–10, and pre-trained that network as layer-wise Restricted Boltzmann Machines (RBMs) for 50 epochs. Here one epoch means one pass through the training data. During the pre-training phase, we employ a greedy Contrastive Divergence–1 learning algorithm. The learning rate decays exponentially, with an initial value of 10.0 and a decay rate of 0.998 for each epoch of training. Weights are updated at the end of each mini-batch of size 100. Momentum is also used, with an initial value of 0.5 and a linear increase of 0.001 per epoch over the first 490 epochs, after which it stays at 0.99. For fine-tuning using back-propagation with dropout for 200 epochs, we employ stochastic gradient descent with mini-batches of size 100. We use the same dropout rates for all hidden units and dropout for input pixels. A constant learning rate of 1.0 is used, and no constraints are imposed on the norms of the weight vectors.
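The learning rate and momentum schedules stated above can be written down directly. The sketch below assumes that epochs are counted from 0 and that the momentum grows by 0.001 per epoch, which is consistent with reaching 0.5 + 490 × 0.001 = 0.99 after 490 epochs; the function names are hypothetical.

```python
def learning_rate_at(epoch, initial=10.0, decay=0.998):
    """Exponentially decaying learning rate: initial * decay ** epoch."""
    return initial * decay ** epoch

def momentum_at(epoch, initial=0.5, increment=0.001, ramp_epochs=490, final=0.99):
    """Momentum grows linearly for ramp_epochs epochs and then stays at final."""
    if epoch >= ramp_epochs:
        return final
    return initial + increment * epoch

# Values for the first few epochs (epoch 0 is assumed to be the first pass).
schedule = [(e, learning_rate_at(e), momentum_at(e)) for e in range(3)]
```

With these settings the learning rate shrinks by 0.2% per epoch, so after 50 epochs it is still roughly 90% of its initial value.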
4.2 Generalization Performance as a function of dropout probability
In the original dropout article, Hinton et al. claim that better generalization performance can be achieved with various dropout probabilities. With the implementation details given in Section 4.1, here we show how the test error rate varies as a function of the dropout probability. As demonstrated in Figure 4, dropout does decrease the test error rate by about 0.1% (10 fewer misclassified examples in the test data set). Contrary to their claim, we found that the generalization performance of dropout actually depends on the dropout probability. When the dropout probability is greater than 0.6, the test error rate increases significantly. This inconsistency might be due to the far fewer training epochs used in our implementation (1,000 epochs were used in Hinton's paper).
4.3 Generalization Performance of our distributed algorithm
Here we evaluate the generalization performance on the MNIST dataset using the various algorithms for combining results from different machines, as listed in Section 3.2.
| Algorithm | Sequential | Weight Averaging | Majority Vote | Synchronous Update | Asynchronous Update |
|---|---|---|---|---|---|
| Error Rate (%) | 1.08 | 0.98 | 1.04 | 0.97 | 1.06 |
Both the weight averaging and synchronous update algorithms achieved a notable improvement in generalization performance. Surprisingly, the majority vote method did not reduce the test error rate by a large margin. The asynchronous update algorithm introduced additional noise in the weight updates through its lock-free mechanism, so its generalization performance was about the same as that of the sequential dropout algorithm. However, as exhibited in Figure 5, both the synchronous and asynchronous update algorithms converge faster than the sequential dropout algorithm.
Our current evaluation was performed on a desktop with a dual-core Intel E7400 processor, 3 GB of RAM, and an NVIDIA 8800GS graphics card. Pre-training and fine-tuning are generally very time-consuming on this machine. Due to time constraints, we were only able to evaluate our proposed algorithm on the relatively small MNIST dataset. However, we plan to further evaluate our algorithm on other speech and object recognition benchmark tasks such as the TIMIT Acoustic-Phonetic Continuous Speech Corpus, the Reuters Corpus for news article topic recognition, and the ImageNet dataset of millions of labeled images in thousands of categories.
This paper proposes an approach to scale up deep belief networks (DBNs). At the core of our approach is the use of random dropout to prevent co-adaptations on the training data for a DBN, reduce overfitting, and enable DBN training to use the computational power of clusters in a distributed environment. Empirically, our implementation outperforms some state-of-the-art approaches and promises nearly linear speedups. Furthermore, our approach allows parallel training of a DBN even when some gradients are computationally intensive.
For future work, it would be interesting to compare our approach with other approaches using different abstractions [20, 29]. For example, the PowerGraph abstraction exploits the internal structure of graph programs to address the challenges of computation on natural graphs. Thus, it may be possible to adapt a similar random dropout idea to reduce memory consumption and communication between processors. An investigation into how to generalize this approach to other structures and problems would enable even faster computation for machine learning problems.
-  Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Comput., 18(7):1527–1554, July 2006.
-  Songlin Zhao. From fixed to adaptive budget robust kernel adaptive filtering. University of Florida, 2012.
-  Bernhard Schölkopf and Alex Smola. Support vector machines. Encyclopedia of Biostatistics, 1998.
-  Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
-  Tianqi Chen, Hang Li, Qiang Yang, and Yong Yu. General functional matrix factorization using gradient boosting. In Proceeding of 30th International Conference on Machine Learning (ICML’13), volume 1, pages 436–444, 2013.
-  Nan Wang. A new boosting algorithm based on dual averaging scheme. International Journal of Innovative Science and Modern Engineering, 3(9):18–22, 2015.
-  Li Deng, Dong Yu, and J. Platt. Scalable stacking and learning for building deep architectures. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 2133–2136, March 2012.
-  Dan Claudiu Ciresan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber. Deep big simple neural nets excel on handwritten digit recognition. CoRR, abs/1003.0358, 2010.
-  Jia Wu, Raymond Tse, and Linda G Shapiro. Learning to rank the severity of unrepaired cleft lip nasal deformity on 3d mesh data. In Pattern Recognition (ICPR), 2014 22nd International Conference on, pages 460–464. IEEE, 2014.
-  Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155, March 2003.
-  Quoc Le, Jiquan Ngiam, Adam Coates, Ahbik Lahiri, Bobby Prochnow, and Andrew Ng. On optimization methods for deep learning. In Lise Getoor and Tobias Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), ICML ’11, pages 265–272, New York, NY, USA, June 2011. ACM.
-  Rajat Raina, Anand Madhavan, and Andrew Y. Ng. Large-scale deep unsupervised learning using graphics processors. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 873–880, New York, NY, USA, 2009. ACM.
-  Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.
-  John Langford, Alexander J. Smola, and Martin Zinkevich. Slow learners are fast. In NIPS, pages 2331–2339, 2009.
-  Gideon Mann, Ryan McDonald, Mehryar Mohri, Nathan Silberman, and Daniel D. Walker. Efficient large-scale distributed training of conditional maximum entropy models. In Advances in Neural Information Processing Systems, 2009.
-  Ryan McDonald, Keith Hall, and Gideon Mann. Distributed training strategies for the structured perceptron. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT ’10, pages 456–464, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
-  Martin Zinkevich, Markus Weimer, Alex Smola, and Lihong Li. Parallelized stochastic gradient descent. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R.S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2595–2603, 2010.
-  Feng Niu, Benjamin Recht, Christopher Re, and Stephen J. Wright. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, pages 2331–2339, 2011.
-  Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large clusters. Commun. ACM, 51(1):107–113, January 2008.
-  Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. Powergraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’12), Hollywood, October 2012.
-  George E. Dahl, Dong Yu, Li Deng, and Alex Acero. Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 2012.
-  L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 2012.
-  Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc Le, Mark Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Ng. Large scale distributed deep networks. In Advances in Neural Information Processing Systems 25, 2012.
-  The Theano Library. http://deeplearning.net/software/theano/.
-  W. Richard Stevens, Bill Fenner, and Andrew M. Rudoff. UNIX Network Programming, Vol. 1. Pearson Education, 3 edition, 2003.
-  A. Mohamed, G. Dahl, and G. Hinton. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 20(14), 2012.
-  Tony Rose, Mark Stevenson, and Miles Whitehead. The Reuters corpus volume 1 - from yesterday's news to tomorrow's language resources. In Proceedings of the Third International Conference on Language Resources and Evaluation, pages 29–31, 2002.
-  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
-  Aapo Kyrola, Guy Blelloch, and Carlos Guestrin. Graphchi: Large-scale graph computation on just a pc. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’12), Hollywood, October 2012.