Deep Learning has become a powerful machine learning model. It differs from traditional machine learning approaches in two respects. First, Deep Learning contains multiple non-linear hidden layers and can learn very complicated relationships between inputs and outputs; deep architectures using multiple layers outperform shallow models. Second, there is no need to extract hand-designed features, which reduces the dependence on the quality of human-extracted features. We mainly study three Deep Learning models in this work: Deep Neural Network (DNN), Deep Belief Network (DBN) and Convolution Neural Network (CNN).
Several unsupervised pretraining methods for neural networks have been proposed to improve on the performance of randomly initialized DNNs, such as using stacks of RBMs (Restricted Boltzmann Machines) [4] or DBMs (Deep Boltzmann Machines) [5]. Compared to random initialization, pretraining followed by finetuning with backpropagation improves performance significantly. Deep Belief Network (DBN) is a generative unsupervised pretraining network that uses stacked RBMs during pretraining. A DNN initialized from a correspondingly configured DBN often produces much better results. A DBN has undirected connections between its top two layers and directed connections between all its lower layers.
Convolution Neural Network (CNN) has been proposed to deal with images, speech and time series, because a standard DNN has some limitations. First, images and speech signals are usually large. A simple neural network that processes an image of size $100 \times 100$ with 1 layer of 100 hidden neurons requires 1,000,000 ($100 \times 100 \times 100$) weight parameters. With so many variables, the network overfits easily, and computing a standard DNN model is also expensive in memory. Second, a standard DNN does not consider the local structure and topology of the input. For example, images have strong 2D local structure, and many areas in an image are similar. Speech has strong 1D structure, where variables that are temporally nearby are highly correlated. CNN forces the extraction of local features by restricting the receptive fields of hidden neurons to be local.
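The parameter-count argument can be checked with a few lines of arithmetic. The sketch below assumes a $100 \times 100$ single-channel input (the size consistent with the 1,000,000 figure) and a hypothetical $5 \times 5$ receptive field; the function names are illustrative and biases are ignored.

```python
def dense_params(height, width, hidden_units):
    """Weights in one fully connected layer: every pixel feeds every unit."""
    return height * width * hidden_units


def local_params(field_h, field_w, feature_maps):
    """Weights in one convolutional layer with shared local receptive fields
    (single input channel assumed)."""
    return field_h * field_w * feature_maps


# Assumed sizes: 100 x 100 image, 100 hidden units or feature maps,
# and a hypothetical 5 x 5 receptive field.
dense = dense_params(100, 100, 100)
conv = local_params(5, 5, 100)
print(dense, conv)
```

Under these assumptions the locally connected, weight-sharing layer needs several hundred times fewer weights than the fully connected one, which is the motivation given in the text.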
However, the training process for deep learning algorithms, including DNN, DBN and CNN, is computationally expensive. This is due to the large amount of training data and the large number of parameters across multiple layers. Inspired by the shrinking technique [10, 11] used to accelerate the computation of the Support Vector Machine (SVM) algorithm and the screening technique [12, 13] used in LASSO, we propose an accelerating algorithm, shrinking Deep Learning with Recall (sDLr). The main contribution of sDLr is that it can reduce the running time significantly. Although there is a trade-off between classification improvement and training-time speedup, for some data sets the sDLr approach can even improve classification accuracy. It should be noted that sDLr is a general approach and a new way of thinking, which can be applied to both large data with large networks and small data with small networks, and to both sequential and parallel implementations. We study the impact of the proposed accelerating approach on DNN, DBN and CNN using 4 data sets from computer vision, high energy physics and biology.
The amount of data in our world has been exploding. Analyzing large data sets, so-called big data, will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer interest [14]. Many big data technologies, including cloud computing and dimensionality reduction, have been proposed [15, 16, 17, 18, 19]. Analyzing big data with machine learning algorithms requires special hardware implementations and a large amount of running time.
SVM [20] solves the following optimization problem:

$$\min_{w,\,\xi}\ \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.}\quad y_i\, w^T \phi(x_i) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1, \dots, n, \tag{1}$$

where $x_i$ is a training sample, $y_i$ is the corresponding label, $\xi_i$ is a positive slack variable, $\phi$ is a mapping function, $w$ gives the solution and is known as the weight vector, and $C$ controls the relative importance of maximizing the margin and minimizing the amount of slack. Since an SVM learning problem has far fewer support vectors than training examples, shrinking [10, 11] was proposed to eliminate training samples in large learning tasks where the fraction of support vectors is small compared to the training sample size, or where many support vectors are at the upper bound of the Lagrange multipliers.
LASSO [21] is an optimization problem that finds a sparse representation of a signal with respect to a predefined dictionary. It solves the following problem:

$$\min_{z}\ \frac{1}{2}\|x - Dz\|_2^2 + \lambda \|z\|_1, \tag{2}$$

where $x \in \mathbb{R}^{p}$ is a testing point, $D \in \mathbb{R}^{p \times n}$ is a dictionary with dimension $p$ and size $n$, and $\lambda$ is a parameter that controls the sparsity of the representation $z$. When both $p$ and $n$ are large, which is usually the case in practical applications such as denoising or classification, the problem is difficult and time-intensive to solve. Screening [12, 13] is a technique that reduces the size of the dictionary using certain rules in order to accelerate the computation of LASSO.
Both shrinking in SVM and screening in LASSO try to reduce the amount of data involved in the computation. Inspired by these two techniques, we propose a faster and reliable approach for deep learning, shrinking Deep Learning.
III Shrinking Deep Learning
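Given testing points $x_i$, $i = 1, \dots, n$, let the class indicator vector be $y_i \in \mathbb{R}^{1 \times c}$, where $n$ is the number of testing samples and $c$ is the number of classes; $y_i$ has all $0$s except for a single $1$ that indicates the class of the test point. Let the output of a neural network for testing point $x_i$ be $F_i \in \mathbb{R}^{1 \times c}$. $F_i$ contains continuous values and is the $i$-th row of $F$.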
III-A Standard Deep Learning
Algorithm 1 gives the framework of standard deep learning. During each epoch (iteration), standard deep learning first runs forward propagation on all training data and then computes the output $F$, where the output is a function of the weight parameters $w$. Deep learning tries to find an optimal $w$ that minimizes the error loss $E$, which can be the sum squared error loss (DNN and DBN in our experiments) or the softmax loss (CNN in our experiments). In the backpropagation process, deep learning updates the weight parameter vector using gradient descent. For a training sample $x_i$, gradient descent can be denoted as:

$$w \leftarrow w - \eta \frac{\partial E}{\partial w}, \tag{3}$$

where $\eta$ is the step size.
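As a minimal illustration of the gradient descent update, the sketch below applies the rule "new weight = old weight minus step size times gradient" to a toy one-dimensional loss. The loss $(w - 3)^2$, the step size 0.1 and the helper name `sgd_step` are assumptions chosen for the example, not part of the experiments.

```python
def sgd_step(weights, grads, step_size):
    """One gradient descent update: w <- w - step_size * dE/dw."""
    return [w - step_size * g for w, g in zip(weights, grads)]


# Toy loss E(w) = (w - 3)^2 with gradient 2 * (w - 3); minimum at w = 3.
w = [0.0]
for _ in range(50):
    grad = [2.0 * (w[0] - 3.0)]
    w = sgd_step(w, grad, step_size=0.1)

print(w[0])
```

Each step moves the weight against its gradient, so a sample that contributes a larger gradient moves the weights further.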
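Before we present the shrinking Deep Learning algorithm, we first give Lemma 1.
Lemma 1. The magnitude of the gradient $\partial E / \partial w$ in Eq.(3) is positively correlated with the error $E$.
In the case of sum squared error, the error loss of sample $x_i$ is given as:

$$E = \frac{1}{2} \sum_{k=1}^{c} (F_{ik} - y_{ik})^2, \tag{4}$$

where $F_{ik}$ and $y_{ik}$ are the $k$-th entries of the network output and of the class indicator vector for sample $x_i$.
Using Eq.(4), the gradient is:

$$\frac{\partial E}{\partial w} = \sum_{k=1}^{c} (F_{ik} - y_{ik}) \frac{\partial F_{ik}}{\partial w}. \tag{5}$$

As we can see from Eq.(5), $\partial E / \partial w$ is linearly related to $(F_i - y_i)$. Data points with larger error have a larger gradient and thus a stronger correction signal when updating $w$. Data points with smaller error have a smaller gradient and thus a weaker correction signal when updating $w$.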
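The linear relation between residual and gradient can be checked numerically. The sketch below assumes the simplest possible network, a single linear output $F = w \cdot x$ with squared error $\frac{1}{2}(F - y)^2$, so that the gradient is $(F - y)\,x$; the weights, inputs and targets are made-up values.

```python
import math


def output(w, x):
    """Linear output F = w . x (a stand-in for a network's forward pass)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def squared_error_grad(w, x, y):
    """Gradient of E = 0.5 * (F - y)^2 with respect to w, i.e. (F - y) * x."""
    residual = output(w, x) - y
    return [residual * xi for xi in x]


def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))


w = [0.5, -0.2]
x = [1.0, 2.0]  # F = 0.5 * 1.0 - 0.2 * 2.0 = 0.1

small = norm(squared_error_grad(w, x, y=0.2))  # residual about -0.1
large = norm(squared_error_grad(w, x, y=1.1))  # residual about -1.0
print(small, large, large / small)
```

For the same input, a residual ten times larger gives a gradient norm ten times larger, which is the behaviour the lemma relies on.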
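In the case of the softmax loss function, $E$ is denoted as:

$$E = -\sum_{k=1}^{c} y_{ik} \log p_{ik}, \tag{6}$$

$$p_{ik} = \frac{\exp(F_{ik})}{\sum_{j=1}^{c} \exp(F_{ij})}, \tag{7}$$

where $F_{ik}$ and $y_{ik}$ are the $k$-th entries of the network output and of the class indicator vector for sample $x_i$.
Using Eq.(6), the gradient is:

$$\frac{\partial E}{\partial w} = \sum_{k=1}^{c} (p_{ik} - y_{ik}) \frac{\partial F_{ik}}{\partial w}. \tag{8}$$

Now let us look at the relation between the softmax loss function (Eq.(6)) and its gradient with respect to the weight parameters (Eq.(8)). For example, suppose the given point $x_i$ is in class $j$, so $y_{ij} = 1$ and $y_{ik} = 0$ for $k \neq j$. When $F_{ij}$ is large, $p_{ij} \to 1$ and the softmax loss function (Eq.(6)) is very small. For the gradient of the softmax loss function (Eq.(8)), when $k = j$, $(p_{ik} - y_{ik})$ is close to $0$; when $k \neq j$, $(p_{ik} - y_{ik})$ is also close to $0$. In summary, when the softmax loss function (Eq.(6)) is very small, its gradient (Eq.(8)) is also very small.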
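The same observation for the softmax loss can be illustrated numerically. The sketch below uses the standard identity that the derivative of the softmax loss with respect to the class scores is the predicted probability minus the class indicator; the two score vectors are made-up examples of a confident and an unsure prediction for class 0.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(scores, true_class):
    """Negative log-probability of the true class."""
    return -math.log(softmax(scores)[true_class])


def grad_wrt_scores(scores, true_class):
    """Derivative of the loss with respect to each score: p_k - y_k."""
    p = softmax(scores)
    return [pk - (1.0 if k == true_class else 0.0) for k, pk in enumerate(p)]


confident = [8.0, 0.0, 0.0]  # large score for the true class 0
unsure = [0.5, 0.0, 0.0]     # scores nearly tied

for scores in (confident, unsure):
    loss = softmax_loss(scores, 0)
    largest = max(abs(g) for g in grad_wrt_scores(scores, 0))
    print(round(loss, 4), round(largest, 4))
```

The confident sample has both a small loss and small gradient entries for every class, while the unsure sample has a large loss and a large gradient.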
III-B Shrinking Deep Learning
In order to accelerate computation, and inspired by shrinking in SVM and screening in LASSO, we propose shrinking Deep Learning in Algorithm 2, which eliminates samples with small error (Eq.(4)) from the training data and uses less data for training.
Algorithm 2 gives the outline of shrinking Deep Learning (sDL). Compared to standard deep learning in Algorithm 1, sDL requires two more inputs: an elimination rate and a stop threshold. The elimination rate is a percentage indicating the amount of training data to be eliminated during one epoch, and the stop threshold is a number indicating that elimination should stop once the current number of training samples falls below it. We maintain an index vector of the active training samples. In Algorithm 1, both forward and backward propagation are applied to all training data. In Algorithm 2, the training process is applied to a subset of the training data. In the first epoch, the index vector includes all training indexes. After forward and backward propagation in each epoch, we select the indexes of the training samples with the smallest error, where the number selected is the elimination rate times the current number of training samples. We then eliminate those indexes from the index vector and update the current number of training samples. Once that number falls below the stop threshold, we stop eliminating training data. Lemma 1 gives the theoretical foundation: samples with small error have a smaller impact on the gradient, so eliminating them does not change the gradient significantly. Figure 2 shows that the errors using sDL are smaller than the errors using DL, which indicates that sDL gives a stronger correction signal and reduces the errors faster.
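The elimination step described above can be sketched as a small helper. This is not Algorithm 2 itself, only one plausible reading of it: the function name `shrink_active_set`, its argument names, and the rounding of the number of dropped samples with `int` are assumptions, and the error values are made up.

```python
def shrink_active_set(active, errors, elimination_rate, stop_threshold):
    """Drop the lowest-error fraction of the active samples.

    active: list of sample indexes still used for training.
    errors: per-sample error, indexed by sample index.
    Returns the new active list; unchanged once it is below stop_threshold.
    """
    if len(active) < stop_threshold:
        return list(active)
    n_drop = int(elimination_rate * len(active))
    ranked = sorted(active, key=lambda i: errors[i])  # smallest error first
    dropped = set(ranked[:n_drop])
    return [i for i in active if i not in dropped]


# Hypothetical per-sample errors after one epoch.
errors = [0.9, 0.1, 0.5, 0.05, 0.7, 0.3, 0.8, 0.2, 0.6, 0.4]
active = list(range(len(errors)))
active = shrink_active_set(active, errors, elimination_rate=0.2, stop_threshold=5)
print(active)
```

In a real training loop the errors would be recomputed from the forward pass of each epoch before the helper is called again.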
When eliminating samples, the elimination rate denotes the percentage of samples to be removed, and we select the indexes of the training samples with the smallest error. Within the same epoch, the threshold used to eliminate samples differs from batch to batch. With several batches in one epoch, each batch needs to drop an equal share of the samples to be eliminated, on average. Let the threshold for dropping the smallest errors in batch $j-1$ be $t_{j-1}$ and the threshold in batch $j$ be $t_j$; $t_{j-1}$ and $t_j$ can differ a lot. We use exponential smoothing [22] to adjust the threshold used in batch $j$: instead of using $t_j$ as the threshold to eliminate samples, we use the following $\hat{t}_j$:

$$\hat{t}_j = \alpha\, \hat{t}_{j-1} + (1 - \alpha)\, t_j, \tag{9}$$

where $\alpha$ is a weight parameter that controls the importance of past threshold values, $0 \le \alpha \le 1$. The intuition behind exponential smoothing is that we want the thresholds used within each epoch to be consistent. Samples with errors less than $\hat{t}_j$ in batch $j$ are eliminated. If $\alpha$ is close to 0, the smoothing effect on the threshold is not obvious; if $\alpha$ is close to 1, the threshold deviates a lot from $t_j$. In practice, we find between and is a good setting in terms of smoothing the threshold. We will show this in the experiment part.
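A minimal sketch of threshold smoothing, assuming the usual exponential smoothing form in which the smoothed threshold is a weighted average of the previous smoothed threshold and the current raw one. The weight name `alpha`, the value 0.5 and the raw thresholds are illustrative assumptions.

```python
def smooth_threshold(previous_smoothed, current_raw, alpha):
    """Exponentially smoothed threshold.

    alpha weights the past value: alpha = 0 returns the raw threshold,
    alpha = 1 ignores the current batch entirely.
    """
    return alpha * previous_smoothed + (1.0 - alpha) * current_raw


# Hypothetical raw per-batch thresholds that fluctuate strongly.
raw = [0.10, 0.30, 0.05, 0.25]
smoothed = [raw[0]]
for t in raw[1:]:
    smoothed.append(smooth_threshold(smoothed[-1], t, alpha=0.5))

print(smoothed)
```

The smoothed sequence varies less from batch to batch than the raw one, which is the consistency the text asks for.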
IV Shrinking with Recall
As the training data in sDL becomes smaller and smaller, the trained weight parameters are based on a subset of the training data and are not optimized for the entire training dataset. We now introduce shrinking Deep Learning with recall (sDLr, Algorithm 3) to deal with this situation. In order to utilize all the training data, once the number of active training samples falls below the threshold, we start to use all training samples again, as shown in Algorithm 3. Algorithm 3 ensures that the trained model is optimized for the entire training data. Shrinking with recall (Algorithm 3) produces classification performance competitive with standard Deep Learning (Algorithm 1). In the experiments, we also investigate the impact of the threshold on the classification results (see Figure 7).
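The shrink-then-recall schedule can be simulated without any network by tracking only how many samples are active in each epoch. This is a sketch under assumptions: recall is taken to happen in the epoch after the active count first falls below the threshold and to last for the rest of training, and all sizes and rates are hypothetical.

```python
def active_counts(n_total, elimination_rate, recall_threshold, epochs):
    """Number of active training samples per epoch under shrink-then-recall."""
    counts = []
    n_active = n_total
    recalled = False
    for _ in range(epochs):
        counts.append(n_active)
        if recalled:
            continue  # after recall, keep using all samples
        if n_active < recall_threshold:
            n_active = n_total  # recall: bring every sample back
            recalled = True
        else:
            n_active -= int(elimination_rate * n_active)
    return counts


counts = active_counts(n_total=1000, elimination_rate=0.2,
                       recall_threshold=500, epochs=8)
print(counts)
print(sum(counts), "sample passes instead of", 8 * 1000)
```

The total number of sample passes is a rough proxy for training cost, so the gap between the two printed totals indicates where the speedup comes from.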
|Dataset||Dimensionality||Training Set||Testing Set|
|MNIST||784 (28 × 28 grayscale)||60K||10K|
|CIFAR-10||3072 (32 × 32 color)||50K||10K|
|Training time (s)||1653||805||1627||700||3042||1431|
In the experiments, we test our algorithms on data sets from different domains using 5 different random initializations. The data sets we used are listed in Table I. MNIST is a standard toy data set of handwritten digits; CIFAR-10 contains tiny natural images; Higgs Boson is a data set from high energy physics; Alternative Splicing consists of RNA features used for predicting alternative gene splicing. We use the DNN and DBN implementation from [23] and the CNN implementation from [24]. All experiments were conducted on a laptop with an Intel Core i5-3210M CPU at 2.50GHz, 4GB RAM, and Windows 7 64-bit OS.
V-A Results on MNIST
MNIST is a standard toy data set of handwritten digits containing 10 classes. It contains 60K training samples and 10K testing samples. The image size is $28 \times 28$ (grayscale). Figure 2(a) shows some examples from the MNIST dataset.
V-A1 Deep Neural Network
In the experiments, we first test several network architectures and pick the best one for further investigation. Figure 3(a) and Figure 3(b) show the testing and training classification error for different network settings. Results show that “” is a better setting, with lower testing error and faster convergence in training. We will use network “” for DNN and DBN on MNIST. The learning rate is set to 1; the activation function is the hyperbolic tangent function and the output unit is the sigmoid function.
Figure 5 shows the testing error and training error of standard DNN, sDNN (shrinking DNN) and sDNNr (shrinking DNN with recall). Results show that sDNNr improves on the accuracy of standard DNN. For training error, both DNN and sDNNr give almost 0 training error.
Figure 6 shows the training time and the number of active samples in each iteration (epoch). In our experiments, for sDNN and sDNNr, we set elimination rate . sDNNr has a recall process that uses the entire set of training samples, as shown in Figure 6. When the number of active samples is less than of total training samples, we stop eliminating samples. The speedup using sDNNr compared to DNN is
Recall is a technique in which, once the number of training samples has decreased to a threshold, we start to use all training samples again. There is a trade-off between speedup and classification error: setting a lower threshold could reduce computation time more, but could increase classification error. Figure 7 shows the effect of using different recall thresholds for sDNNr on MNIST data. When we bring all training samples back at , we get the best testing error. It is worth noting that the classification error of sDNNr is improved compared to standard DNN, which could imply that there is less overfitting for this data set.
V-A2 Deep Belief Network
Figure 9 shows the classification testing error and training time of Deep Belief Network (DBN) and shrinking DBN with recall (sDBNr) on MNIST. The network setting is the same as in the DNN experiment. sDBNr further reduces the classification error of DBN to .
V-A3 Convolution Neural Networks (CNN)
The network architecture used for MNIST has 4 convolutional layers: each of the first 2 convolutional layers is followed by a max-pooling layer, the 3rd layer is followed by a ReLU layer, and the 4th layer is followed by a softmax layer. The first 2 convolutional layers have receptive field applied with a stride of 1 pixel. The 3rd convolutional layer has receptive field and the 4th layer has receptive field with a stride of 1 pixel. The max pooling layers pool regions at strides of 2 pixels. Figure 10 shows the classification testing error and training time of CNN on MNIST data.
Table II summarizes the classification error improvement (IMP) and training time speedup of DNN, DBN and CNN on MNIST data, where improvement is .
|Testing error (top 1)||0.2070||0.2066|
|Training time (s)||5571||3565|
V-B Results on CIFAR-10
CIFAR-10 [25] contains 60,000 color images in 10 classes, with 6,000 images per class. There are 50,000 training and 10,000 testing images. CIFAR-10 is an object dataset that includes airplane, car, bird, cat and so on, and the classes are completely mutually exclusive. In our experiment, we use a CNN to evaluate the performance in terms of classification error. The network architecture uses 5 convolutional layers: each of the first three convolutional layers is followed by a max pooling layer; the 4th convolutional layer is followed by a ReLU layer; the 5th layer is followed by a softmax loss output layer. Table III shows the classification error and training time. Top-1 classification testing error in Table III means that the predicted label is determined by considering only the class with maximum probability.
|Training time (s)||21||13|
|Training time (s)||52||18|
|Training time (s)||32||20|
V-C Results on Higgs Boson
Higgs Boson is a subset of the data from [26] with training and testing. Each sample is a signal process that either produces Higgs boson particles or does not. We use 7 high-level features derived by physicists to help discriminate between the two classes. Both the activation function and the output function are sigmoid functions. The DNN batchsize is and recall threshold . We test different network settings and choose the best. Table IV shows the experiment results using different networks.
V-D Results on Alternative Splicing
Alternative Splicing [27] is a set of RNA sequences used in bioinformatics. It contains 3446 cassette-type mouse exons with 1389 features per exon. We randomly select 2500 exons for training and use the rest for testing. For each exon, the dataset contains three real-valued positive prediction targets , corresponding to the probabilities that the exon is more likely to be included in the given tissue, more likely to be excluded, or more likely to exhibit no change relative to other tissues. To demonstrate the effectiveness of the proposed shrinking Deep Learning with recall approach, we use simple DNNs with different numbers of layers and neurons, with the optimal tangent activation function and the sigmoid output function. We use the following average sum squared error criterion to evaluate the model performance , where is the predicted label vector, is the ground-truth label vector, and is the number of samples. The DNN batchsize is and recall threshold . We test different network settings and choose the best. Table V shows the experiment result.
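One plausible reading of the average sum squared error criterion is the squared Euclidean distance between the predicted and ground-truth label vectors, summed over targets and averaged over samples. The sketch below implements that reading; the exact normalization used in the experiments is an assumption, and the function name and the data are made up.

```python
def average_sum_squared_error(predictions, targets):
    """Mean over samples of the squared distance between label vectors."""
    if not predictions or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and aligned")
    total = 0.0
    for pred, true in zip(predictions, targets):
        total += sum((p - t) ** 2 for p, t in zip(pred, true))
    return total / len(predictions)


# Two hypothetical exons with three targets each
# (inclusion, exclusion, no change).
predictions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
targets = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(average_sum_squared_error(predictions, targets))
```

A lower value means the predicted probability vectors lie closer to the ground truth on average.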
In conclusion, we proposed shrinking Deep Learning with recall (sDLr), an approach whose main contribution is that it can reduce the running time significantly. Extensive experiments on 4 datasets show that shrinking Deep Learning with recall can reduce training time significantly while still giving competitive classification performance.
- [1] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
- [2] Y. Bengio, “Learning deep architectures for AI,” Foundations and Trends® in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
- [3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
- [4] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” The Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.
- [5] R. Salakhutdinov and G. E. Hinton, “Deep Boltzmann machines,” in International Conference on Artificial Intelligence and Statistics, 2009, pp. 448–455.
- [6] G. E. Hinton, “Training products of experts by minimizing contrastive divergence,” Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.
- [7] L. Deng, “Three classes of deep learning architectures and their applications: a tutorial survey,” APSIPA Transactions on Signal and Information Processing, 2012.
- [8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
- [9] Y. LeCun and Y. Bengio, “Convolutional networks for images, speech, and time series,” The Handbook of Brain Theory and Neural Networks, vol. 3361, no. 10, 1995.
- [10] T. Joachims, “Making large scale SVM learning practical,” Universität Dortmund, Tech. Rep., 1999.
- [11] J. Narasimhan, A. Vishnu, L. Holder, and A. Hoisie, “Fast support vector machines using parallel adaptive shrinking on distributed systems,” arXiv preprint arXiv:1406.5161, 2014.
- [12] J. Wang, J. Zhou, P. Wonka, and J. Ye, “Lasso screening rules via dual polytope projection,” in Advances in Neural Information Processing Systems, 2013, pp. 1070–1078.
- [13] A. Bonnefoy, V. Emiya, L. Ralaivola, and R. Gribonval, “A dynamic screening principle for the lasso,” in Signal Processing Conference (EUSIPCO), 2014 Proceedings of the 22nd European. IEEE, 2014, pp. 6–10.
- [14] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. H. Byers, “Big data: The next frontier for innovation, competition, and productivity,” 2011.
- [15] S. Zheng, X. Cai, C. H. Ding, F. Nie, and H. Huang, “A closed form solution to multi-view low-rank regression,” in AAAI, 2015, pp. 1973–1979.
- [16] S. Zheng and C. Ding, “Kernel alignment inspired linear discriminant analysis,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer Berlin Heidelberg, 2014, pp. 401–416.
- [17] D. Williams, S. Zheng, X. Zhang, and H. Jamjoom, “Tidewatch: Fingerprinting the cyclicality of big data workloads,” in IEEE INFOCOM 2014-IEEE Conference on Computer Communications. IEEE, 2014, pp. 2031–2039.
- [18] X. Zhang, Z.-Y. Shae, S. Zheng, and H. Jamjoom, “Virtual machine migration in an over-committed cloud,” in 2012 IEEE Network Operations and Management Symposium. IEEE, 2012, pp. 196–203.
- [19] S. Zheng, Z.-Y. Shae, X. Zhang, H. Jamjoom, and L. Fong, “Analysis and modeling of social influence in high performance computing workloads,” in European Conference on Parallel Processing. Springer Berlin Heidelberg, 2011, pp. 193–204.
- [20] J. A. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,” Neural Processing Letters, vol. 9, no. 3, pp. 293–300, 1999.
- [21] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.
- [22] E. S. Gardner, “Exponential smoothing: The state of the art,” Journal of Forecasting, vol. 4, no. 1, pp. 1–28, 1985.
- [23] R. B. Palm, “Prediction as a candidate for learning deep hierarchical models of data,” Technical University of Denmark, 2012.
- [24] A. Vedaldi and K. Lenc, “MatConvNet-convolutional neural networks for MATLAB,” arXiv preprint arXiv:1412.4564, 2014.
- [25] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” 2009.
- [26] P. Baldi, P. Sadowski, and D. Whiteson, “Searching for exotic particles in high-energy physics with deep learning,” Nature Communications, vol. 5, 2014.
- [27] H. Y. Xiong, Y. Barash, and B. J. Frey, “Bayesian prediction of tissue-regulated splicing using RNA sequence and cellular context,” Bioinformatics, vol. 27, no. 18, pp. 2554–2562, 2011.