is attracting much attention in the fields of visual object recognition, speech recognition, object detection, and many other domains. It provides automatic feature extraction and can achieve outstanding performance [3, 4].
Deep learning uses a very deep layered network and a huge amount of data, so overfitting is a serious problem. Regularization is used to avoid it. Hinton et al. proposed a regularization method called “dropout learning” for this purpose. Dropout learning consists of two processes. At learning time, some hidden units are neglected with a probability $p$, which reduces the effective network size. At test time, the outputs of the learned hidden units and of those not learned are summed and multiplied by $p$ to calculate the network output. We find that summing the learned and not-learned units multiplied by $p$ can be regarded as ensemble learning.
In this paper, we analyze dropout learning regarded as ensemble learning, the difference being that dropout learning uses different sets of hidden units during learning. On-line learning [7, 8] is used to train the network. We also analyze dropout learning regarded as an L2 regularizer.
In this paper, we use a teacher-student formulation and assume the existence of a teacher network (teacher) that produces the desired output for the student network (student). By introducing the teacher, we can directly measure the similarity of the student weight vector to that of the teacher. First, we formulate a teacher and a student, and then introduce the gradient descent algorithm.
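The teacher and student are soft committee machines with $N$ input units, hidden units, and an output, as shown in Fig. 1. The teacher consists of $K$ hidden units, and the student consists of $K'$ hidden units. Each hidden unit is a perceptron. The $k$th hidden weight vector of the teacher is $\boldsymbol{B}_k = (B_{k1}, \ldots, B_{kN})$, and the $k'$th hidden weight vector of the student is $\boldsymbol{J}_{k'}^{(m)} = (J_{k'1}^{(m)}, \ldots, J_{k'N}^{(m)})$, where $m$ denotes the learning iteration. In the soft committee machine, all hidden-to-output weights are fixed to be $+1$. This network calculates the majority vote of hidden outputs.
We assume that both the teacher and the student receive the $N$-dimensional input $\boldsymbol{\xi}^{(m)} = (\xi_1^{(m)}, \ldots, \xi_N^{(m)})$, that the teacher outputs $t^{(m)} = \sum_{k=1}^{K} g(d_k^{(m)})$, and that the student outputs $s^{(m)} = \sum_{k'=1}^{K'} g(x_{k'}^{(m)})$. Here, $g(\cdot)$ is the output function of a hidden unit, $d_k^{(m)}$ is the inner potential of the $k$th hidden unit of the teacher calculated using $d_k^{(m)} = \sum_{i=1}^{N} B_{ki}\,\xi_i^{(m)}$, and $x_{k'}^{(m)}$ is the inner potential of the $k'$th hidden unit of the student calculated using $x_{k'}^{(m)} = \sum_{i=1}^{N} J_{k'i}^{(m)}\,\xi_i^{(m)}$.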
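As a minimal sketch of the network output described above, the following code computes the output of a soft committee machine as the sum of its hidden-unit outputs. The function names are hypothetical, the choice $g(x) = \mathrm{erf}(x/\sqrt{2})$ is an assumption (the text only states that an error function is used later), and the hidden-to-output weights are assumed to be fixed to $+1$.

```python
import math

def g(x):
    """Hidden-unit output function; erf(x / sqrt(2)) is an assumed choice."""
    return math.erf(x / math.sqrt(2.0))

def inner_potential(w, xi):
    """Inner potential of one hidden unit: dot product of weights and input."""
    return sum(w_i * xi_i for w_i, xi_i in zip(w, xi))

def committee_output(weights, xi):
    """Soft committee machine output: plain sum of hidden outputs
    (hidden-to-output weights assumed fixed to +1)."""
    return sum(g(inner_potential(w, xi)) for w in weights)

# Toy teacher with two hidden units and a 3-dimensional input.
teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
xi = [0.5, -0.5, 2.0]
t = committee_output(teacher, xi)  # the two hidden outputs cancel here
```

The same function serves for the teacher and the student, since both share the architecture and the output function.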
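We assume that the $i$th elements $\xi_i^{(m)}$ of the independently drawn input $\boldsymbol{\xi}^{(m)}$ are uncorrelated random variables with zero mean and unit variance; that is, the $i$th element of the input is drawn from a probability distribution $P(\xi_i)$. The thermodynamic limit of $N \to \infty$ is also assumed. The statistics of the inputs in the thermodynamic limit are $\langle \xi_i^{(m)} \rangle = 0$, $\langle (\xi_i^{(m)})^2 \rangle = 1$, and $\langle \|\boldsymbol{\xi}^{(m)}\|^2 \rangle = N$, where $\langle \cdot \rangle$ denotes the average and $\|\cdot\|$ denotes the norm of a vector. Each element $B_{ki}$ of the teacher weight vectors is drawn from a probability distribution with zero mean and variance $1/N$. With the assumption of the thermodynamic limit, the statistics of the teacher weight vector are $\langle B_{ki} \rangle = 0$, $\langle (B_{ki})^2 \rangle = 1/N$, and $\langle \|\boldsymbol{B}_k\|^2 \rangle = 1$. This means that any two distinct teacher weight vectors are orthogonal: $\boldsymbol{B}_k \cdot \boldsymbol{B}_l = 0$ for $k \neq l$. The distribution of the inner potential $d_k^{(m)}$ follows a Gaussian distribution with zero mean and unit variance in the thermodynamic limit.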
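These statistics can be checked numerically at a finite input dimension. The sketch below is only an illustration: the Gaussian sampling distribution, the dimension `n = 500`, the seed, and all function names are assumptions, since the text specifies only the mean and variance of the elements.

```python
import random

def draw_weight_vector(n, rng):
    """Draw n elements with zero mean and variance 1/n (Gaussian assumed)."""
    std = (1.0 / n) ** 0.5
    return [rng.gauss(0.0, std) for _ in range(n)]

def draw_input(n, rng):
    """Draw n uncorrelated input elements with zero mean and unit variance."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def dot(a, b):
    return sum(a_i * b_i for a_i, b_i in zip(a, b))

rng = random.Random(0)
n = 500
b1 = draw_weight_vector(n, rng)
b2 = draw_weight_vector(n, rng)

squared_norm = dot(b1, b1)  # approaches 1 as n grows
overlap = dot(b1, b2)       # approaches 0 as n grows

# Empirical mean and variance of the inner potential over many inputs.
samples = [dot(b1, draw_input(n, rng)) for _ in range(1000)]
mean_d = sum(samples) / len(samples)
var_d = sum((d - mean_d) ** 2 for d in samples) / len(samples)
```

At finite `n` the values fluctuate around their limits, with deviations that shrink roughly as the inverse square root of `n`.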
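For the sake of analysis, we assume that each element of $\boldsymbol{J}_{k'}^{(0)}$, which is the initial value of the student vector $\boldsymbol{J}_{k'}^{(m)}$, is drawn from a probability distribution with zero mean and variance $1/N$. The statistics of the $k'$th hidden weight vector of the student are $\langle J_{k'i}^{(0)} \rangle = 0$, $\langle (J_{k'i}^{(0)})^2 \rangle = 1/N$, and $\langle \|\boldsymbol{J}_{k'}^{(0)}\|^2 \rangle = 1$ in the thermodynamic limit. This means that any two distinct initial student weight vectors are orthogonal: $\boldsymbol{J}_{k'}^{(0)} \cdot \boldsymbol{J}_{l'}^{(0)} = 0$ for $k' \neq l'$. The output function of the hidden units of the student is the same as that of the teacher. The statistics of the student weight vector at the $m$th iteration are $\langle J_{k'i}^{(m)} \rangle = 0$, $\langle (J_{k'i}^{(m)})^2 \rangle = Q_{k'k'}^{(m)}/N$, and $\langle \|\boldsymbol{J}_{k'}^{(m)}\|^2 \rangle = Q_{k'k'}^{(m)}$. Here, $Q_{k'k'}^{(m)} = \boldsymbol{J}_{k'}^{(m)} \cdot \boldsymbol{J}_{k'}^{(m)}$. The distribution of the inner potential $x_{k'}^{(m)}$ follows a Gaussian distribution with zero mean and variance $Q_{k'k'}^{(m)}$ in the thermodynamic limit.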
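Next, we introduce the stochastic gradient descent (SGD) algorithm for the soft committee machine. The generalization error is defined as the squared error $\varepsilon$ averaged over possible inputs:
$$\varepsilon_g^{(m)} = \langle \varepsilon^{(m)} \rangle = \frac{1}{2}\left\langle \left(t^{(m)} - s^{(m)}\right)^2 \right\rangle, \tag{1}$$
where $t^{(m)}$ and $s^{(m)}$ are the teacher and student outputs.
At each learning step $m$, a new uncorrelated input, $\boldsymbol{\xi}^{(m)}$, is presented, and the current hidden weight vector of the student $\boldsymbol{J}_{k'}^{(m)}$ is updated using
$$\boldsymbol{J}_{k'}^{(m+1)} = \boldsymbol{J}_{k'}^{(m)} + \frac{\eta}{N}\left(t^{(m)} - s^{(m)}\right) g'(x_{k'}^{(m)})\,\boldsymbol{\xi}^{(m)}, \tag{2}$$
where $\eta$ is the learning step size, $x_{k'}^{(m)}$ is the inner potential of the $k'$th hidden unit of the student, and $g'(x)$ is the derivative of the output function of the hidden unit $g(x)$.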
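The update rule can be sketched in a few lines of code. This is a hedged illustration and not the authors' implementation: it assumes the standard on-line gradient step for the squared error with step size `eta` scaled by the input dimension, the output function $g(x) = \mathrm{erf}(x/\sqrt{2})$, and hypothetical function names.

```python
import math

def g(x):
    """Assumed output function erf(x / sqrt(2))."""
    return math.erf(x / math.sqrt(2.0))

def g_prime(x):
    """Derivative of erf(x / sqrt(2))."""
    return math.sqrt(2.0 / math.pi) * math.exp(-x * x / 2.0)

def dot(a, b):
    return sum(a_i * b_i for a_i, b_i in zip(a, b))

def output(weights, xi):
    """Soft committee machine output (sum of hidden outputs)."""
    return sum(g(dot(w, xi)) for w in weights)

def sgd_step(student, teacher, xi, eta):
    """One on-line SGD step; returns the updated student weight vectors."""
    n = len(xi)
    err = output(teacher, xi) - output(student, xi)
    updated = []
    for j in student:
        scale = (eta / n) * err * g_prime(dot(j, xi))
        updated.append([j_i + scale * xi_i for j_i, xi_i in zip(j, xi)])
    return updated

teacher = [[1.0, 0.0, 0.0]]
student = [[0.0, 1.0, 0.0]]
xi = [1.0, -1.0, 0.5]
err_before = output(teacher, xi) - output(student, xi)
student = sgd_step(student, teacher, xi, eta=0.5)
err_after = output(teacher, xi) - output(student, xi)
```

For a sufficiently small step size, a single step reduces the error on the presented input; a student that already matches the teacher is left unchanged.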
On-line learning uses each new input only once; therefore, overfitting does not occur. To evaluate dropout learning in the on-line setting, a pre-selected set of inputs is used repeatedly in an on-line manner. In our experience, when the input dimension is , overfitting occurs for pre-selected inputs.
3 Dropout learning and ensemble learning
In this section, we compare dropout learning and ensemble learning regarded as a way of calculating network output.
3.1 Ensemble learning
Ensemble learning uses many learners (referred to as students) to achieve better performance. In ensemble learning, each student learns the teacher independently, and the outputs are averaged to calculate the ensemble output $s_{en}$:
$$s_{en} = \sum_{k_{en}=1}^{K_{en}} C_{k_{en}}\, s_{k_{en}}. \tag{3}$$
Here, $s_{k_{en}}$ is the output of the $k_{en}$th student, $C_{k_{en}}$ is a weight for averaging, and $K_{en}$ is the number of students.
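The averaging step can be illustrated with a short sketch. The function name and the toy student outputs are hypothetical, and uniform weights are assumed when none are given. With uniform weights, the squared error of the ensemble output never exceeds the average squared error of its members, which follows from the convexity of the squared error.

```python
def ensemble_output(student_outputs, weights=None):
    """Weighted average of student outputs; uniform weights by default."""
    k_en = len(student_outputs)
    if weights is None:
        weights = [1.0 / k_en] * k_en
    return sum(c * s for c, s in zip(weights, student_outputs))

# Toy example: four students approximating a teacher output of 1.0.
target = 1.0
outputs = [0.7, 1.2, 1.1, 0.8]

ensemble_error = (target - ensemble_output(outputs)) ** 2
mean_member_error = sum((target - s) ** 2 for s in outputs) / len(outputs)
```

In this toy example the individual errors partly cancel, so `ensemble_error` is much smaller than `mean_member_error`.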
Figure 3 shows computer simulation results. The teacher and the student each include two hidden units. The output function is the error function. In the figure, the horizontal axis is time $t = m/N$. Here, $m$ is the iteration number, and $N$ is the dimension of input units. The input dimension is , and inputs are used repeatedly. The vertical axis is the mean squared error (MSE) for the input data. Each element of the independently drawn input is an uncorrelated random variable with zero mean and unit variance. The target for each input is the teacher output. The teacher and the initial student weight vectors are set as described in Sec. 2. In the figure, “Single” is the result of using a single student, “m2” is the result of using an ensemble of two students, “m3” is that of an ensemble of three students, and “m4” is that of an ensemble of four students. As shown, the ensemble of four students outperformed the other cases.
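Next, we modify the ensemble learning. We divide the student (with $K'$ hidden units) into $K_{en}$ networks (See Fig. 3. Here, and ). These divided networks learn the teacher independently, and then we calculate the ensemble output $s_{en}$ by averaging their outputs as:
$$s_{en} = \frac{1}{K_{en}} \sum_{k_{en}=1}^{K_{en}} \; \sum_{k'=(k_{en}-1)K'/K_{en}+1}^{k_{en}K'/K_{en}} g(x_{k'}). \tag{4}$$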
3.2 Dropout learning
In this subsection, we introduce dropout learning. Dropout learning is used in deep learning to prevent overfitting. A small amount of data relative to the size of the network may cause overfitting. In the state of overfitting, the learning error (the error for the learning data) and the test error (the error obtained by cross-validation) diverge. Figure 4 shows the results of the SGD and of dropout learning. The soft committee machine was used for both the teacher and the student. was used as the output function . The input dimension is , the teacher had two hidden units, and the student had 100 hidden units. The input and its target are generated in the same way as for Fig. 3. The learning step size was set to , and inputs were used iteratively for learning. Figure 4(a) shows the learning curve of the SGD. In this setting, overfitting occurred. Figure 4(b) shows the learning curve of the SGD with dropout learning. The learning error was smaller than the test error; however, the difference between them was not as large as that for the SGD. Therefore, these results show that dropout learning suppresses overfitting.
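The learning equation of dropout learning for the soft committee machine can be written as follows:
$$\boldsymbol{J}_{k'}^{(m+1)} = \boldsymbol{J}_{k'}^{(m)} + \frac{\eta}{N}\left(t^{(m)} - \sum_{l' \in D^{(m)}} g(x_{l'}^{(m)})\right) g'(x_{k'}^{(m)})\,\boldsymbol{\xi}^{(m)}, \quad k' \in D^{(m)}. \tag{5}$$
Here, $D^{(m)}$ denotes the set of hidden units randomly selected with probability $p$ from all the hidden units at the $m$th iteration, $t^{(m)}$ is the teacher output, $x_{k'}^{(m)}$ is the inner potential of the $k'$th hidden unit of the student, and $\eta$ is the learning step size. The hidden units not in $D^{(m)}$ are not subject to learning. After learning, the student's output $s^{(m)}$ is calculated as the sum of the learned hidden outputs and those not learned, multiplied by $p$:
$$s^{(m)} = p\left\{\sum_{k' \in D^{(m)}} g(x_{k'}^{(m)}) + \sum_{k' \notin D^{(m)}} g(x_{k'}^{(m)})\right\}. \tag{6}$$
This equation can be regarded as the ensemble of a learned network (the first term) and a not-learned network (the second term) when the probability is $p = 0.5$. Equation (6) corresponds to Eq. (4) when the student is divided into two networks whose outputs are each weighted by $1/2$. However, the set of hidden units in $D^{(m)}$ is selected at random in every iteration. Thus, dropout learning can be regarded as ensemble learning performed with a different set of hidden units in every iteration. In contrast, the original ensemble learning averages over fixed sets of hidden units throughout the learning. This difference may cause the difference in performance between dropout learning and ensemble learning.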
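The two phases of dropout learning can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: each hidden unit is assumed to be selected independently with probability `p`, the error during learning is assumed to be computed from the selected units only, the output function is assumed to be $g(x) = \mathrm{erf}(x/\sqrt{2})$, and all names are hypothetical.

```python
import math
import random

def g(x):
    """Assumed output function erf(x / sqrt(2))."""
    return math.erf(x / math.sqrt(2.0))

def g_prime(x):
    """Derivative of erf(x / sqrt(2))."""
    return math.sqrt(2.0 / math.pi) * math.exp(-x * x / 2.0)

def dot(a, b):
    return sum(a_i * b_i for a_i, b_i in zip(a, b))

def dropout_step(student, teacher, xi, eta, p, rng):
    """One dropout learning step: only randomly selected units are updated."""
    n = len(xi)
    selected = [k for k in range(len(student)) if rng.random() < p]
    x = [dot(j, xi) for j in student]
    t = sum(g(dot(b, xi)) for b in teacher)
    # Error uses only the outputs of the selected hidden units.
    err = t - sum(g(x[k]) for k in selected)
    updated = [list(j) for j in student]
    for k in selected:
        scale = (eta / n) * err * g_prime(x[k])
        updated[k] = [j_i + scale * xi_i for j_i, xi_i in zip(student[k], xi)]
    return updated, selected

def dropout_test_output(student, xi, p):
    """Test-time output: sum over all hidden units, multiplied by p."""
    return p * sum(g(dot(j, xi)) for j in student)

rng = random.Random(1)
teacher = [[1.0, 0.0, 0.0]]
student = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]
xi = [1.0, -1.0, 0.5]
new_student, selected = dropout_step(student, teacher, xi, 0.5, 0.5, rng)
```

Because `selected` is redrawn on every call, repeated calls train a different sub-network each time, which is the point of contrast with an ensemble of fixed sub-networks.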
4.1 Comparison between dropout learning and ensemble learning
In this section, the error function is used as the output function . We compared dropout learning and ensemble learning. For ensemble learning, we used two soft committee machines with 50 hidden units each. For dropout learning, we used one soft committee machine with 100 hidden units. We set $p = 0.5$; dropout learning then selected 50 hidden units in the selected set $D^{(m)}$, with 50 unselected hidden units remaining. Therefore, dropout learning and ensemble learning had the same architecture. The input dimension is , and the learning step size was set to . The input and its target are generated in the same way as for Fig. 3. inputs were used iteratively for learning. Figure 5 shows the results. The horizontal axis is time $t = m/N$, and the vertical axis is the MSE calculated for the input data. In Fig. 5(a), “single” shows the results of a soft committee machine with 50 hidden units, and “ensemble” shows the results given by ensemble learning. Test errors are used in these figures. In Fig. 5(b), “test” shows the MSE for the test data, and “learn” shows the MSE for the learning data. The results are averages over 10 trials. As shown in Fig. 5(a), ensemble learning achieved an MSE smaller than that of the single network. However, dropout learning achieved an MSE smaller than that of ensemble learning. Therefore, ensemble learning using a different set of hidden units in every iteration (that is, dropout learning) performs better than ensemble learning using the same set of hidden units throughout the learning. Note that even though dropout learning used a network with more hidden units than those used in ensemble learning, overfitting did not occur. Therefore, in the next subsection, we compare dropout learning with the SGD with L2 regularization.
4.2 Comparison between dropout learning and SGD with L2 regularization
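The SGD with L2 regularization is given by the following learning equation:
$$\boldsymbol{J}_{k'}^{(m+1)} = \boldsymbol{J}_{k'}^{(m)} + \frac{\eta}{N}\left\{\left(t^{(m)} - s^{(m)}\right) g'(x_{k'}^{(m)})\,\boldsymbol{\xi}^{(m)} - \alpha\,\boldsymbol{J}_{k'}^{(m)}\right\}. \tag{7}$$
Here, $\alpha$ is the coefficient of the L2 penalty.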
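A minimal sketch of one such update step is given below. It assumes a common weight-decay form in which the penalty term `alpha * J` is subtracted inside the same `eta / n` scaling as the gradient term; this scaling, the output function $g(x) = \mathrm{erf}(x/\sqrt{2})$, and the function names are assumptions made for illustration.

```python
import math

def g(x):
    """Assumed output function erf(x / sqrt(2))."""
    return math.erf(x / math.sqrt(2.0))

def g_prime(x):
    """Derivative of erf(x / sqrt(2))."""
    return math.sqrt(2.0 / math.pi) * math.exp(-x * x / 2.0)

def dot(a, b):
    return sum(a_i * b_i for a_i, b_i in zip(a, b))

def output(weights, xi):
    """Soft committee machine output (sum of hidden outputs)."""
    return sum(g(dot(w, xi)) for w in weights)

def sgd_l2_step(student, teacher, xi, eta, alpha):
    """One SGD step with an L2 penalty (weight decay) of strength alpha."""
    n = len(xi)
    err = output(teacher, xi) - output(student, xi)
    updated = []
    for j in student:
        scale = err * g_prime(dot(j, xi))
        updated.append([j_i + (eta / n) * (scale * xi_i - alpha * j_i)
                        for j_i, xi_i in zip(j, xi)])
    return updated

teacher = [[1.0, 0.0, 0.0]]
xi = [1.0, -1.0, 0.5]
# With zero error, only the decay term acts: weights shrink by 1 - eta*alpha/n.
shrunk = sgd_l2_step(teacher, teacher, xi, eta=0.3, alpha=0.1)
```

Setting `alpha = 0` recovers the plain SGD step, so the penalty strength is the only additional parameter to tune.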
Comparing Fig. 6 with Fig. 5(b), the residual error of dropout learning was almost the same as that of the SGD with L2 regularization. Therefore, the regularization effect of dropout learning is comparable to that of L2 regularization. Note that for the SGD with L2 regularization, we must choose the penalty coefficient by trial and error, whereas dropout learning has no such tuning parameter.
In this paper, we analyzed dropout learning regarded as ensemble learning. In ensemble learning, we divide the network into several sub-networks and train each sub-network independently. After learning, the ensemble output is calculated as the average of the sub-network outputs. We showed that dropout learning can be regarded as ensemble learning, except that it uses a different set of hidden units in every learning iteration. Using a different set of hidden units in every iteration outperforms ensemble learning with fixed sets. We also showed that dropout learning achieves almost the same performance as the L2 regularizer. Our future work is the theoretical analysis of dropout learning with the ReLU activation function.
The authors thank Dr. Masato Okada and Dr. Hideitsu Hino for insightful discussions.
- G. E. Hinton, S. Osindero, and Y. Teh, “A fast learning algorithm for deep belief nets”, Neural Computation, vol. 18, pp. 1527–1554 (2006).
- Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning”, Nature, vol. 521, pp. 436–444 (2015).
- L. Deng, J. Li, et al., “Recent advances in deep learning for speech research at Microsoft”, ICASSP (2013).
- G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors”, The Computing Research Repository (CoRR), vol. abs/1207.0580 (2012).
- K. Hara and M. Okada, “Ensemble Learning of Linear Perceptrons: On-Line Learning Theory”, Journal of the Physical Society of Japan, vol. 74, no. 11, pp. 2966–2972 (2005).
- M. Biehl and H. Schwarze, “Learning by on-line gradient descent”, Journal of Physics A: Mathematical and General, vol. 28, pp. 643–656 (1995).
- D. Saad and S. A. Solla, “On-line learning in soft-committee machines”, Physical Review E, vol. 52, pp. 4225–4243 (1995).
- S. Wager, S. Wang, and P. Liang, “Dropout Training as Adaptive Regularization”, Advances in Neural Information Processing Systems 26 (2013).
- C. M. Bishop, Pattern Recognition and Machine Learning, Springer (2006).