Analysis of dropout learning regarded as ensemble learning

06/20/2017 · by Kazuyuki Hara, et al.

Deep learning is the state of the art in fields such as visual object recognition and speech recognition. It uses networks with many layers, a huge number of units, and a huge number of connections, so overfitting is a serious problem. To avoid this problem, dropout learning has been proposed. Dropout learning neglects some inputs and hidden units in the learning process with a probability $p$, and then the neglected inputs and hidden units are combined with the learned network to express the final output. We find that the process of combining the neglected hidden units with the learned network can be regarded as ensemble learning, so we analyze dropout learning from this point of view.


1 Introduction

Deep learning [1, 2] is attracting much attention in the fields of visual object recognition, speech recognition, object detection, and many other domains. It provides automatic feature extraction and has the ability to achieve outstanding performance [3, 4].

Deep learning uses a very deep layered network and a huge amount of data, so overfitting is a serious problem. To avoid overfitting, regularization is used. Hinton et al. proposed a regularization method called “dropout learning” [5] for this purpose. Dropout learning consists of two processes. At learning time, some hidden units are neglected with a probability $p$, and this process reduces the network size. At test time, the learned hidden units and those not learned are summed up and multiplied by $(1-p)$ to calculate the network output. We find that summing up the learned and not-learned units multiplied by $(1-p)$ can be regarded as ensemble learning.

In this paper, we analyze dropout learning regarded as ensemble learning [6]. On-line learning [7, 8] is used to train the network. We show that dropout learning can be regarded as ensemble learning, except that dropout learning uses a different set of hidden units at every iteration. We also analyze dropout learning regarded as an L2 regularizer [9].

2 Model

In this paper, we use a teacher-student formulation and assume the existence of a teacher network (teacher) that produces the desired output for the student network (student). By introducing the teacher, we can directly measure the similarity of the student weight vector to that of the teacher. First, we formulate a teacher and a student, and then introduce the gradient descent algorithm.

The teacher and student are soft committee machines with $N$ input units, hidden units, and a single output, as shown in Fig. 1. The teacher consists of $K$ hidden units, and the student consists of $K'$ hidden units. Each hidden unit is a perceptron. The $k$th hidden weight vector of the teacher is $\mathbf{B}_k$, and the $k'$th hidden weight vector of the student is $\mathbf{J}_{k'}^{m}$, where $m$ denotes the learning iteration. In the soft committee machine, all hidden-to-output weights are fixed to $+1$ [8]. This network calculates the majority vote of the hidden outputs.

Figure 1: Network structures of teacher and student

We assume that both the teacher and the student receive the $N$-dimensional input $\boldsymbol{\xi}^{m}$, that the teacher outputs $t^{m} = \sum_{k=1}^{K} g(d_k^{m})$, and that the student outputs $s^{m} = \sum_{k'=1}^{K'} g(y_{k'}^{m})$. Here, $g(\cdot)$ is the output function of a hidden unit, $d_k^{m}$ is the inner potential of the $k$th hidden unit of the teacher, calculated using $d_k^{m} = \mathbf{B}_k \cdot \boldsymbol{\xi}^{m}$, and $y_{k'}^{m}$ is the inner potential of the $k'$th hidden unit of the student, calculated using $y_{k'}^{m} = \mathbf{J}_{k'}^{m} \cdot \boldsymbol{\xi}^{m}$.
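
To make the forward computation concrete, here is a minimal NumPy sketch (not the authors' code) of a soft committee machine output; the error-function activation $g(x)=\mathrm{erf}(x/\sqrt{2})$ is an assumption, anticipating the simulations in Sec. 3.

```python
import numpy as np
from scipy.special import erf


def g(x):
    """Output function of a hidden unit; erf(x / sqrt(2)) is assumed here."""
    return erf(x / np.sqrt(2.0))


def soft_committee_output(W, xi):
    """Output of a soft committee machine: sum of g(w . xi) over hidden units,
    with all hidden-to-output weights fixed to +1.

    W  : (K, N) array, one hidden weight vector per row (B_k or J_k')
    xi : (N,) input vector
    """
    inner_potentials = W @ xi            # d_k = B_k . xi (teacher) or y_k' = J_k' . xi (student)
    return np.sum(g(inner_potentials))
```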

We assume that the elements $\xi_i^{m}$ of the independently drawn input $\boldsymbol{\xi}^{m}$ are uncorrelated random variables with zero mean and unit variance; that is, the $i$th element of the input is drawn from a probability distribution $P(\xi_i)$. The thermodynamic limit $N \to \infty$ is also assumed. The statistics of the inputs in the thermodynamic limit are $\langle \xi_i \rangle = 0$, $\langle (\xi_i)^2 \rangle = 1$, and $\|\boldsymbol{\xi}\| = \sqrt{N}$, where $\langle \cdot \rangle$ denotes the average and $\|\cdot\|$ denotes the norm of a vector. Each element $B_{ki}$ of the teacher weight vector is drawn from a probability distribution with zero mean and $1/N$ variance. With the assumption of the thermodynamic limit, the statistics of the teacher weight vector are $\langle B_{ki} \rangle = 0$, $\langle (B_{ki})^2 \rangle = 1/N$, and $\|\mathbf{B}_k\| = 1$. This means that any combination of teacher weight vectors satisfies $\mathbf{B}_k \cdot \mathbf{B}_{k''} = 0$ for $k \neq k''$. The distribution of the inner potential $d_k^{m}$ follows a Gaussian distribution with zero mean and unit variance in the thermodynamic limit.

For the sake of analysis, we assume that each element of $\mathbf{J}_{k'}^{0}$, which is the initial value of the student vector $\mathbf{J}_{k'}$, is drawn from a probability distribution with zero mean and $1/N$ variance. The statistics of the $k'$th initial hidden weight vector of the student are $\langle J_{k'i}^{0} \rangle = 0$, $\langle (J_{k'i}^{0})^2 \rangle = 1/N$, and $\|\mathbf{J}_{k'}^{0}\| = 1$ in the thermodynamic limit. This means that any combination of initial student weight vectors satisfies $\mathbf{J}_{k'}^{0} \cdot \mathbf{J}_{k''}^{0} = 0$ for $k' \neq k''$. The output function of the hidden units of the student is the same as that of the teacher. The statistics of the student weight vector at the $m$th iteration are $\langle J_{k'i}^{m} \rangle = 0$, $\langle (J_{k'i}^{m})^2 \rangle = Q_{k'k'}^{m}/N$, and $\|\mathbf{J}_{k'}^{m}\| = \sqrt{Q_{k'k'}^{m}}$. Here, $Q_{k'k'}^{m} = \mathbf{J}_{k'}^{m} \cdot \mathbf{J}_{k'}^{m}$. The distribution of the inner potential $y_{k'}^{m}$ follows a Gaussian distribution with zero mean and $Q_{k'k'}^{m}$ variance in the thermodynamic limit.
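
The sampling assumptions above can be mimicked with the following sketch; drawing the elements from a Gaussian is our choice (the text only fixes the first and second moments), and the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 1000                      # input dimension (illustrative; the analysis assumes large N)
K, K_student = 2, 2           # numbers of teacher / student hidden units (illustrative)

# Each weight element has zero mean and variance 1/N, so each weight vector has norm ~ 1.
B = rng.normal(0.0, 1.0 / np.sqrt(N), size=(K, N))          # teacher weight vectors B_k
J = rng.normal(0.0, 1.0 / np.sqrt(N), size=(K_student, N))  # initial student vectors J^0_k'

# Each input element has zero mean and unit variance, so ||xi|| ~ sqrt(N) and the
# teacher's inner potentials d_k = B_k . xi are approximately N(0, 1) for large N.
xi = rng.normal(0.0, 1.0, size=N)
```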

Next, we introduce the stochastic gradient descent (SGD) algorithm for the soft committee machine. The generalization error is defined as the squared error $\varepsilon = \frac{1}{2}(s - t)^2$ averaged over possible inputs:

$\varepsilon_g = \langle \varepsilon \rangle = \left\langle \frac{1}{2} \left( \sum_{k'=1}^{K'} g(y_{k'}) - \sum_{k=1}^{K} g(d_k) \right)^2 \right\rangle. \qquad (1)$

At each learning step $m$, a new uncorrelated input $\boldsymbol{\xi}^{m}$ is presented, and the current hidden weight vector of the student is updated using

$\mathbf{J}_{k'}^{m+1} = \mathbf{J}_{k'}^{m} + \frac{\eta}{N} \left( t^{m} - s^{m} \right) g'(y_{k'}^{m}) \, \boldsymbol{\xi}^{m}, \qquad (2)$

where $\eta$ is the learning step size and $g'(\cdot)$ is the derivative of the output function $g(\cdot)$ of the hidden unit.
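
A minimal sketch of one update of Eq. (2), under the same assumptions as the earlier snippets; `g_prime` is the derivative of the assumed error-function activation.

```python
import numpy as np
from scipy.special import erf


def g(x):
    return erf(x / np.sqrt(2.0))


def g_prime(x):
    return np.sqrt(2.0 / np.pi) * np.exp(-0.5 * x ** 2)


def sgd_step(J, B, xi, eta):
    """One on-line gradient step of Eq. (2), applied to all student hidden units at once."""
    N = xi.size
    t = np.sum(g(B @ xi))        # teacher output t^m
    y = J @ xi                   # student inner potentials y^m_k'
    s = np.sum(g(y))             # student output s^m
    # J^{m+1}_k' = J^m_k' + (eta / N) * (t - s) * g'(y_k') * xi
    return J + (eta / N) * (t - s) * g_prime(y)[:, None] * xi[None, :]
```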

On-line learning uses each new input only once; therefore, overfitting does not occur. To evaluate dropout learning within the on-line learning framework, we instead reuse a pre-selected set of inputs repeatedly in an on-line manner. From our experience, overfitting then occurs when the number of pre-selected inputs is small compared with the input dimension $N$.

3 Dropout learning and ensemble learning

In this section, we compare dropout learning and ensemble learning from the viewpoint of how the network output is calculated.

3.1 Ensemble learning

Ensemble learning is performed by using many learners (referred to as students) to achieve better performance [6]. In ensemble learning, each student learns the teacher independently, and the outputs are averaged to calculate the ensemble output:

$s_{\mathrm{ens}} = \sum_{i=1}^{M} C_i \, s_i. \qquad (3)$

Here, $C_i$ is a weight for averaging, and $M$ is the number of students.
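
As a sketch of Eq. (3), assuming equal averaging weights $C_i = 1/M$ (the simplest choice consistent with averaging the outputs) and reusing the `soft_committee_output` helper from Sec. 2:

```python
def ensemble_output(students, xi):
    """Ensemble output of Eq. (3): average of M independently trained students,
    using equal weights C_i = 1 / M (an assumption)."""
    M = len(students)
    return sum(soft_committee_output(J, xi) for J in students) / M
```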

Figure 2 shows computer simulation results. The teacher and student include two hidden units. The output function is the error function $g(x) = \mathrm{erf}(x/\sqrt{2})$. In the figure, the horizontal axis is time $t = m/N$, where $m$ is the iteration number and $N$ is the dimension of the input. A fixed, pre-selected set of inputs is reused throughout learning. The vertical axis is the mean squared error (MSE) for the input data. Each element of the independently drawn inputs is an uncorrelated random variable with zero mean and unit variance. The target for each input is the teacher output. The teacher and the initial student weight vectors are set as described in Sec. 2. In the figure, “Single” is the result of using a single student, “m2” is the result of using an ensemble of two students, “m3” is that of an ensemble of three students, and “m4” is that of an ensemble of four students. As shown, the ensemble of four students outperformed the other cases.

Figure 2: Effect of ensemble learning
Figure 3: Network divided into two networks to apply ensemble learning

Next, we modify the ensemble learning. We divide the student (with $K'$ hidden units) into $M$ networks (see Fig. 3, where the student is divided into $M = 2$ networks). These divided networks learn the teacher independently, and then we calculate the ensemble output by averaging their outputs as

$s_{\mathrm{ens}} = \frac{1}{M} \sum_{i=1}^{M} s_i = \frac{1}{M} \sum_{i=1}^{M} \sum_{k'=1}^{K'/M} g(y_{ik'}). \qquad (4)$

Here, $s_i$ is the output of a divided network with $K'/M$ hidden units, and $g(y_{ik'})$ is the $k'$th hidden output in the $i$th divided network. Eq. (4) corresponds to Eq. (3) when each divided network is regarded as a student and $C_i = 1/M$.
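
Under the same assumptions, Eq. (4) amounts to splitting the student's hidden units into $M$ groups and averaging the group outputs, as in this sketch (again reusing `soft_committee_output`):

```python
import numpy as np


def divided_ensemble_output(J, xi, M):
    """Eq. (4): split the student's K' hidden weight vectors into M sub-networks
    (each trained independently in the text) and average their outputs."""
    sub_networks = np.array_split(J, M, axis=0)   # each sub-network has K'/M hidden units
    return np.mean([soft_committee_output(J_i, xi) for J_i in sub_networks])
```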

3.2 Dropout learning

In this subsection, we introduce dropout learning [5]. Dropout learning is used in deep learning to prevent overfitting. A small amount of data compared with the size of a network may cause overfitting [10]. In the state of overfitting, the learning error (the error on the learning data) and the test error (the error estimated by cross-validation) become very different. Figure 4 shows the result of the SGD and that of dropout learning. The soft committee machine was used for both the teacher and the student, and the error function was used as the output function $g$. The teacher had two hidden units, and the student had 100 hidden units. The input and its target were generated as those of Fig. 2. The learning step size was set to $\eta$, and a pre-selected set of inputs was used iteratively for learning. Figure 4(a) shows the learning curve of the SGD. In this setting, overfitting occurred. Figure 4(b) shows the learning curve of the SGD with dropout learning. The learning error was smaller than the test error; however, the difference between the learning error and the test error was not as large as that of the SGD. Therefore, these results show that dropout learning prevents overfitting.

Figure 4: Effect of dropout. (a) is learning curve of SGD, and (b) is that of dropout learning.

The learning equation of dropout learning for the soft committee machine can be written as

$\mathbf{J}_{k'}^{m+1} = \mathbf{J}_{k'}^{m} + \frac{\eta}{N} \left( t^{m} - \sum_{k'' \notin D(m)} g(y_{k''}^{m}) \right) g'(y_{k'}^{m}) \, \boldsymbol{\xi}^{m}, \quad k' \notin D(m). \qquad (5)$

Here, $D(m)$ denotes a set of hidden units that is randomly selected with probability $p$ from all the hidden units at the $m$th iteration. The hidden units in $D(m)$ are not subject to learning. After the learning, the student’s output is calculated as the sum of the learned hidden outputs and those not learned, multiplied by $(1-p)$:

$s^{m} = (1-p) \left( \sum_{k' \notin D(m)} g(y_{k'}^{m}) + \sum_{k' \in D(m)} g(y_{k'}^{m}) \right). \qquad (6)$

This equation can be regarded as the ensemble of a learned network (the first term) and a not-learned network (the second term) when the probability is $p = 1/2$; Eq. (6) then corresponds to Eq. (4) with $M = 2$. However, the set of hidden units in $D(m)$ is selected at random at every iteration, so dropout learning is regarded as ensemble learning performed with a different set of hidden units at every iteration. By contrast, the original ensemble learning averages a fixed set of hidden units throughout the learning. This difference may cause the difference in performance between dropout learning and ensemble learning.
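
The following sketch contrasts the two steps, reusing `g` and `g_prime` from the earlier snippets: Eq. (5) updates only the hidden units outside $D(m)$, which is redrawn at every iteration, and Eq. (6) combines all hidden units at test time. Scaling by the keep rate $1-p$ is our reading of the text (with $p$ the drop probability); for the $p = 1/2$ setting used in the experiments it reduces to averaging the two halves of the network.

```python
import numpy as np


def dropout_sgd_step(J, B, xi, eta, p, rng):
    """One dropout update (Eq. (5)): units in D(m), drawn with probability p,
    are excluded from both the thinned output and the weight update."""
    keep = rng.random(J.shape[0]) >= p           # True for k' not in D(m)
    t = np.sum(g(B @ xi))                        # teacher output
    y = J @ xi
    s_thinned = np.sum(g(y[keep]))               # output of the surviving hidden units only
    J_new = J.copy()
    J_new[keep] += (eta / xi.size) * (t - s_thinned) * g_prime(y[keep])[:, None] * xi[None, :]
    return J_new


def dropout_test_output(J, xi, p):
    """Test-time output (Eq. (6)): sum over all hidden units, scaled by the keep rate 1 - p."""
    return (1.0 - p) * np.sum(g(J @ xi))
```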

4 Results

4.1 Comparison between dropout learning and ensemble learning

In this section, the error function $g(x) = \mathrm{erf}(x/\sqrt{2})$ is used as the output function. We compared dropout learning and ensemble learning. We used two soft committee machines with 50 hidden units each for ensemble learning. For dropout learning, we used one soft committee machine with 100 hidden units. We set $p = 0.5$; then, dropout learning selected 50 hidden units for $D(m)$, with 50 unselected hidden units remaining. Therefore, dropout learning and ensemble learning had the same architecture. The learning step size was set to $\eta$, the input and its target were generated as those of Fig. 2, and a pre-selected set of inputs was used iteratively for learning. Figure 5 shows the results. The horizontal axis is time $t = m/N$, and the vertical axis is the MSE calculated for the input data. In Fig. 5(a), “single” shows the soft committee machine with 50 hidden units, and “ensemble” shows the result given by ensemble learning; test errors are used in these curves. In Fig. 5(b), “test” shows the MSE given by the test data, and “learn” shows the MSE given by the learning data. The results are averages over 10 trials. As shown in Fig. 5(a), ensemble learning achieved a smaller MSE than the single network. However, dropout learning achieved an MSE smaller than that of ensemble learning. Therefore, ensemble learning using a different set of hidden units at every iteration (that is, dropout) performs better than using the same set of hidden units throughout the learning. Note that even though dropout learning used more hidden units per network than ensemble learning, overfitting did not occur. Therefore, in the next subsection, we compare dropout learning with the SGD with L2 regularization.

Figure 5: Results of comparison between dropout learning and ensemble learning. (a) is ensemble learning of two networks, and (b) is dropout learning with $p = 0.5$.

4.2 Comparison between dropout learning and SGD with L2 regularization

The SGD with L2 regularization is given by the following learning equation:

$\mathbf{J}_{k'}^{m+1} = \mathbf{J}_{k'}^{m} + \frac{\eta}{N} \left\{ \left( t^{m} - s^{m} \right) g'(y_{k'}^{m}) \, \boldsymbol{\xi}^{m} - \lambda \, \mathbf{J}_{k'}^{m} \right\}. \qquad (7)$

Here, $\lambda$ is the coefficient of the L2 penalty.
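
A sketch of one step of Eq. (7) as reconstructed above, reusing `g` and `g_prime` from the earlier snippets; placing the weight-decay term $\lambda \mathbf{J}$ inside the $\eta/N$ scaling is an assumption.

```python
def sgd_l2_step(J, B, xi, eta, lam):
    """One SGD step with an L2 (weight-decay) penalty, following Eq. (7) as reconstructed."""
    N = xi.size
    t = np.sum(g(B @ xi))                                  # teacher output
    y = J @ xi
    s = np.sum(g(y))                                       # student output
    error_term = (t - s) * g_prime(y)[:, None] * xi[None, :]
    return J + (eta / N) * (error_term - lam * J)          # lam * J is the L2 penalty term
```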

In Fig. 6, we show the learning results of the SGD with L2 regularization. The results are averages over 10 trials. The conditions were the same as those of Fig. 5.

Figure 6: Learning curve of SGD with L2 regularization

Comparing Fig. 6 with Fig. 5(b), the residual error of dropout learning was almost the same as that of the SGD with L2 regularization. Therefore, the regularization effect of dropout learning is the same as that of L2 regularization. Note that for the SGD with L2 regularization, we must choose $\lambda$ through trial and error; dropout learning, in contrast, has no such tuning parameter.

5 Conclusion

In this paper, we analyzed dropout learning regarded as ensemble learning. In ensemble learning, we divide the network into several sub-networks and then train each sub-network independently. After the learning, the ensemble output is calculated as the average of the sub-network outputs. We showed that dropout learning can be regarded as ensemble learning, except that a different set of hidden units is used at every learning iteration. Using a different set of hidden units outperforms the original ensemble learning. We also showed that dropout learning achieves the same performance as the L2 regularizer. Our future work is the theoretical analysis of dropout learning with the ReLU activation function.

Acknowledgments

The authors thank Dr. Masato Okada and Dr. Hideitsu Hino for insightful discussions.

References