The main idea behind this paper is to use insights from behavioral psychology to improve training methods of artificial neural networks. Behavioral psychology focuses on how humans and animals behave and how behavior connects to learning and growth. By bringing this insight into machine learning, it may be possible to create methods that train an artificial neural network in a similar manner to how humans and animals are taught, possibly resulting in better speed and performance of the neural network.
This paper proposes such a method: dual variable learning rates, or DVLR. DVLR uses two learning rates to place different emphasis on correct and incorrect responses and thus propagate more specific feedback to the network. The learning rates are updated with a variable rate of change based on the performance of the network so that feedback can be used most efficiently. This novel training technique was tested on the MNIST and CIFAR-10 databases and was found to achieve faster training and improved accuracy.
This paper begins with the behavioral psychology foundation for the DVLR method in the Background section. In the Method section, the specific differences between DVLR and backpropagation are discussed. The Baselines and Models section presents the experimental setup, and the Results section analyzes how the method performed on the MNIST and CIFAR-10 databases. The Related Work section reviews similar work in machine learning. Finally, directions for future work are outlined in the Discussion and Future Work section.
2 Background
Behavioral psychology focuses on how the subject learns by observing its behavior instead of attempting to explain a subject’s thought process. Through experiments, behavioral psychologists can identify what the subject is capable of learning, and the best ways to facilitate or inhibit that learning. Learning, in their context, is an enduring change in the mechanisms of behavior with specific stimuli and responses that result from prior experience. This focus on behavior instead of thought is key to the methodology in this paper. Computer scientists do not fully understand why a neural network produces the responses it does, especially as networks become more complicated. As such, it is difficult to determine what needs to change in the neural network to increase accuracy. Behavioral psychologists have found that by using theories of learning and observing behavior, it is not necessary to understand how a subject is thinking to understand the subject’s current knowledge and to facilitate knowledge growth. Using this idea, DVLR attempts to use a network’s behavior to increase its learning and accuracy.
This idea of creating theories by observing behavior can be traced to psychologists Edward Thorndike and B.F. Skinner. Thorndike studied animal intelligence using puzzle boxes and, through this research, discovered the Law of Effect (Thorndike & Bruce, 2000). He determined that every response or change in response of an animal is the result of an interaction with the environment. Thorndike rejected randomness in animal actions, and determined that they must be able to form associations just as humans do. The Law of Effect states that the satisfaction or dissatisfaction that the animal receives from an action it performs directly determines if the animal will perform that action again. If the result of an action is favorable, the animal is more likely to perform it; if the result of an action is unfavorable, the animal is less likely to perform it. By providing both favorable and unfavorable feedback to an animal subject, it is possible to teach it to perform or not perform certain actions. Thorndike also determined that neurons must modify their synapses under the same law. Neurons strengthen the synapses that are favorable to the neuron’s life processes and weaken the synapses that are a hindrance to its life processes. This observation is part of the motivation for the DVLR method in this paper.
Skinner studied how subjects perform with reinforcement over time, and how various schedules affected the subject’s performance (Sherrick et al., 1959). Through extensive experimentation, Skinner defined four different types of schedules: fixed ratio, variable ratio, fixed interval, and variable interval. DVLR uses a method based on the variable ratio (VR) schedule of reinforcement, where the subject is reinforced after a variable number of responses. This choice is due to Skinner’s conclusion that VR schedules led to the subjects performing tasks the fastest and the longest without pause. Similarly, in neural networks, the goal is to create the most efficient networks in the least amount of time to solve a specific problem. Building on Skinner’s schedules, the learning rates in DVLR are updated over time to change the amount of emphasis a correct or incorrect response has on the network. The emphasis changes with a variable ratio schedule that is dependent on the number of correct or incorrect responses the network generates. Using ideas from the Law of Effect and VR schedules of reinforcement, DVLR implements dual learning rates on variable schedules, as will be discussed next.
3 Method
DVLR is an extension of the standard gradient descent update method in neural networks. There are two key changes that will be discussed in detail: dual learning rates and learning rate updates.
3.1 Dual Learning Rates
In DVLR, two learning rates are used: one for correct responses and one for incorrect responses. By splitting up the correct and incorrect responses, it is possible to provide different amounts of feedback to the network based on whether its response is favorable or unfavorable. The hypothesis of this method is that by providing more emphasis for the incorrect responses over time and less emphasis for the correct responses over time, the network will have access to more efficient feedback and, in turn, will learn the ideal weight values faster. Additionally, as the experiment runs, there will be more emphasis on the incorrect responses, and the network might discover key nuances in the data that were not previously obtainable. Thus, the network should have increased speed and accuracy with the DVLR method.
To make the dual learning rate implementation practical, batching was used, where batched responses are a mixture of correct and incorrect responses. Theoretically, the correct or incorrect learning rate would be determined for each response, but this approach is computationally expensive and does not provide a major advantage. Instead, if the majority of responses in a batch are correct, the correct-response learning rate is used, and if the majority of responses in a batch are incorrect, the incorrect-response learning rate is used. In preliminary experiments, such batching turned out to be more efficient than selecting a learning rate for each example.
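The batch-level selection described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name, the placeholder learning rate values, and the handling of ties (treated here as an incorrect-majority batch) are assumptions not specified in the text.

```python
from typing import Sequence


def select_learning_rate(
    predictions: Sequence[int],
    labels: Sequence[int],
    lr_correct: float,
    lr_incorrect: float,
) -> float:
    """Pick the learning rate for one batch by majority vote."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    num_correct = sum(p == y for p, y in zip(predictions, labels))
    num_incorrect = len(labels) - num_correct
    # Assumption: a tie counts as an incorrect-majority batch.
    if num_correct > num_incorrect:
        return lr_correct
    return lr_incorrect


# Hypothetical batch of 10 predictions, 7 of which match the labels.
preds = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
labels = [3, 1, 4, 1, 5, 9, 2, 0, 0, 0]
chosen = select_learning_rate(preds, labels, lr_correct=0.011, lr_incorrect=0.009)
print(chosen)  # the correct-response rate, since 7 of 10 responses are correct
```

The selected rate would then scale the gradient step for the whole batch, so only one comparison per batch is needed instead of one per example.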
3.2 Learning Rate Updates
The key idea for DVLR’s update method is that the learning rate changes the amount of emphasis the error has on the network’s weight update. In preliminary experiments, several ways of changing the two learning rates over time were tested to discover whether a change in emphasis based on network performance affected the overall accuracy of the network.
The preliminary experiments led to a variable ratio threshold and rate of change in DVLR. A learning rate is updated once the number of correct or incorrect responses reaches the threshold. This method is similar to a variable ratio (VR) schedule of reinforcement in behavioral psychology with one difference. In a VR schedule, reinforcement is only given once the subject reaches the threshold, whereas in DVLR, feedback (in the form of gradient) is provided after every example. This difference is due to the inherent nature of neural networks: if gradients were not provided for every example, they would not have any influence on learning.
An example of a learning rate update is shown in Figure 1 for the correct-response learning rate; an analogous method is used for the incorrect-response learning rate. As demonstrated in the figure, a random number within the range (45-55 in this example) is chosen as the threshold. As the network works through the examples from the dataset, the number of correct responses is counted. Then, once this number reaches the threshold, the learning rate is updated, the count is reset to zero, and a new threshold is randomly chosen within the range. This update method continues for the entire span of the experiment. In Figure 1, the learning rate was decreased, but the update direction and magnitude varied in the DVLR experiments, as described in the next section.
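The counting-and-threshold procedure above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the class name, its parameters, and the choice to apply the percentage change multiplicatively are hypothetical and are not taken from the authors' code.

```python
import random


class VariableRatioRate:
    """One learning rate updated on a variable ratio (VR) schedule."""

    def __init__(self, rate, low, high, percent_change, seed=None):
        self.rate = rate
        self.low = low
        self.high = high
        # Assumption: the change is multiplicative; a negative percent_change
        # decreases the rate and a positive one increases it.
        self.factor = 1.0 + percent_change / 100.0
        self.rng = random.Random(seed)
        self.count = 0
        self.threshold = self.rng.randint(low, high)  # inclusive range

    def record(self, num_responses=1):
        """Count responses; update the rate once the threshold is reached."""
        self.count += num_responses
        if self.count < self.threshold:
            return False
        self.rate *= self.factor
        self.count = 0  # reset the count, as described in the text
        self.threshold = self.rng.randint(self.low, self.high)
        return True


# VR45-55 schedule that lowers a hypothetical starting rate by 1% per update.
lr_correct = VariableRatioRate(rate=0.01, low=45, high=55, percent_change=-1.0, seed=0)
for _ in range(1000):
    lr_correct.record()  # one correct response at a time
print(lr_correct.rate)
```

A second instance with its own counter would track incorrect responses, so the two learning rates change independently of each other.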
4 Baselines and Models
Two different baselines and three different update models of dual variable learning rates were evaluated. The Static-Single (S-S) baseline uses the standard single learning rate of normal gradient descent. Its values were determined to be effective in preliminary experiments. The Static-Dual (S-D) baseline has two static learning rates: the learning rate for correct responses is slightly higher than the single learning rate, and the learning rate for incorrect responses is slightly lower than the single learning rate.
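The Towards-Single (T-S) model begins with the correct-response learning rate at a value slightly above the single learning rate and the incorrect-response learning rate at a value slightly below it. Over the course of the experiment, the correct-response learning rate decreases towards the single learning rate and the incorrect-response learning rate increases towards it. As a heuristic, the starting values of the learning rates and the rates of change were chosen so that the final values of both learning rates are as close as possible to the single learning rate. This heuristic was found to be effective in preliminary experiments.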
For comparison, the Away-Single (A-S) and From-Single (F-S) models use the same variable ratio and rate of change as the T-S model. The A-S model additionally uses the same starting values for the two learning rates as the T-S model, but instead of moving towards the single learning rate, they move away from it. Finally, the F-S model starts both learning rates at the single learning rate and then increases one while decreasing the other.
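The three update models differ only in their starting values and update directions, which the following sketch makes explicit. It rests on several assumptions: the function and parameter names are hypothetical, changes are applied multiplicatively, both rates are updated in lockstep (in DVLR each rate follows its own variable ratio schedule), and the F-S direction (correct-response rate decreasing, incorrect-response rate increasing) is inferred from the order of the "dec, inc" labels in the results tables.

```python
def simulate_model(model, single, gap, pct_correct, pct_incorrect, steps):
    """Return a list of (correct_rate, incorrect_rate) pairs, one per update."""
    if model == "T-S":    # start apart, move towards the single rate
        correct, incorrect = single + gap, single - gap
        f_correct, f_incorrect = 1 - pct_correct / 100, 1 + pct_incorrect / 100
    elif model == "A-S":  # same start as T-S, move away from the single rate
        correct, incorrect = single + gap, single - gap
        f_correct, f_incorrect = 1 + pct_correct / 100, 1 - pct_incorrect / 100
    elif model == "F-S":  # start at the single rate (direction is an assumption)
        correct = incorrect = single
        f_correct, f_incorrect = 1 - pct_correct / 100, 1 + pct_incorrect / 100
    else:
        raise ValueError(f"unknown model: {model}")

    history = [(correct, incorrect)]
    for _ in range(steps):
        correct *= f_correct
        incorrect *= f_incorrect
        history.append((correct, incorrect))
    return history


# Placeholder values for illustration; not the settings used in the paper.
for name in ("T-S", "A-S", "F-S"):
    start, end = simulate_model(name, 0.01, 0.002, 0.5, 0.5, 20)[::20]
    print(name, start, end)
```

With a fixed number of updates, nothing stops the T-S rates from overshooting the single rate, which is why the starting values and rates of change have to be chosen together.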
These specific models were designed to determine what function of variable rate change is best suited for DVLR. Example learning rate functions for the different baselines are shown in Figure 2 and example learning rate functions for the different models are shown in Figure 3.
5 Results
The DVLR method was tested on two different databases: MNIST and CIFAR-10. Experiments were run on networks that are well suited for each database to get the best possible baselines before experimenting with DVLR. The MNIST experiments use a feedforward neural network with 500 hidden nodes, the ReLU activation function, and a batch size of 10. The CIFAR-10 experiments use a convolutional neural network with two convolution layers, a pooling layer, three fully connected linear layers, the ReLU activation function, and a batch size of 10. Both networks were trained with cross-entropy loss and the Adagrad optimizer. All baseline and experimental results were obtained over 10 trials and averaged. All code and original data can be found at: https://github.com/e-liner/NN-VR-LR.
5.1 MNIST Results
Table 1: MNIST training and testing accuracy (%).

| Method | Avg. Train | Final Train | Avg. Test | Final Test |
|---|---|---|---|---|
| Static-Single (S-S) Baseline | 98.84 | 99.91 | 97.93 | 98.21 |
| Static-Dual (S-D) Baselines: | | | | |
| VR195-205, 0% dec, 0% inc | 98.92 | 99.95 | 97.98 | 98.22 |
| VR395-405, 0% dec, 0% inc | 98.93 | 99.94 | 97.94 | 98.20 |
| Towards-Single (T-S) Model: | | | | |
| VR195-205, 0.01% dec, 1% inc | 98.93 | 99.94 | 97.98 | 98.22 |
| VR395-405, 0.0125% dec, 2% inc | 98.90 | 99.93 | 97.94 | 98.24 |
| Away-Single (A-S) Model: | | | | |
| VR195-205, 0.01% inc, 1% dec | 98.94 | 99.96 | 97.92 | 98.19 |
| VR395-405, 0.0125% inc, 2% dec | 98.92 | 99.96 | 97.98 | 98.22 |
| From-Single (F-S) Model: | | | | |
| VR195-205, 0.01% * 0.07 dec, 1% * 0.04 inc | 98.85 | 99.87 | 97.98 | 98.24 |
| VR395-405, 0.0125% * 0.06 dec, 2% * 0.04 inc | 98.83 | 99.87 | 97.93 | 98.22 |
The MNIST experiments focused on two different variable ratios: VR195-205 and VR395-405. As discussed in the previous section, the percentage rate of change and starting values for the dual learning rates were chosen heuristically. In the VR195-205 experiments, the F-S model performed the best out of the three models with a testing accuracy of 98.24%. In the VR395-405 experiments, the T-S model performed the best out of the three models with a testing accuracy of 98.24%. The average training values, final training values, average testing values, and final testing values can be found in Table 1. The average values are included for both training and testing to compare the performance of DVLR to the baselines over the entire span of the experiments.
Table 2: t-test p-values for the MNIST experiments.

| Comparison | p-value |
|---|---|
| S-S vs. S-D 0.7/0.4 | 0.65 |
| S-S vs. VR195-205 T-S | 0.68 |
| S-S vs. VR195-205 A-S | 0.58 |
| S-S vs. VR195-205 F-S | 0.36 |
| S-D 0.7/0.4 vs. VR195-205 T-S | 0.96 |
| S-D 0.7/0.4 vs. VR195-205 A-S | 0.26 |
| S-D 0.7/0.4 vs. VR195-205 F-S | 0.62 |
| S-S vs. S-D 0.6/0.4 | 0.84 |
| S-S vs. VR395-405 T-S | 0.30 |
| S-S vs. VR395-405 A-S | 0.06 |
| S-S vs. VR395-405 F-S | 0.65 |
| S-D 0.6/0.4 vs. VR395-405 T-S | 0.07 |
| S-D 0.6/0.4 vs. VR395-405 A-S | 0.37 |
| S-D 0.6/0.4 vs. VR395-405 F-S | 0.36 |
The accuracy differences in the MNIST tests were very slight, but it is important to note that the T-S model, F-S model, and S-D baseline performed better than the S-S baseline in the VR195-205 experiments, and the T-S model, A-S model, and F-S model all performed better than both the S-S baseline and the S-D baseline in the VR395-405 experiments. These differences suggest that the dual learning rate method can increase the accuracy of a simple feedforward network, but also that the direction and rate of change are important. Additionally, the T-S model, A-S model, and S-D baselines all had increased average training accuracy compared to the S-S baseline, suggesting that the dual learning rate method trains a simple feedforward network faster.
To determine statistical significance, t-test p-values were calculated for all experiments against the S-S and S-D baselines (Table 2). Most of the differences are not statistically significant. However, both the S-S vs. VR395-405 A-S comparison (p = 0.06) and the S-D vs. VR395-405 T-S comparison (p = 0.07) were close to significance. This result suggests that the VR395-405 A-S and VR395-405 T-S models are promising compared to the baselines.
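A comparison of this kind can be sketched with the standard library alone. The text does not state which t-test variant was used, so Welch's unequal-variance test is assumed here; the function name is hypothetical, and the accuracy lists are placeholders for illustration, not results from the paper.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # statistics.variance is the sample variance (divides by n - 1).
    se_a = statistics.variance(sample_a) / n_a
    se_b = statistics.variance(sample_b) / n_b
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Placeholder test accuracies over 10 trials; NOT values from the paper.
baseline = [98.1, 98.3, 98.2, 98.0, 98.2, 98.4, 98.1, 98.3, 98.2, 98.2]
candidate = [98.3, 98.2, 98.4, 98.1, 98.3, 98.5, 98.2, 98.2, 98.3, 98.4]
t_stat, dof = welch_t(candidate, baseline)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

The two-sided p-value then follows from the t distribution with `dof` degrees of freedom; if SciPy is available, `scipy.stats.ttest_ind(candidate, baseline, equal_var=False)` computes the same statistic together with the p-value.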
The experiments suggest that both the dual learning rate and the variable ratio update method can be used to increase speed and accuracy of neural networks that use gradient descent as their update method. These conclusions are even stronger when scaling up to larger networks and datasets, as will be discussed next.
Table 3: CIFAR-10 training and testing accuracy (%).

| Method | Avg. Train | Final Train | Avg. Test | Final Test |
|---|---|---|---|---|
| Static-Single (S-S) Baseline | 58.40 | 64.90 | 56.97 | 60.46 |
| Static-Dual (S-D) Baselines: | | | | |
| VR245-252, 0% dec, 0% inc | 60.28 | 67.93 | 58.61 | 62.20 |
| VR495-505, 0% dec, 0% inc | 59.94 | 67.47 | 58.39 | 61.86 |
| VR745-755, 0% dec, 0% inc | 59.26 | 66.80 | 58.29 | 61.82 |
| Towards-Single (T-S) Model: | | | | |
| VR245-252, 0.04% dec, 0.25% inc | 60.32 | 67.88 | 58.66 | 62.07 |
| VR495-505, 0.04% dec, 0.4% inc | 59.94 | 67.06 | 58.47 | 61.81 |
| VR745-755, 0.04% dec, 0.225% inc | 59.26 | 66.30 | 57.87 | 61.43 |
| Away-Single (A-S) Model: | | | | |
| VR245-252, 0.04% inc, 0.25% dec | 59.81 | 67.70 | 58.39 | 62.21 |
| VR495-505, 0.04% inc, 0.4% dec | 60.16 | 67.46 | 58.78 | 62.49 |
| VR745-755, 0.04% inc, 0.225% dec | 59.63 | 66.79 | 58.00 | 61.34 |
| From-Single (F-S) Model: | | | | |
| VR245-252, 0.04% * 0.05 dec, 0.25% * 0.015 inc | 59.05 | 64.51 | 57.78 | 60.17 |
| VR495-505, 0.04% * 0.035 dec, 0.4% * 0.015 inc | 59.23 | 66.05 | 59.94 | 61.09 |
| VR745-755, 0.04% * 0.03 dec, 0.225% * 0.02 inc | 59.47 | 66.41 | 59.26 | 61.60 |
5.2 CIFAR-10 Results
The CIFAR-10 experiments focused on three different variable ratios: VR245-252, VR495-505, and VR745-755. As discussed in the previous section, the percentage rate of change and starting values for the dual learning rates were chosen heuristically. In the VR245-252 experiments, the A-S model performed slightly better than the T-S model, F-S model, S-D baseline, and S-S baseline with a testing accuracy of 62.21%. In the VR495-505 experiments, the A-S model again performed the best with a testing accuracy of 62.49%. The average training values, final training values, average testing values, and final testing values can be found in Table 3. The average values are included for both training and testing to compare the performance of DVLR to the baselines over the entire span of the experiments.
In the experiments, the T-S model, A-S model, and S-D baseline performed better than the S-S baseline with all three variable ratios. These results show that the dual learning rate method increases the accuracy of a convolutional neural network. The F-S model performed worse than the S-D baseline in all experiments, and performed worse than the S-S baseline in the VR245-252 experiment. None of the VR745-755 experiments outperformed the S-D baselines, which suggests that the variable ratio method only works with some values. In the VR245-252 and VR495-505 experiments, the A-S model performed better than both baselines, while the T-S model performed better than the S-S baseline only. This result suggests that with the right fine-tuning of the variable ratio and percent rate of change, DVLR can generate more accurate convolutional neural networks.
Statistical significance was determined for all experiments against the S-S and S-D baselines (Table 4). The VR245-252 and VR495-505 experiments showed more significant differences than the VR745-755 experiments, where the smallest p-value was 0.10. In the VR245-252 comparisons, the T-S, A-S, and S-D test accuracies were found to be significantly different from S-S. The F-S model was also found to be significantly different from S-D. Additionally, in the VR495-505 comparisons, the A-S test accuracies were found to be significantly different from S-S. In contrast, apart from the VR245-252 F-S model, the differences between the DVLR results and the S-D baselines were not found to be significant. This result suggests that the dual learning rate method yields a significant increase over a standard single learning rate.
Table 4: t-test p-values for the CIFAR-10 experiments.

| Comparison | p-value |
|---|---|
| S-S vs. S-D 0.5/0.015 | 0.02 |
| S-S vs. VR245-252 T-S | 0.03 |
| S-S vs. VR245-252 A-S | 0.03 |
| S-S vs. VR245-252 F-S | 0.70 |
| S-D 0.5/0.15 vs. VR245-252 T-S | 0.74 |
| S-D 0.5/0.15 vs. VR245-252 A-S | 0.98 |
| S-D 0.5/0.15 vs. VR245-252 F-S | 0.00 |
| S-S vs. S-D 0.035/0.015 | 0.06 |
| S-S vs. VR495-505 T-S | 0.07 |
| S-S vs. VR495-505 A-S | 0.02 |
| S-S vs. VR495-505 F-S | 0.40 |
| S-D 0.035/0.015 vs. VR495-505 T-S | 0.92 |
| S-D 0.035/0.015 vs. VR495-505 A-S | 0.23 |
| S-D 0.035/0.015 vs. VR495-505 F-S | 0.10 |
| S-S vs. S-D 0.03/0.02 | 0.10 |
| S-S vs. VR745-755 T-S | 0.18 |
| S-S vs. VR745-755 A-S | 0.29 |
| S-S vs. VR745-755 F-S | 0.13 |
| S-D 0.03/0.02 vs. VR745-755 T-S | 0.47 |
| S-D 0.03/0.02 vs. VR745-755 A-S | 0.47 |
| S-D 0.03/0.02 vs. VR745-755 F-S | 0.69 |
The experiments suggest that both the dual learning rate and variable ratio update method can be used to increase the speed and accuracy of larger neural networks. The CIFAR-10 experimental improvements are more pronounced than the MNIST experimental improvements, suggesting that DVLR should scale up well to larger architectures and datasets.
6 Related Work
In the brain, the amygdala and ventral striatum work together to facilitate reinforcement learning (Averbeck, 2017). The amygdala has a faster learning rate than the ventral striatum, and Averbeck concluded that having multiple neural systems learn at different rates facilitates more effective learning in dynamic environments.
There is also prior computational work on using different learning rates for the different parameters of a neural network (Kim et al., 1995). Kim et al. assigned a distinct learning rate to each reference vector in their vector quantization model and updated the reference vectors with a competitive learning method. As with DVLR, the networks performed faster and more accurately when using more than one learning rate for the network. The main difference from DVLR is that their method uses one learning rate for each reference vector, which increases the number of parameters significantly.
Smith (2017) used a non-stationary learning rate that cycles between reasonable boundary values. He was able to achieve a significant increase in accuracy on the CIFAR-10 database. DVLR takes this idea one step further by introducing insights from behavioral psychology to determine how the learning rates should change as a function of the network's performance.
7 Discussion and Future Work
In both the MNIST and CIFAR-10 experiments, DVLR was able to achieve a higher accuracy than the single learning rate baseline. More importantly, the results were stronger in the CIFAR-10 tests than in the MNIST tests. This is promising, as CIFAR-10 is a larger dataset than MNIST and the network used in the CIFAR-10 tests is a larger, more complicated architecture. As such, a promising direction for future work is to test DVLR with larger architectures such as WRN and DenseNet, and in larger domains like natural language processing, where neural networks that use gradient descent are commonly used.
DVLR makes two contributions. First, it takes advantage of dual learning rates, one for the network’s correct responses and one for its incorrect responses. Second, it demonstrates that their effect is increased even further with variable ratio update schedules. These two techniques combine in DVLR, a new training technique for neural networks that is motivated by behavioral psychology. DVLR was tested on feedforward networks with the MNIST dataset and on convolutional networks with the CIFAR-10 dataset and resulted in faster training and improved accuracy on both networks. Moreover, it was found to be more powerful in larger architectures and datasets, making it a promising technique for the future.
- Averbeck, B. (2017). Amygdala and ventral striatum population codes implement multiple learning rates for reinforcement learning. doi:10.1109/SSCI.2017.8285354
- Kim, C., Cho, S., & Lee, C. (1995). Fast competitive learning with classified learning rates for vector quantization. Signal Processing-Image Communication, 6, 499-505. doi:10.1016/0923-5965(94)00032-E
- Sherrick, C., Ferster, C., & Skinner, B. (1959). Schedules of reinforcement. The American Journal of Psychology, 72, 320. doi:10.2307/1419391
- Smith, L. (2017). Cyclical learning rates for training neural networks (pp. 464-472). doi:10.1109/WACV.2017.58
- Thorndike, E., & Bruce, D. (2000). Animal intelligence: Experimental studies. New Brunswick, NJ: Transaction Publishers.