On Study of the Binarized Deep Neural Network for Image Classification

02/24/2016 ∙ by Song Wang, et al. ∙ FUJITSU

Recently, the deep neural network (derived from the artificial neural network) has attracted many researchers' attention by its outstanding performance. However, since this network requires high-performance GPUs and large storage, it is very hard to use it on individual devices. In order to improve the deep neural network, many trials have been made by refining the network structure or training strategy. Unlike those trials, in this paper, we focused on the basic propagation function of the artificial neural network and proposed the binarized deep neural network. This network is a pure binary system, in which all the values and calculations are binarized. As a result, our network can save a lot of computational resource and storage. Therefore, it is possible to use it on various devices. Moreover, the experimental results proved the feasibility of the proposed network.




1 Introduction

The research of artificial neural networks (ANN) began more than 70 years ago, proposed by Warren McCulloch and Walter Pitts [22], Donald Hebb [12] and Frank Rosenblatt [25]. In particular, [25] introduced a two-layer network for pattern recognition. Figure 1 shows the basic function of ANN, that is, a neuron of a higher layer is calculated from the neurons of the prior layer with the connecting weights. However, there was no solution for network training until the backpropagation (gradient descent) algorithm was created by Paul Werbos [30]. After James McClelland [26] introduced the ANN as a simulation of the natural neural process and its usage in artificial intelligence (AI), the research of ANN became popular. Since then, ANN has been successfully applied to image classification [1, 2, 5], character recognition [24], face recognition [20], speech recognition [28] and so on. Moreover, a theoretical explanation of ANN’s success was also given by Kurt Hornik [18].

Figure 1: The basic function of ANN.

Because of the limitations of computers, ANN research stagnated until the new century. With the development of computers, the training of large-scale neural networks became possible. As a result, research on deep neural networks (DNN) emerged and attracted more and more attention. Examples include Geoff Hinton’s deep belief nets [13], Yann LeCun and Dan Ciresan’s studies on deep convolutional neural networks [21, 4], and Alex Graves’s deep recurrent neural networks [9]. The features of DNN can be summarized as follows.

  • Large-scale network: the DNN model usually has very large structure (with many layers), including millions of neurons and connections. Consequently, the computational cost of DNN is extremely high, thus most of the DNN experiments are done with GPUs.

  • Better performance: DNN achieved state-of-the-art results on various tasks and competitions, and brought breakthroughs to those fields, such as handwritten character recognition [7, 10], image classification [4, 3], speech recognition [8] and so on.

  • Lack of theoretical explanation: although the DNN achieved much better performance than conventional methods, there is no convincing explanation for such success. For example, it is well known that some trick in training could improve the performance of DNN significantly, but it was hard to explain such effect theoretically.

Obviously, although the performance of DNN is promising, it still needs to be improved in different ways. On one hand, in order to pursue higher recognition rate, several optimization methods for training were proposed, such as dropout [14] and dropconnect [29]. Those methods can reduce the overfitting problem significantly. On the other hand, new understanding of the neural network emerges and may extend the ability of DNN. For example, in Ian Goodfellow’s recent work [6] for digit string recognition, the output layer is trained to show both the digit number and the recognition result of each digit. By doing this, their DNN model is able to recognize the digit string directly, without any segmentation process. This work extended the DNN from single character recognition to character string recognition. Inspired by this work, we may change the basic framework of DNN to find more possibilities.

In this paper, we focus on the basic function of ANN and try to make it more suitable for computers. Usually, the ANN is seen as a simulation of the natural neural process, thus its neurons and weights are all real numbers. Such values can represent the electric signals generated by neural cells. However, ANN models are usually realized on computers. As we all know, computers process data based on binary values, “0” and “1”. In other words, we can say that the basic “neural cell” of a computer generates binary signals. Inspired by this observation, we proposed a new type of neural network: the binarized deep neural network (BDNN). In BDNN, all the neurons and weights are binary values; at the same time, the calculation of the basic function of BDNN is also Boolean.

Actually, there has been research on binary neural networks (Hopfield neural networks) [15, 16, 17, 23, 27] and corresponding training algorithms [11, 19]. Nevertheless, this kind of neural network is quite different from BDNN. Although the inputs and outputs of a binary neural network are binary values, its weights are real numbers. Therefore, the calculation of a binary neural network is no different from that of a conventional neural network. In contrast, the BDNN is a pure binary system, in which all the variables and operations are binarized (Boolean).

Compared with conventional deep neural network, the BDNN is expected to have several promising merits, which are shown as follows.

  • Less storage request: since all the weights in BDNN are binary values, only 1 bit is needed to store one weight. As a result, a large network can be saved with small storage. In contrast, at least 16 bits must be used to store one real number weight of a conventional neural network.

  • Higher speed on CPU: the basic calculation of a CPU is the Boolean calculation, which is also its most efficient one. Since the BDNN only uses Boolean calculation, its processing speed can be easily optimized on CPU. Consequently, it is possible to run BDNN at high speed on devices with only CPUs. In contrast, conventional DNN can only run fast with GPUs.

  • Clear network response: in BDNN, it is quite easy to observe the response of neurons and weights in the propagation process. This is because each neuron or weight only has two different kinds of status. This may be helpful for us to design proper learning strategy for BDNN.
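The storage argument in the first item can be illustrated with a small bit-packing sketch. The helper names `pack_weights` and `unpack_weights`, the sample weights, and the choice of storing +1 as bit 1 and -1 as bit 0 are all assumptions made for illustration; they are not taken from the paper.

```python
def pack_weights(weights):
    """Pack a list of -1/+1 weights into bytes, 8 weights per byte."""
    out = bytearray((len(weights) + 7) // 8)
    for i, w in enumerate(weights):
        if w == 1:                       # assumed encoding: +1 -> bit 1, -1 -> bit 0
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def unpack_weights(data, count):
    """Recover the first `count` -1/+1 weights from packed bytes."""
    return [1 if (data[i // 8] >> (i % 8)) & 1 else -1 for i in range(count)]


weights = [1, -1, -1, 1, 1, 1, -1, 1, -1, 1]   # hypothetical binary weights
packed = pack_weights(weights)

# Ten weights fit in two bytes; at 16 bits per weight they would need twenty.
print(len(packed), len(weights) * 16 // 8)
```

Packing and unpacking are lossless here, so the bit-level form can serve as the stored model.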

In summary, with binary variables and Boolean operations, the BDNN is able to run with reasonable computational resource and storage. Although now the performance of BDNN may not be comparable with the conventional deep neural network of the same scale, it has the potential to be improved in the future.

This paper is organized as follows. In Section 2, the principles of the BDNN are introduced, as well as the hybrid binarized deep neural network (hybrid-BDNN) for non-binary input data. In Section 3, a training method is proposed for BDNN and hybrid-BDNN. In Section 4, comparison experiments of BDNN and conventional DNN are analyzed. The last section is the conclusion.

2 The binarized deep neural network

In this section, the recognition process (forward propagation) of BDNN is introduced. As mentioned above, only binary variables and Boolean operations are used in this process. Moreover, in order to use BDNN on non-binary input data, we also introduced the hybrid-BDNN, which contained both conventional neural network part and BDNN part.

2.1 The basic function for BDNN

As shown in Fig. 1, with this basic function, we can build network of any complicated structure. Therefore, the BDNN is actually a new definition of the basic function. Just like the conventional DNN, with the basic function of BDNN, any neural network structure can be built.

Usually, a basic function of neural network should contain two different types of calculation: the linear and nonlinear calculations. For example, as shown in Fig. 1, in a conventional neural network, the linear calculation is an inner product of the input neurons $x_i$ and the corresponding weights $w_i$, that is, $\sum_i w_i x_i$; the nonlinear calculation of a conventional neural network is an activation function (sigmoid, hyperbolic tangent, etc.) or pooling (often used in DNN). Consequently, in the definition of the basic function of BDNN, we also defined both the linear and nonlinear calculations.

First, since the BDNN is created by binary values, the basic calculation should be chosen from Boolean algebra. The Boolean binary operations are “and” ($\wedge$), “or” ($\vee$) and several derived operations (“exclusive or” ($\oplus$), “equivalence” ($\leftrightarrow$), “material implication” ($\rightarrow$)). The truth table of those operations is shown in Table 1. Obviously, the “exclusive or” and “equivalence” should be chosen because their results are balanced between “0” and “1”. In BDNN, we chose the “equivalence” as the linear calculation. If we use $-1$ to represent “0” and $1$ to represent “1”, then the result of “equivalence” is equal to the multiplication of real numbers. For convenience, from here we will treat the binary values as real numbers and use the corresponding real number operation instead of the Boolean operation (just like the “equivalence” to multiplication). As shown in Fig. 1, assume that $x_1, \dots, x_n$ are the input neurons of BDNN, $w_1, \dots, w_n$ are the corresponding weights and $y$ is the output neuron, then we have $x_i, w_i, y \in \{-1, 1\}$. As mentioned above, the linear calculation of BDNN is the product $x_i w_i$ for each $i = 1, \dots, n$.

Table 1: Truth table of Boolean operations.
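As a quick check of the claim that “equivalence” matches real-number multiplication, the following sketch enumerates the standard truth table of the five Boolean operations named above and verifies the correspondence under the encoding “0” to -1 and “1” to +1. The function names are hypothetical.

```python
from itertools import product


def encode(bit):
    """Map Boolean 0/1 to the real-number encoding -1/+1."""
    return 1 if bit else -1


def truth_table():
    """Rows of the truth table for the five Boolean operations."""
    rows = []
    for a, b in product((0, 1), repeat=2):
        rows.append({
            "a": a,
            "b": b,
            "and": a & b,
            "or": a | b,
            "xor": a ^ b,
            "equivalence": int(a == b),
            "implication": int((not a) or b),
        })
    return rows


for row in truth_table():
    # Equivalence on bits agrees with multiplication on the -1/+1 encoding.
    assert encode(row["equivalence"]) == encode(row["a"]) * encode(row["b"])
    print(row)
```

Only “exclusive or” and “equivalence” produce two “0” and two “1” results, which is the balance property mentioned in the text.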

Second, the nonlinear calculation of BDNN is defined as follows: we count the number of “1” and “-1” among the products $x_i w_i$ of the input neurons $x_i$ and the weights $w_i$, and let the calculation return the one with the larger number. Assume $f$ is the basic function of BDNN; with the linear and nonlinear calculations defined above, we have

$$y = f(x_1, \dots, x_n; w_1, \dots, w_n) = \operatorname{sign}\Big(\sum_{i=1}^{n} x_i w_i\Big). \quad (2.1)$$

Figure 2 shows an example of the calculation of (2.1) with three inputs and three weights. The Boolean operation is first applied as the linear calculation to obtain three results. Since two of these results share the same value and the third differs, the output takes the value that appears twice. Please note that although (2.1) is written as a real number function, it denotes the Boolean operation of “equivalence” and a counting operation. With (2.1), the forward propagation of BDNN can be realized. For a certain set of input binary data, the BDNN can calculate its corresponding output, which is also binary.

Figure 2: The calculation of the BDNN basic function $f$.
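The basic function can be sketched in two equivalent ways: on -1/+1 values with multiplication and a sign, and on packed bits with an XNOR followed by a bit count. The function names and the sample values below are hypothetical (they are not the values of Fig. 2), and the bit layout is an assumption for illustration.

```python
def bdnn_neuron(x, w):
    """Basic BDNN function on -1/+1 values: majority vote of the products."""
    if len(x) != len(w) or len(x) % 2 == 0:
        raise ValueError("need equally sized inputs of odd length")
    total = sum(xi * wi for xi, wi in zip(x, w))
    return 1 if total > 0 else -1


def pack(values):
    """Pack -1/+1 values into an int, one bit per value (+1 -> 1, -1 -> 0)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def bdnn_neuron_bits(x_bits, w_bits, n):
    """Same neuron on packed bits: XNOR, then count the set bits."""
    mask = (1 << n) - 1
    agree = ~(x_bits ^ w_bits) & mask    # bit is 1 where x_i equals w_i
    ones = bin(agree).count("1")         # number of "+1" products
    return 1 if 2 * ones > n else -1


x = [1, -1, 1]    # hypothetical inputs
w = [1, 1, -1]    # hypothetical weights
print(bdnn_neuron(x, w), bdnn_neuron_bits(pack(x), pack(w), len(x)))
```

Both forms give the same output for every input; the packed form is the one that maps onto word-level Boolean instructions.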

2.2 The hybrid-BDNN

Clearly, the BDNN introduced above is only able to process binary input data. If the input data is non-binary, such as grayscale image, the BDNN can not be used directly. Therefore, we should first convert the non-binary data into binary before the BDNN is used. For such situation, we proposed the hybrid-BDNN, which is a combination of conventional neural network and BDNN.

As shown in Fig. 3, the Hybrid-BDNN contains three parts: the normal neural network part, the transition part and the BDNN part. The lower layers of hybrid-BDNN are normal neural network part, which is connected to the input data (binary or non-binary); the higher layers are the BDNN part, which generates the result. The transition part is a single layer between the normal neural network part and BDNN part, by which the two different neural networks are combined.

Figure 3: The structure of hybrid-BDNN.

The basic function of the normal neural network part is the same as that of conventional networks (inner product and activation function); the basic function of BDNN has just been introduced above. Consequently, the forward propagation can be conducted in both the normal neural network part and the BDNN part. The remaining problem is the transition part. If we define the basic function of the transition part, then we can conduct the forward propagation of the whole hybrid-BDNN.

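As shown in Fig. 3, assume that $u_1, \dots, u_m$ are the input neurons of the transition part and $v_1, \dots, v_k$ are the output neurons. As mentioned above, the $u_i$ belong to the normal neural network part, thus they are all real numbers. The $v_j$ belong to the BDNN part, so they are all binary values. Let $g$ denote the basic function of the transition part. If we take $v_j$ as an example and $w_{1j}, \dots, w_{mj}$ are the corresponding weights, then we obtain

$$v_j = g(u_1, \dots, u_m; w_{1j}, \dots, w_{mj}).$$

Obviously, in order to calculate with the real-valued $u_i$, the weights $w_{ij}$ are also real numbers. Then we define $g$ as follows:

$$g(u_1, \dots, u_m; w_{1j}, \dots, w_{mj}) = \begin{cases} 1, & \phi\big(\sum_{i=1}^{m} w_{ij} u_i\big) > t \\ -1, & \phi\big(\sum_{i=1}^{m} w_{ij} u_i\big) < t \end{cases} \quad (2.2)$$

where $\phi$ is the activation function and $t$ is a fixed threshold. In fact, the calculation $\phi\big(\sum_i w_{ij} u_i\big)$ is just the basic function of the normal neural network part. Equation (2.2) means that we use a threshold to convert the real number output of the normal neural network part into a binary value.

With (2.2), the forward propagation of the whole hybrid-BDNN can be realized. This network can be used for any input data. By the way, if the range of the activation function is symmetric about zero, as with the hyperbolic tangent, then we can set the threshold $t = 0$. This is convenient for the training.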
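A minimal sketch of one transition neuron follows, assuming the hyperbolic tangent as activation and a threshold of zero. The name `transition_neuron`, the sample values, and the handling of an activation exactly equal to the threshold (mapped to -1 here) are assumptions for illustration.

```python
import math


def transition_neuron(u, w, threshold=0.0, activation=math.tanh):
    """Real-valued inner product and activation, then thresholded to -1/+1."""
    a = activation(sum(ui * wi for ui, wi in zip(u, w)))
    # Assumption: a value exactly at the threshold is mapped to -1.
    return 1 if a > threshold else -1


u = [0.2, 0.9, 0.5]     # hypothetical real-valued inputs (e.g. pixel values)
w = [0.5, -1.0, 0.3]    # hypothetical real-valued weights
print(transition_neuron(u, w))
```

The output is already binary, so everything above this layer can use the Boolean basic function.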

In summary, the BDNN is very suitable for binary input data classification, such as binary image of characters. If the input data is non-binary, the hybrid-BDNN can be used. Please note that it is better not to use too many layers in normal neural network part, otherwise the computation speed of hybrid-BDNN may be slowed much. In the experiments, we only used one layer (including the input data) as the normal neural network part.

3 Training method for BDNN and hybrid-BDNN

Usually, the gradient descent algorithm is seen as the training method for neural networks. For the training of BDNN and hybrid-BDNN, we also used this algorithm. However, compared with the conventional neural network, the BDNN has obviously different properties, thus we can not directly use this algorithm on BDNN training. In order to solve this problem, some approximation and conversion are applied to BDNN to make it suitable for gradient descent training.

3.1 Gradient descent training for BDNN

In conventional gradient descent training, the weights are adjusted in each iteration by a small value (depending on the error propagation and learning rate). Nevertheless, since the weights in BDNN are binarized, it is hard to adjust them in the same way. Hence, in order to apply the gradient descent algorithm, real numbers are used instead of the binary values in the training of BDNN. A conversion function $C$ is defined to convert the real numbers to the corresponding binary values, which is

$$C(r) = \begin{cases} 1, & r > 0 \\ -1, & r < 0 \end{cases} \quad (3.1)$$

With function (3.1), we can convert the trained real number weights to binary values.

Figure 4: The curve of output neuron by a certain input neuron and its approximation.

However, the key point of using real numbers is that we must keep the forward propagation result the same as that of the binarized network; otherwise, the training is meaningless. Therefore, in order to satisfy this request, a basic function for training is defined. First, assume $x_i$, $w_i$ and $y$ are the input neurons, weights and output of the BDNN; $\tilde{x}_i$, $\tilde{w}_i$ and $\tilde{y}$ are the corresponding real numbers of training. Second, with $C$ the conversion function of (3.1), assume

$$x_i = C(\tilde{x}_i), \quad w_i = C(\tilde{w}_i). \quad (3.2)$$

Then, the basic function of training $\tilde{f}$ is given by

$$\tilde{y} = \tilde{f}(\tilde{x}_1, \dots, \tilde{x}_n; \tilde{w}_1, \dots, \tilde{w}_n) = \sum_{i=1}^{n} C(\tilde{x}_i)\, C(\tilde{w}_i). \quad (3.3)$$

As mentioned above, in (2.1), the basic function of BDNN counts the number of “1” and “-1” in the products $x_i w_i$ and returns the one with the larger number. Accordingly, in (3.3), if the positive values among $C(\tilde{x}_i)\, C(\tilde{w}_i)$ outnumber the negative ones, then $\tilde{y}$ is a positive value; otherwise, $\tilde{y}$ is negative. Consequently, because of (3.2), we obtain

$$y = C(\tilde{y}).$$

Clearly, with (3.3), if the inputs of training are the same as those of the BDNN, then after forward propagation, each real number neuron of training equals the corresponding binary neuron of BDNN (by using the conversion function $C$), including the output neurons. As a result, we can train the BDNN with real numbers. After training, we just need to convert the real number weights into binary values and then a trained BDNN is obtained.
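The consistency requirement between the real-valued training pass and the binary pass can be checked with a short sketch. It assumes that the training function sums the binarized products of neurons and weights; that form, the function names, and the sample values are assumptions for illustration rather than the paper's exact definition.

```python
def to_binary(r):
    """Conversion function: +1 for positive reals, -1 for negative reals."""
    if r == 0:
        raise ValueError("conversion is undefined at zero")
    return 1 if r > 0 else -1


def train_forward(x_real, w_real):
    """Assumed training function: sum of the binarized products."""
    return sum(to_binary(x) * to_binary(w) for x, w in zip(x_real, w_real))


def bdnn_forward(x_bin, w_bin):
    """Binary forward pass: majority vote over the products x_i * w_i."""
    total = sum(x * w for x, w in zip(x_bin, w_bin))
    return 1 if total > 0 else -1


x_real = [0.3, -0.7, 0.1]     # hypothetical real-valued neurons
w_real = [-0.2, -0.5, 0.9]    # hypothetical real-valued weights
y_real = train_forward(x_real, w_real)
y_bin = bdnn_forward([to_binary(x) for x in x_real],
                     [to_binary(w) for w in w_real])

# Converting the real-valued output gives the binary network's output.
assert to_binary(y_real) == y_bin
```

With an odd number of inputs the sum is an odd integer, so the conversion of the output is always defined.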

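After the forward propagation of training is solved, the next problem is the backpropagation. In order to use the gradient descent algorithm, we must define two partial derivatives: $\partial \tilde{y} / \partial \tilde{x}_k$ and $\partial \tilde{y} / \partial \tilde{w}_k$, where $\tilde{y}$ is the real number output of training and $C$ is the conversion function of (3.1). With these two partial derivatives, the backpropagation can be conducted.

Assume $\tilde{x}_k$ is a certain neuron from $\tilde{x}_1, \dots, \tilde{x}_n$ and $\tilde{w}_k$ is its corresponding weight. Let $R$ take the form

$$R = \sum_{i \neq k} C(\tilde{x}_i)\, C(\tilde{w}_i).$$

Then, we have

$$\tilde{y} = C(\tilde{x}_k)\, C(\tilde{w}_k) + R.$$

It is obvious that $R$ and $\tilde{x}_k$ are independent. Thus we can draw a curve of $\tilde{y}$ by $\tilde{x}_k$, which is shown in Fig. 4 (a). This curve is not continuous. We can see that the partial derivative is infinite at $\tilde{x}_k = 0$ and $0$ at the rest of the positions. Clearly, we cannot use such a partial derivative for the gradient descent algorithm. Therefore, an approximation of this curve (shown in Fig. 4 (b)) is used to calculate $\partial \tilde{y} / \partial \tilde{x}_k$. The new curve is a straight line. Here, we assume that $\tilde{w}_k > 0$. If $\tilde{w}_k < 0$, the slope of the approximation line will be reversed. Consequently, by using the approximation, we obtain

$$\frac{\partial \tilde{y}}{\partial \tilde{x}_k} \approx s \cdot C(\tilde{w}_k). \quad (3.4)$$

Similarly, since $\tilde{x}_k$ and $\tilde{w}_k$ are symmetric in the function $\tilde{f}$,

$$\frac{\partial \tilde{y}}{\partial \tilde{w}_k} \approx s \cdot C(\tilde{x}_k). \quad (3.5)$$

Here, in (3.4) and (3.5), $s$ is the value of the slope; $C(\tilde{w}_k)$ or $C(\tilde{x}_k)$ determines the sign of the slope. Please note that the slope value can be set to different proper values.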

With (3.4) and (3.5), the gradient descent algorithm can be applied to train the BDNN. Although an approximation was used, the following experimental results showed that, the BDNN could be trained well with such method.
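The approximated derivatives can be sketched as follows, assuming each one is a constant slope multiplied by the sign of the paired value, as the description of the slope and its sign suggests. The slope of 1.0, the learning rate, the function names, and the plain gradient descent update are hypothetical choices for illustration.

```python
def to_binary(r):
    """Conversion function: +1 for positive reals, -1 for negative reals."""
    if r == 0:
        raise ValueError("conversion is undefined at zero")
    return 1 if r > 0 else -1


def approx_gradients(x_real, w_real, slope=1.0):
    """Approximate the partial derivatives of the output of one neuron.

    d(output)/d(x_k) ~ slope * sign(w_k) and d(output)/d(w_k) ~ slope * sign(x_k).
    """
    grad_x = [slope * to_binary(w) for w in w_real]
    grad_w = [slope * to_binary(x) for x in x_real]
    return grad_x, grad_w


def sgd_step(w_real, grad_w, output_error, lr=0.1):
    """One gradient descent update of the real-valued weights."""
    return [w - lr * output_error * g for w, g in zip(w_real, grad_w)]


x_real = [0.3, -0.7]     # hypothetical real-valued neurons
w_real = [0.05, -0.5]    # hypothetical real-valued weights
grad_x, grad_w = approx_gradients(x_real, w_real)
new_w = sgd_step(w_real, grad_w, output_error=1.0)

# The first weight crosses zero, so its binarized value flips after the update.
print(to_binary(w_real[0]), to_binary(new_w[0]))
```

Weights only change the binary network when their real-valued version crosses zero, which is why small updates can still be accumulated over many iterations.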

3.2 The training of transition part of hybrid-BDNN

As shown in Fig. 3, now the backpropagation can be applied to the normal neural network part and the BDNN part. However, in order to apply the backpropagation on hybrid-BDNN, we still need to consider about the backpropagation of the transition part.

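First, the same as in the BDNN training, real numbers are used instead of the binary values in the training of the transition part. Assume that $\tilde{v}_j$ is the corresponding real number of the binary output $v_j$ and function $\tilde{g}$ is the real number version of the transition function $g$. Then, the same as with $\tilde{f}$, in order to keep the same forward propagation result, $\tilde{g}$ should satisfy the following equation, where $C$ is the conversion function of (3.1):

$$C\big(\tilde{g}(u_1, \dots, u_m; w_{1j}, \dots, w_{mj})\big) = g(u_1, \dots, u_m; w_{1j}, \dots, w_{mj}).$$

Therefore, with $u_i$ the real-valued inputs of the transition part, $w_{ij}$ the weights, $\phi$ the activation function and $t$ the threshold of (2.2), we define $\tilde{g}$ as

$$\tilde{v}_j = \tilde{g}(u_1, \dots, u_m; w_{1j}, \dots, w_{mj}) = \phi\Big(\sum_{i=1}^{m} w_{ij} u_i\Big) - t. \quad (3.6)$$

Obviously, the $\tilde{g}$ of (3.6) satisfies the request. If we set $t = 0$, then $\tilde{g}$ is just the activation function. Naturally, for a certain neuron $u_i$ and its weight $w_{ij}$, the two derivatives of the transition part are given by

$$\frac{\partial \tilde{v}_j}{\partial u_i} = \phi'\Big(\sum_{l=1}^{m} w_{lj} u_l\Big)\, w_{ij}, \quad \frac{\partial \tilde{v}_j}{\partial w_{ij}} = \phi'\Big(\sum_{l=1}^{m} w_{lj} u_l\Big)\, u_i. \quad (3.7)$$

In (3.7), the two partial derivatives are just the same as those of the normal neural network. With (3.7), the backpropagation of the transition part becomes possible.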

Consequently, by using the above method, the backpropagation can be applied to the whole hybrid-BDNN. Now we can train the hybrid-BDNN by using the gradient descent algorithm.

3.3 Special training technique for BDNN

Clearly, the training method for BDNN proposed in this paper is very close to the conventional training method. Thus it is possible to borrow the training techniques of conventional neural networks to BDNN, such as dropout and dropconnect. Besides, there are several special training techniques, which are necessary for BDNN.

First, in the training, we must keep the $n$ of (3.3), that is, the number of input neurons, an odd number all the time. This is because the output of $C(0)$ is not defined. The result of $\tilde{f}$ should be either a positive number or a negative number. Consequently, in the following experiments, in the BDNN, the neuron number was always an odd number.

Second, in the training of a conventional neural network, the error of the output is determined by the difference between the output neuron and its ground truth. However, in the training of BDNN, the error is determined by not only the difference but also the sign of each value. Assume the error of a certain output neuron $\tilde{y}$ is $e$ and the corresponding ground truth is $d$, then the error calculation is given by

$$e = \begin{cases} 0, & \tilde{y}\, d > 0 \\ \tilde{y} - d, & \text{otherwise} \end{cases} \quad (3.8)$$

Equation (3.8) means that if the output neuron and its ground truth are both positive or both negative, then the error of this output is $0$; otherwise, the error is calculated as in a conventional neural network. Actually, according to (3.1), if the output neuron has the same sign as its ground truth, then the output is seen as correct. This technique is essential for the BDNN training, without which the training cannot converge.
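The sign-aware error rule can be sketched in a few lines. The function name and the sample values are hypothetical, and the sign convention of the difference (output minus ground truth) is an assumption, since either convention is used in practice.

```python
def output_error(y_real, target):
    """Zero error when output and ground truth share a sign; difference otherwise."""
    if y_real * target > 0:
        return 0.0
    # Assumed convention: error = output - ground truth.
    return y_real - target


# A weakly positive output for a positive target already counts as correct.
print(output_error(0.3, 1))
# A negative output for a positive target is penalized by the full difference.
print(output_error(-0.3, 1))
```

Without the zero branch, an output that is already on the correct side would keep being pushed toward the target magnitude, which has no effect on the binarized result.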

4 Experiments

In the experiments, two different network structures were studied: the classical three-layer network and the convolutional neural network (CNN). Besides the BDNN and hybrid-BDNN, conventional neural networks of the same structures were also tested for comparison. Since we currently have only the basic training method for BDNN, most of the training optimization techniques of the conventional neural network were not used, in order to make a fair comparison. For example, in the experiments, the learning rate was fixed and the simple hyperbolic tangent function was used as the activation function.

Moreover, two datasets, MNIST and CIFAR10 were used for the experiments. As we know, MNIST is a dataset of handwritten digits, thus it is suitable for binary data classification test. In contrast, the CIFAR10 was used as non-binary data, on which the hybrid-BDNN was tested. In each dataset, the training data was used to train the networks; the test data was used as validation data. After the training was finished, the iteration with the lowest error rate on test data was seen as the final result.

4.1 The classical three-layer network

In the early stages of ANN, before the DNN was introduced, the three-layer network was widely used for classification tasks. As shown in Fig. 5, in such network, there are three different layers: the input layer, the hidden layer and the output layer. The input layer contains the input data while the output layer generates the classification results. Besides, all the layers are fully connected. Since this network structure was very classic, so the BDNN of this structure was first tested.

Figure 5: The structure of three-layer network.
Figure 6: The network structure of CNN.

For the experiments of the three-layer BDNN, the binary data of MNIST was used. Because the original images of MNIST are grayscale, we used a simple binarization method (fixed threshold) to convert the grayscale images into binary. All the training data of MNIST was used to train the network and the test data of MNIST was used as validation data.

In MNIST, the image size is $28 \times 28$, thus there are 784 pixels, which are used as input neurons. As mentioned above, in order to make the number of neurons in the input layer odd, one neuron was simply added to the input layer and set to a fixed value. Consequently, the input layer contains 785 neurons. Moreover, since the digits have 10 classes, the output layer is set to 10 neurons.

For the hidden layer, we used two different numbers, one is 1571 (double of the input neuron number) and the other is 2355 (three times of the input neuron number). Then, two different network structures are obtained as follows.

  • Structure A: the input layer contains 785 neurons, followed by a hidden layer of 1571 neurons, then an output layer of 10 neurons (one neuron stands for one digit class).

  • Structure B: the input layer contains 785 neurons, followed by a hidden layer of 2355 neurons, then an output layer of 10 neurons (one neuron stands for one digit class).

In the experiments, the BDNN of both structures were tested. Meanwhile, the conventional ANN of Structure A was also tested for comparison.
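To make the storage comparison concrete for these two structures, the following sketch counts the connection weights and the bytes needed at 1 bit and at 16 bits per weight. It assumes fully connected layers without bias terms; the function names are hypothetical.

```python
def connection_count(layers):
    """Number of weights in a fully connected network with the given layer sizes."""
    return sum(a * b for a, b in zip(layers, layers[1:]))


def storage_bytes(n_weights, bits_per_weight):
    """Bytes needed to store the weights, rounded up to whole bytes."""
    return (n_weights * bits_per_weight + 7) // 8


structures = {"A": [785, 1571, 10], "B": [785, 2355, 10]}
for name, layers in structures.items():
    # Every layer that feeds a BDNN neuron has an odd size, as required.
    assert all(size % 2 == 1 for size in layers[:-1])
    n = connection_count(layers)
    print(name, n, storage_bytes(n, 1), storage_bytes(n, 16))
```

At 1 bit per weight the stored size is one sixteenth of the 16-bit size, up to rounding to whole bytes.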

              BDNN             Normal ANN
              Train   Test     Train   Test
Struct. A     20.73   16.74    1.07    1.64
Struct. B     16.89   14.50    -       -
Table 2: Error rates (%) of the three-layer networks on MNIST.

The experimental results are shown in Table 2. On one hand, by using the proposed training method, the BDNN converged on the training dataset, and on the test dataset the error rate was 16.74% for Structure A and 14.50% for Structure B. This result demonstrates the feasibility of BDNN. On the other hand, it can be seen that the conventional ANN of Structure A has a much lower error rate than the BDNN of the same structure. This may be because the BDNN was not fully trained by the proposed method (borrowed from conventional network training). If a specific training method can be designed for BDNN, its performance may be improved considerably. Another possible reason may be the complexity of the network. Since binary neurons are used in BDNN, its complexity is lower than that of the conventional ANN, though they have the same structure. The result of the BDNN of Structure B indicates that, when the scale of BDNN is increased (higher complexity), the performance becomes better.

4.2 The convolutional neural network

In deep learning research, the structure of CNN is widely used for image classification. Therefore, in the experiments, we also tested the CNN structure. As shown in Fig. 6, we used a CNN structure of five layers. The first layer is the input data, followed by two layers of feature maps (calculated by the convolutional kernels). The last two layers are fully connected neurons, and one of them is used as the output layer. In most CNNs, after the convolutional operation, the feature map size is decreased by pooling. Nevertheless, since it is hard to realize the pooling operation on binary values, the feature map size is controlled by skipping every other pixel in the convolutional operation.
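Skipping every other pixel corresponds to a convolution with a stride of two. The sketch below applies one binary kernel to one square binary feature map and takes a majority vote at each position. The function name, the single input map, the absence of padding, and the sample sizes are assumptions for illustration; they are not the paper's layer sizes.

```python
def binary_conv2d(image, kernel, stride=2):
    """Unpadded convolution on -1/+1 values; each output is a majority vote."""
    k = len(kernel)
    if (k * k) % 2 == 0:
        raise ValueError("kernel must hold an odd number of weights")
    size = (len(image) - k) // stride + 1
    out = []
    for r in range(size):
        row = []
        for c in range(size):
            total = sum(
                image[r * stride + i][c * stride + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            row.append(1 if total > 0 else -1)
        out.append(row)
    return out


image = [[1] * 5 for _ in range(5)]      # hypothetical 5x5 binary input
kernel = [[1] * 3 for _ in range(3)]     # hypothetical 3x3 binary kernel
print(binary_conv2d(image, kernel))      # a 2x2 feature map
```

With several input maps, the number of weights feeding one output is the kernel area times the number of maps, and that product would need to be odd for the vote to be defined.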

With different sizes of the feature maps and kernels, three different CNN structures are obtained as follows.

  • Structure C: Layer 1 is the input image, the size is ; Layer 2 contains feature maps of size , obtained by using kernels of size ; Layer 3 contains feature maps of size , obtained by using kernels of size ; Layer 4 contains 201 neurons, obtained by using kernels of size ; Layer 5 contains 10 neurons for output.

  • Structure D: Layer 1 is the input image, the size is ; Layer 2 contains feature maps of size , obtained by using kernels of size ; Layer 3 contains feature maps of size , obtained by using kernels of size ; Layer 4 contains 201 neurons, obtained by using kernels of size ; Layer 5 contains 10 neurons for output.

  • Structure E: Layer 1 is the input image, the size is ; Layer 2 contains feature maps of size , obtained by using kernels of size ; Layer 3 contains feature maps of size , obtained by using kernels of size ; Layer 4 contains 201 neurons, obtained by using kernels of size ; Layer 5 contains 10 neurons for output.

The Structure C and D were used by BDNN and tested on MNIST. Structure E was used by the hybrid-BDNN and tested on CIFAR10 (non-binary data). Therefore, for Structure E of hybrid-BDNN, the input layer and the following kernels were real numbers (normal neural network part) and the rest of the network was the BDNN part.

First, the error rates of MNIST are shown in Table 3. Clearly, by using deeper structure (more layers), the performance of BDNN was improved. Meanwhile, the normal CNN of the same structure still had much better result. In Structure D, when more feature maps were used, the result of BDNN became better. This also shows the potential of BDNN of larger scales.

              BDNN             Normal CNN
              Train   Test     Train   Test
Struct. C     11.78   9.98     0.26    1.10
Struct. D     7.11    6.54     -       -
Table 3: Error rates (%) of the CNN structures on MNIST.

Second, the error rates on CIFAR10 are shown in Table 4. CIFAR10 is a database of color images of 10 different kinds of objects. In the experiment, the color images were converted into grayscale and then used. Since this database is very difficult, the error rates of both networks were very high, even though the largest network structure was used.

Another observation is that although the training error rate of hybrid-BDNN was much higher than that of the CNN, its test error rate was slightly lower. This indicates that (i) it is possible to use hybrid-BDNN for non-binary data classification and (ii) a larger network structure of high complexity is necessary for reducing the training error of hybrid-BDNN.

              Hybrid-BDNN      Normal CNN
              Train   Test     Train   Test
Struct. E     89.84   87.48    41.71   88.37
Table 4: Error rates (%) of the CNN structures on CIFAR10.

5 Conclusion and future work

In BDNN, we first tried a brand new binarized propagation function for neural networks. Moreover, a special gradient descent training method was also proposed for BDNN. The experimental results showed that BDNN can be used just like the conventional DNN. Besides changing the network structure and training strategy, our trial may provide a new way of thinking about improving the DNN and extending its usage.

Because of the computational limitation, we didn’t test the BDNN of large scale in this paper. Therefore, the performance of BDNN may not be comparable with the state-of-the-art results. In the future, we will develop GPU based training program to train the BDNN of larger scale. Various network structures will also be studied, such as the deep fully connected neural network, the recurrent neural network and so on. Moreover, the optimization of the forward propagation of the BDNN on CPU will be studied.

6 Acknowledgments

This work was started at the beginning of 2014 and, based on this work, a patent application was filed in China in November 2014. The application number of the patent is 201410647710.3. We also submitted this paper to CVPR 2015, but it was rejected. However, we still believe this work is valuable and that the BDNN is a promising solution for low-performance devices to use deep learning models.


References

  • [1] J. Benediktsson, P. H. Swain, and O. K. Ersoy. Neural network approaches versus statistical methods in classification of multisource remote sensing data. IEEE Transactions on Geoscience and Remote Sensing, 28(4):540–552, 1990.
  • [2] H. Bischof, W. Schneider, and A. J. Pinz. Multispectral classification of landsat-images using neural networks. IEEE Transactions on Geoscience and Remote Sensing, 30(3):482–490, 1992.
  • [3] D. Ciresan, U. Meier, J. Masci, and J. Schmidhuber. Multi-column deep neural network for traffic sign classification. Neural Networks, 32:333–338, 2012.
  • [4] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3642–3649, 2012.
  • [5] G. Giacinto and F. Roli. Design of effective neural network ensembles for image classification purposes. Image and Vision Computing, 19(9):699–707, 2001.
  • [6] I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet. Multi-digit number recognition from street view imagery using deep convolutional neural networks. arXiv preprint arXiv:1312.6082, 2013.
  • [7] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):855–868, 2009.
  • [8] A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649, 2013.
  • [9] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks, 18(5):602–610, 2005.
  • [10] A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. In Advances in Neural Information Processing Systems, pages 545–552, 2009.
  • [11] D. L. Gray and A. N. Michel. A training algorithm for binary feedforward neural networks. IEEE Transactions on Neural Networks, 3(2):176–194, 1992.
  • [12] D. Hebb. The organization of behavior. New York: Wiley, 1949.
  • [13] G. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
  • [14] G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
  • [15] J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8):2554–2558, 1982.
  • [16] J. J. Hopfield. Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the national academy of sciences, 81(10):3088–3092, 1984.
  • [17] J. J. Hopfield and D. W. Tank. Computing with neural circuits- a model. Science, 233(4764):625–633, 1986.
  • [18] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
  • [19] J. H. Kim and S. K. Park. The geometrical learning of binary neural networks. IEEE Transactions on Neural Networks, 6(1):237–247, 1995.
  • [20] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back. Face recognition: A convolutional neural-network approach. IEEE Transactions on Neural Networks, 8(1):98–113, 1997.
  • [21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [22] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5(4):115–133, 1943.
  • [23] M. Muselli. On sequential construction of binary neural networks. IEEE Transactions on Neural Networks, 6(3):678–690, 1995.
  • [24] A. Rajavelu, M. T. Musavi, and M. V. Shirvaikar. A neural network approach to character recognition. Neural Networks, 2(5):387–393, 1989.
  • [25] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, 1958.
  • [26] D. E. Rumelhart and J. McClelland. Parallel distributed processing: Explorations in the microstructure of cognition. MIT Press, 1986.
  • [27] Y. Takefuji and K. C. Lee. An artificial hysteresis binary neuron: A model suppressing the oscillatory behaviors of neural dynamics. Biological Cybernetics, 64(5):353–356, 1991.
  • [28] A. Waibel. Modular construction of time-delay neural networks for speech recognition. Neural computation, 1(1):39–46, 1989.
  • [29] L. Wan, M. Zeiler, S. Zhang, and Y. LeCun. Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1058–1066, 2013.
  • [30] P. Werbos. Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University, 1975.