Single neuron-based neural networks are as efficient as dense deep neural networks in binary and multi-class recognition problems

05/28/2019 ∙ by Yassin Khalifa, et al. ∙ IEEE

Recent advances in neuroscience have revealed many principles of neural processing. In particular, many biological systems were found to reconfigure or recruit single neurons to generate multiple kinds of decisions. Such findings have the potential to advance our understanding of the design and optimization of artificial neural networks. Previous work demonstrated that dense neural networks are needed to shape the complex decision surfaces required for AI-level recognition tasks. We investigate the ability to model high-dimensional recognition problems using networks of a single neuron or a few neurons, which are relatively easy to train. Using three datasets, we test a population of single-neuron networks on multi-class recognition tasks. Surprisingly, we find that sparse networks can be as efficient as dense networks in both binary and multi-class tasks. Moreover, single-neuron networks demonstrate superior performance in the binary classification scheme and competitive results when combined for multi-class recognition.


I Introduction

Over the last several years, machine learning techniques, particularly neuromorphic computing architectures, have been used to mimic the challenging pattern recognition abilities of biological systems [1]. In fact, it could be argued that neural-network-based systems learn feature hierarchies automatically and are therefore replacing many learning models that depend on hand-designed heuristics [1, 2, 3]. This is achieved through multiple levels of representation generated by successive non-linear modules, each of which transforms low-level representations into more abstract high-level representations [1, 4, 5, 6, 7]. Theoretical results have shown that deep architectures may be needed to efficiently model complex functions and represent the high levels of abstraction required for challenging recognition and AI tasks [8, 9, 5, 10].

Many questions come up when choosing an architecture to address a practical problem, including how many layers to use and how large each layer should be [11, 12]. Network capacity increases with both the number and the size of hidden layers because neurons collaborate to express complex functions and achieve better generalization [11]. However, high-capacity models may also fit dataset outliers and lead to overfitting. Although one can always use shallow networks to avoid overfitting, data complexity sometimes necessitates deep architectures [11, 13]. Several solutions have been suggested to avoid overfitting in deep networks, including dropout [14], regularization, and noisy inputs [11, 15, 13].

In practice, overfitting control is preferred over network size tuning, as smaller networks were found to be hard to train using local optimizers like gradient descent [13, 16, 17, 18]. Smaller networks have fewer local solutions and converge easily; however, most of these solutions are unreliable (high loss), especially when using batch optimization [13]. This also makes small networks vulnerable to large loss variance caused by random initialization [11]. It is hard to prove this mathematically because the loss function of neural networks and its shape are poorly understood [13]. On the other hand, simplifying the problem by dividing it into smaller problems or multiple recognition stages may help build efficient models that are cheap to train.

The response of individual neurons in the parts of the brain widely believed to be responsible for visual object recognition is hard to understand and to predict [19, 20, 21]. Nevertheless, it is known that these neurons are activated by a set of complex visual features and thus cannot be narrowly tuned as detectors for specific objects [19, 22, 23]. Therefore, single neurons may not act as sparse object detectors but rather as elements of a group that, as a whole, provides object recognition [19, 22]. However, they are often able to maintain their preferences among objects under changes such as size and shape, which is called neural tolerance [19, 20]. This suggests that single neurons together encode the object space and identity, including position, size, context, and other variables, to achieve a powerful representation and avoid binding this information at successive stages [19, 24, 25, 26]. By analogy with how single neurons in the brain behave in visual recognition tasks, artificial neural networks can adopt this as a design concept, building populations of single-neuron or small networks for object recognition.

The design concept introduced here has proven effective for many classifiers and configurations, including SVMs and neural networks [27, 28, 29, 30]. It is often successful for classification because it is easier to build classifiers that separate two classes than classifiers that separate multiple classes [28]. The approach is usually referred to as decomposition [28], binarization [28], or modularization [30] and has been addressed in different ways. The first decomposition strategy is called “one-vs-one” (OVO); it divides the problem into as many binary classification problems as there are unique pairs of classes [28, 29]. The second strategy is called “one-vs-all” (OVA); it trains a classifier to discriminate each class from all other classes [28, 29].
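To make the difference in problem count between the two strategies concrete, the following is a minimal sketch in Python. The function names `ovo_problems` and `ova_problems` are illustrative and do not come from the cited works.

```python
from itertools import combinations

def ovo_problems(classes):
    """One-vs-one: one binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))

def ova_problems(classes):
    """One-vs-all: one binary problem per class, against all remaining classes."""
    return [(c, [o for o in classes if o != c]) for c in classes]

classes = list(range(10))          # e.g. ten digit classes
print(len(ovo_problems(classes)))  # 45, i.e. 10 * 9 / 2 pairwise problems
print(len(ova_problems(classes)))  # 10, one problem per class
```

For K classes, OVO yields K(K-1)/2 classifiers while OVA yields K, which is one reason OVA keeps the population of binary networks small.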

Modularization in neural networks was introduced by [29], which first attempts to separate each class from all other classes and then separates the classes pairwise using subnetworks. Furthermore, [30] used K networks to reduce a K-class problem to a set of K two-class problems; however, that work focused on the back-propagation algorithm and on improving its convergence speed. In this work, we use an OVA approach as in [30] and focus on the size of the modular networks used for each binary sub-classification problem and on the resulting network performance.

This paper addresses the feasibility of a neuron-population-based design approach by answering three main questions: 1. How does model performance change with the size of the hidden layer in the network? 2. Can single-neuron or small networks achieve good performance on the binary sub-classification problems of high-dimensional multi-class data? 3. Can binary classification networks with a low number of hidden neurons (especially single-neuron networks) be used to perform multi-class recognition, and how efficient are such systems compared to dense multi-class systems? To answer these questions, we experimented with multiple architectures and different levels of abstraction in recognition tasks on three datasets.

II Methods

Fig. 1: Experimental setup overview. (A) Scatter view of random data drawn from a normal distribution (mean 0, standard deviation 1). (B) X-Z projection of the data. (C) X-Y projection of the data. (D) Y-Z projection of the data. (E) Sample of the MNIST dataset. (F) Sample of the CIFAR-10 dataset. (G) Classification accuracy of networks with different hidden layer sizes for MNIST and CIFAR-10 data. (H) Training loss on the MNIST dataset for different hidden layer sizes. (I) Training loss on the CIFAR-10 dataset for different hidden layer sizes.

II-A Datasets

To address the main objectives of this work experimentally, we used three datasets for all trials. The first dataset is composed of 3-dimensional data points drawn from a normal distribution with different values of mean and standard deviation. In particular, we used a single value for the mean (0) and three values for the standard deviation (0.1, 0.5, 1). Multiple numbers of data points were also used for each setting, which gives a total of nine sets (three sets of different lengths drawn from each distribution). Two categories were created inside each generated set by biasing the samples positively and negatively by the same amount. Fig. 1(a-d) shows a sample of the generated sets with a mean of 0 and a standard deviation of 1. The categories for this set were created by adding a bias value to half of the set and subtracting the same value from the other half.
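The construction of one such two-category set can be sketched as follows. This is an illustration under assumptions rather than the original generation script: the bias value, the set length, the seed, and the choice to shift all three coordinates are hypothetical, since the text does not state them.

```python
import random

def make_biased_set(n_points, std, bias, seed=0):
    """Draw 3-D points from N(0, std) and shift half up, half down by `bias`.

    `bias`, `seed`, and shifting every coordinate are illustrative assumptions.
    """
    rng = random.Random(seed)
    points, labels = [], []
    for i in range(n_points):
        label = 1 if i < n_points // 2 else 0   # first half positive, rest negative
        shift = bias if label == 1 else -bias
        points.append([rng.gauss(0.0, std) + shift for _ in range(3)])
        labels.append(label)
    return points, labels

# One set per standard deviation used in the text (0.1, 0.5, 1).
sets = {std: make_biased_set(n_points=1000, std=std, bias=0.5) for std in (0.1, 0.5, 1.0)}
```

The smaller the bias is relative to the standard deviation, the more the two categories overlap and the harder the binary task becomes.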

The second dataset is MNIST [2], which is widely used for benchmarking handwritten digit recognition. MNIST consists of two sets of images: a training set of 60,000 examples and a test set of 10,000 examples. All digits are size-normalized and centered in grayscale images (Fig. 1(e)). The MNIST dataset does not require formatting or preprocessing, which makes it well suited for testing learning techniques and new recognition algorithms [2]. The third dataset is CIFAR-10 [31], which consists of 60,000 color images, each of 32×32 pixels, in 10 classes (Fig. 1(f)). The dataset is divided into two sets, one for training (50,000 images) and another for testing (10,000 images).

II-B CNN Design

The main CNN architecture used for all multi-class recognition experiments in this analysis included two convolutional layers, two max pooling layers (one after each convolutional layer), a single hidden dense layer (ReLU activated), and an output layer with a number of units equal to the number of classes in the target dataset. The first convolutional layer applies 32 filters with ReLU activation, followed by max pooling with a stride of 2 (non-overlapping pooled regions). The second convolutional layer applies 64 filters with ReLU activation, followed by max pooling with a stride of 2. For the hidden layer, we varied the number of units across the different experiments performed. All experiments developed for this analysis were implemented using TensorFlow™ and MATLAB, and tested on an NVIDIA Tesla M40 GPU.
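To give a feel for how the hidden layer size drives the number of trainable parameters in this architecture, here is a rough sketch. The kernel size (5), the "same" padding, and the pooling window (2) are assumptions made only for illustration, since the text does not state them; the filter counts (32 and 64) follow the description above.

```python
def conv_params(in_channels, out_channels, kernel):
    """Weights plus one bias per output filter."""
    return (kernel * kernel * in_channels + 1) * out_channels

def cnn_param_count(side, channels, hidden_units, n_classes, kernel=5, pool=2):
    """Approximate parameter count of the two-conv, one-hidden-layer CNN.

    Assumes 'same' padding (convolutions keep the spatial size) and
    non-overlapping pooling, so each pooling step divides the side by `pool`.
    `kernel` and `pool` defaults are hypothetical values.
    """
    side_after = side // pool // pool          # two pooling stages
    flat = side_after * side_after * 64        # flattened conv output
    total = conv_params(channels, 32, kernel)  # first conv layer
    total += conv_params(32, 64, kernel)       # second conv layer
    total += (flat + 1) * hidden_units         # hidden dense layer
    total += (hidden_units + 1) * n_classes    # output layer
    return total

# 28x28 grayscale input (MNIST), 10 classes, for a few hidden layer sizes.
for h in (1, 16, 128):
    print(h, cnn_param_count(side=28, channels=1, hidden_units=h, n_classes=10))
```

Under these assumptions the dense layer dominates the parameter count as soon as the hidden layer grows beyond a few dozen units, which is why shrinking it is the main lever on model size.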

II-C Effect of Layer Size

The effect of the number of hidden neurons in different network architectures is assessed here by running multiple trials over the three datasets described above. First, we trained three feedforward networks for a simple binary classification task over all of the randomly generated sets. Each set was split into 80% for training and 20% for testing. All three networks have one hidden layer: the first with a single neuron, the second with 10 neurons, and the third with 100 neurons. The results were averaged over multiple iterations (100 to 1000) to reduce the effect of randomness in the experiment [32].
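The accuracy, sensitivity, and specificity values reported in Table I, and their averaging over repeated trials, can be computed along the following lines. This is a minimal sketch using the standard confusion-matrix definitions; the function names are illustrative and the original evaluation code is not available.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    return accuracy, sensitivity, specificity

def average_metrics(runs):
    """Average a list of (accuracy, sensitivity, specificity) tuples."""
    n = len(runs)
    return tuple(sum(run[i] for run in runs) / n for i in range(3))

# Two toy trials standing in for the 100 to 1000 repetitions.
runs = [binary_metrics([1, 1, 0, 0], [1, 0, 0, 0]),
        binary_metrics([1, 1, 0, 0], [1, 1, 0, 0])]
print(average_metrics(runs))
```

Averaging over many random initializations and splits smooths out the loss variance that small networks are prone to.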

Second, to see the effect of the number of hidden neurons on more complex recognition problems, we trained one convolutional neural network (CNN) for MNIST handwritten digit recognition and another for CIFAR-10 content recognition. Both networks have two convolutional layers, two pooling layers (one after each convolutional layer), and a single hidden dense layer (ReLU activated). The first convolutional layer applies 32 filters with ReLU activation, followed by max pooling with a stride of 2 (non-overlapping pooled regions). The second convolutional layer applies 64 filters with ReLU activation, followed by max pooling with a stride of 2. For the hidden layer, we used a range of sizes in order to examine the effect of layer size on the recognition task in both datasets.

II-D Effect of Tearing down the Recognition Problem

As mentioned before, dense and deep networks may be needed to model complex recognition problems. However, this comes at a cost: overfitting and expensive training. Here, we therefore test whether multi-class recognition problems can be torn down into a series of binary classifiers. The MNIST and CIFAR-10 datasets were also used for this experiment, with each of them converted into 10 different datasets from the point of view of the labels. In other words, the labels of each dataset were altered 10 times so as to classify only one of the 10 categories against all the others. For the MNIST dataset, this gives 10 classification problems: the first classifies the digit 0 against all other digits, the second classifies the digit 1 against all other digits, and so on. The same is done for the CIFAR-10 dataset using its own categories. All generated datasets were balanced in terms of classes before being used in training.
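The relabeling and balancing steps can be sketched as follows. The text does not say how the classes were balanced, so random undersampling of the larger class is used here purely as an assumption; the function names are illustrative.

```python
import random

def one_vs_all_labels(labels, target):
    """Relabel a multi-class label list: 1 for `target`, 0 for every other class."""
    return [1 if y == target else 0 for y in labels]

def balanced_indices(binary_labels, seed=0):
    """Pick equally many positive and negative samples.

    Undersampling the majority class is an assumed balancing method,
    not one confirmed by the text.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(binary_labels) if y == 1]
    neg = [i for i, y in enumerate(binary_labels) if y == 0]
    n = min(len(pos), len(neg))
    return sorted(rng.sample(pos, n) + rng.sample(neg, n))

labels = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] * 3      # toy stand-in for digit labels
# One binary dataset per class, ten in total.
binary_sets = {k: one_vs_all_labels(labels, k) for k in range(10)}
keep = balanced_indices(binary_sets[3])           # indices used to train "three against all"
```

Without balancing, each one-against-all problem on a 10-class dataset would contain roughly nine negatives per positive, and a network could score about 90% accuracy by always answering "negative".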

We used the same convolutional network described above and again tried different hidden layer sizes. Ten networks were trained for each dataset, and the results were compared with small, full classification networks tested on the same datasets. To use these parallel networks for 10-class recognition, the input image is fed to each of the 10 networks and the resulting class is given by the network that produces a positive response, as in Fig. 2. Because the 10 networks are independent, the output may be redundant (i.e., an image may be identified by more than one of the 10 networks). This can be resolved by priority encoding the results (taking the first positive result and ignoring the remaining networks) or by counting the case as a misclassification. For this study, we counted redundancy as misclassification, which we found more convenient for comparing performance between different architectures.
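The two ways of resolving the outputs of the parallel networks can be written down as a small decision rule. This is a minimal sketch in which `positives` is a hypothetical list holding one boolean response per binary network.

```python
def combine_outputs(positives, priority=False):
    """Map the responses of the per-class binary networks to one class index.

    positives: list of booleans, entry k is True if network k fired.
    priority:  if True, take the first positive response (priority encoding);
               if False, treat several positives as a misclassification.
    Returns the class index, or None when no usable decision exists.
    """
    hits = [k for k, fired in enumerate(positives) if fired]
    if not hits:
        return None                 # no network recognized the input
    if len(hits) > 1 and not priority:
        return None                 # redundancy counted as misclassification
    return hits[0]

single = [False] * 10
single[3] = True
double = list(single)
double[8] = True

print(combine_outputs(single))                 # 3
print(combine_outputs(double))                 # None
print(combine_outputs(double, priority=True))  # 3
```

The strict rule gives a lower bound on accuracy, since priority encoding would recover some of the redundant cases whenever the first positive network happens to be the correct one.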

Fig. 2: Multi-class recognition using binary classification networks.

Network Size    Single Neuron Net          10-Neurons Net             100-Neurons Net
Sample Size

Accuracy        0.693   0.727   0.703      0.771   0.784   0.782      0.753   0.778   0.782
Sensitivity     0.695   0.727   0.704      0.775   0.783   0.73       0.758   0.779   0.781
Specificity     0.691   0.728   0.702      0.768   0.786   0.782      0.750   0.777   0.782

Accuracy        0.706   0.698   0.700      0.772   0.781   0.782      0.760   0.778   0.782
Sensitivity     0.694   0.698   0.701      0.760   0.782   0.783      0.756   0.778   0.781
Specificity     0.720   0.698   0.700      0.786   0.782   0.780      0.765   0.778   0.782

Accuracy        0.701   0.6936  0.700      0.776   0.780   0.781      0.756   0.779   0.781
Sensitivity     0.699   0.693   0.700      0.775   0.785   0.781      0.767   0.777   0.781
Specificity     0.703   0.693   0.699      0.778   0.775   0.781      0.747   0.780   0.782

TABLE I: Performance comparison between single, 10, and 100 neurons networks over randomly generated data.

III Results

Changing the hidden layer size of a neural network affects its performance, but the question is whether the change in performance is worth the cost. Testing networks of different hidden layer sizes on the randomly generated data gave the results shown in Table I. On the other hand, testing over a wider range of hidden layer sizes on complex multi-class data showed that increasing the number of hidden neurons is no longer effective after a certain point. The accuracy of the 10-class recognition task for both MNIST and CIFAR-10 remains nearly constant beyond a hidden layer size of 128 neurons, as shown in Fig. 1(g). This size will probably differ from one dataset to another and even from one recognition task to another; however, it shows that the same performance can be achieved with much smaller networks. No changes in the total loss pattern were observed beyond this layer size either, as shown in Fig. 1(h) and (i). In addition, small networks were found to suffer from high loss despite faster convergence, which can be clearly seen in the CIFAR-10 total loss in Fig. 1(i).

Dataset    MNIST / Single Neuron Net        CIFAR-10 / 128 Neurons Net
           Problem              Accuracy    Problem                   Accuracy
           Zero against all     0.982       Airplane against all      0.959
           One against all      0.985       Automobile against all    0.976
           Two against all      0.978       Bird against all          0.952
           Three against all    0.960       Cat against all           0.948
           Four against all     0.997       Deer against all          0.955
           Five against all     0.985       Dog against all           0.944
           Six against all      0.985       Frog against all          0.960
           Seven against all    0.975       Horse against all         0.963
           Eight against all    0.956       Ship against all          0.977
           Nine against all     0.932       Truck against all         0.961

TABLE II: Performance of tearing down multi-class problems into binary classification with the use of small sized hidden layers.

Since over-reducing the network size alone leads to poor performance in multi-class recognition problems, as Fig. 1(g) shows, simplifying the recognition problem might be the solution. Given the performance of single-neuron networks for binary classification in Table I, we test whether multi-class recognition problems can be torn down into a set of binary classifiers using small networks, as in Fig. 2. Table II shows a sample of the results of tearing down the MNIST and CIFAR-10 datasets into 10 different binary classification problems. Using a small hidden layer on a binary problem appears to give better results than the best performance achieved using dense layers on the multi-class recognition problem over all classes. For the MNIST problem, good results were achieved using a single neuron in the hidden layer. On the other hand, CIFAR-10 classification accuracy did not exceed 80% until 64 neurons were used in the hidden layer.

The 10 networks trained on the binary classification problems for MNIST were combined to form a 10-class recognizer, as in Fig. 2. The new system correctly identified 84.21% of the test data, produced multiple classification results for 12.79% of the test data (with a maximum of two positive results per image), and misclassified the remaining 3%. For the CIFAR-10 dataset, achieving results as good as those of the MNIST experiment required 10 networks with at least 128 neurons per hidden layer each, which is more complex and less efficient than using a full classification network with the same number of neurons in the hidden layer.

IV Discussion

The experimental results of this work showed that neural networks with dense hidden layers might not be of great help in achieving the desired modeling of a recognition or classification task. For a binary classification task, increasing the hidden layer size did not add much to any aspect of system performance, as shown in Table I. Even in complex multi-class recognition tasks such as digit and object identification, performance becomes nearly constant beyond a certain layer size. This characteristic layer size will probably depend on the level of abstraction of the problem being assessed; however, we can clearly see that good performance can be achieved using fairly sparse networks.

Acceptable performance can easily be achieved in simple classification tasks using small networks and becomes harder to achieve in high-dimensional tasks. This can be seen in the differences between training networks for the binary classification task and for the multi-class tasks (Table I and Fig. 1(g)). The classification accuracy is nearly stable at low network density for the binary random data but needs more hidden neurons to reach the same stable performance for the MNIST and CIFAR-10 tasks. This suggested tearing down multi-class recognition tasks into multiple binary classification tasks, for which fast convergence, simpler architectures, and acceptable performance can be reached more easily.

Using the binary classification scheme for both the MNIST and CIFAR-10 datasets gave superior performance, even when using a single-neuron hidden layer. Compared to the highest accuracy achieved for 10-class recognition, the binary scheme achieved higher classification accuracy for all components. This supports our claim that populations of binary systems can represent higher-dimensional datasets with better performance and cheaper training.

The high accuracy achieved by the binary classification networks motivated building multi-class recognition on top of these networks. Parallel binary networks with sparse hidden layers, one per desired class, were used to build multi-class classifiers with a higher accuracy than a single multi-class network with the same number of hidden neurons used for the same task. A single network with 16 neurons in the hidden layer reached a classification accuracy of 82% (Fig. 1(g)) on the MNIST dataset, while 10 single-neuron binary networks achieved around 84% accuracy. The low rate of contradicting results among the 10 combined networks comes from the fact that the accuracy of each individual network is so high that there is very little chance of an image being identified by more than one network.

The experiments performed in this study showed that robust results can be achieved using a small number of hidden neurons, and these results were confirmed by multiple trials on different datasets. The experiments also showed that using neuron populations in artificial object recognition can achieve a performance pattern similar to that anticipated from the biological models. However, it must be taken into consideration that we tested the design concept for multi-object recognition and not for attributes of the same object such as shape, size, color, and rotation. This may indicate that increasing the level of representation favors recognition performance. In other words, adding a neuron to the population to tolerate more visual changes of objects may help achieve better recognition performance between objects.

To conclude, we assessed the effect of reducing the hidden layer size in neural networks on the performance of different recognition tasks, including binary random data and 10-class recognition problems. The results showed that high performance can be achieved using networks that are fairly small in terms of the number of hidden neurons. Moreover, we assessed the use of a population of small binary networks in building multi-class recognition systems, and it showed superior results compared to multi-class systems with the same hidden layer size. There is more to build on these preliminary findings; therefore, we intend to test more visual aspects of object recognition with neuron populations for better generalization of the proposed design concept.

References

  • [1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
  • [2] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov 1998.
  • [3] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, Aug 2013.
  • [4] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, Y. W. Teh and M. Titterington, Eds., vol. 9.   Chia Laguna Resort, Sardinia, Italy: PMLR, 13–15 May 2010, pp. 249–256.
  • [5] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the 25th International Conference on Machine Learning, ser. ICML ’08.   New York, NY, USA: ACM, 2008, pp. 1096–1103.
  • [6] M. Chen, W. Dai, S. Y. Sun, D. Jonasch, C. Y. He, M. F. Schmid, W. Chiu, and S. J. Ludtke, “Convolutional neural networks for automated annotation of cellular cryo-electron tomograms,” Nature Methods, vol. 14, no. 10, p. 983, 2017.
  • [7] J. Ma, M. K. Yu, S. Fong, K. Ono, E. Sage, B. Demchak, R. Sharan, and T. Ideker, “Using deep learning to model the hierarchical structure and function of a cell,” Nature Methods, vol. 15, no. 4, p. 290, 2018.
  • [8] Y. Bengio and Y. Lecun, Scaling learning algorithms towards AI.   MIT Press, 2007.
  • [9] Y. Bengio, “Learning deep architectures for AI,” Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1–127, Jan. 2009.
  • [10] B. Zhu, J. Z. Liu, S. F. Cauley, B. R. Rosen, and M. S. Rosen, “Image reconstruction by domain-transform manifold learning,” Nature, vol. 555, no. 7697, p. 487, 2018.
  • [11] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning.   MIT Press, 2016.
  • [12] J. Ba and R. Caruana, “Do deep nets really need to be deep?” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds.   Curran Associates, Inc., 2014, pp. 2654–2662.
  • [13] A. Choromanska, M. Henaff, M. Mathieu, G. Arous, and Y. LeCun, “The loss surfaces of multilayer networks,” Journal of Machine Learning Research, vol. 38, pp. 192–204, 2015.
  • [14] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [15] J. V. Gorp, J. Schoukens, and R. Pintelon, “Learning neural networks with noisy inputs using the errors-in-variables approach,” IEEE Transactions on Neural Networks, vol. 11, no. 2, pp. 402–414, Mar 2000.
  • [16] Y. LeCun, L. Bottou, G. B. Orr, and K. R. Müller, Efficient BackProp.   Berlin, Heidelberg: Springer Berlin Heidelberg, 1998, pp. 9–50.
  • [17] A. M. Saxe, J. L. McClelland, and S. Ganguli, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks,” arXiv preprint arXiv:1312.6120, 2013.
  • [18] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization,” in Advances in neural information processing systems, 2014, pp. 2933–2941.
  • [19] J. DiCarlo, D. Zoccolan, and N. Rust, “How does the brain solve visual object recognition?” Neuron, vol. 73, no. 3, pp. 415 – 434, 2012.
  • [20] S. L. Brincat and C. E. Connor, “Underlying principles of visual shape selectivity in posterior inferotemporal cortex,” Nature neuroscience, vol. 7, no. 8, p. 880, 2004.
  • [21] Y. Yamane, E. T. Carlson, K. C. Bowman, Z. Wang, and C. E. Connor, “A neural code for three-dimensional object shape in macaque inferotemporal cortex,” Nature neuroscience, vol. 11, no. 11, p. 1352, 2008.
  • [22] N. C. Rust and J. J. DiCarlo, “Selectivity and tolerance (“invariance”) both increase as visual information propagates from cortical area V4 to IT,” Journal of Neuroscience, vol. 30, no. 39, pp. 12978–12995, 2010.
  • [23] R. Desimone, T. Albright, C. Gross, and C. Bruce, “Stimulus-selective properties of inferior temporal neurons in the macaque,” Journal of Neuroscience, vol. 4, no. 8, pp. 2051–2062, 1984.
  • [24] J. J. DiCarlo and D. D. Cox, “Untangling invariant object recognition,” Trends in cognitive sciences, vol. 11, no. 8, pp. 333–341, 2007.
  • [25] S. Edelman, Representation and recognition in vision.   MIT press, 1999.
  • [26] C. Koch and I. Segev, “The role of single neurons in information processing,” Nature neuroscience, vol. 3, no. 11s, p. 1171, 2000.
  • [27] A. C. Lorena, A. C. P. L. F. de Carvalho, and J. M. P. Gama, “A review on the combination of binary classifiers in multiclass problems,” Artificial Intelligence Review, vol. 30, no. 1, p. 19, Aug 2009.
  • [28] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, and F. Herrera, “An overview of ensemble methods for binary classifiers in multi-class problems: Experimental study on one-vs-one and one-vs-all schemes,” Pattern Recognition, vol. 44, no. 8, pp. 1761 – 1776, 2011.
  • [29] S. Knerr, L. Personnaz, and G. Dreyfus, “Single-layer learning revisited: A stepwise procedure for building and training a neural network,” Neurocomputing: Algorithms, Architectures and applications. NATO ASI Series, vol. F68, pp. 41–50, 1990.
  • [30] R. Anand, K. Mehrotra, C. K. Mohan, and S. Ranka, “Efficient classification for multiclass problems using modular neural networks,” IEEE Transactions on Neural Networks, vol. 6, no. 1, pp. 117–124, Jan 1995.
  • [31] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Tech. Rep., 2009.
  • [32] P. P. Boyle, “Options: A monte carlo approach,” Journal of Financial Economics, vol. 4, no. 3, pp. 323 – 338, 1977.