I Introduction
Over the last several years, machine learning techniques, particularly neuromorphic computing architectures, have been used to mimic the challenging pattern recognition abilities of biological systems [1]. In fact, it could be argued that neural-network-based systems learn feature hierarchies automatically, and are therefore replacing many learning models that depend on hand-designed heuristics [1, 2, 3]. This is achieved through multiple levels of representation generated by successive nonlinear modules, each of which transforms low-level representations into more abstract high-level representations [1, 4, 5, 6, 7]. Theoretical results have shown that deep architectures may be needed to efficiently model complex functions and represent the high levels of abstraction required for challenging recognition and AI tasks [8, 9, 5, 10].
Many questions come up when choosing an architecture to address a practical problem, including how many layers to use and how large each layer should be [11, 12]. Network capacity increases with both the number and the size of hidden layers, because neurons collaborate to express complex functions and achieve better generalization [11]. However, high-capacity models may also fit outliers in the dataset, which leads to overfitting. Although one can always use shallow networks to avoid overfitting, data complexity sometimes necessitates deep architectures [11, 13]. Multiple solutions have been suggested to avoid overfitting in deep networks, including dropout [14], regularization, and noisy inputs [11, 15, 13].
In practice, overfitting control is preferred over network size tuning, as it was found that smaller networks are hard to train using local optimizers like gradient descent [13, 16, 17, 18]. Smaller networks have fewer local solutions and converge easily; however, most of these solutions are unreliable (high loss), especially when using batch optimization [13]. This also makes small-sized networks vulnerable to large loss variance across random initializations [11]. It is hard to prove this mathematically because the loss function of neural networks and its shape are poorly understood [13]. On the other hand, simplifying the problem by dividing it into smaller problems or multiple recognition stages may help build efficient, yet cheap-to-train, models.
Understanding the response of individual neurons in the parts of the brain that are widely believed to be responsible for visual object recognition is quite challenging, and these responses are hard to predict [19, 20, 21]. Nevertheless, it is known that these neurons are activated by a set of complex visual features and thus cannot be narrowly tuned as detectors for specific objects [19, 22, 23]. Therefore, single neurons may not act as sparse object detectors, but rather as elements of a group that, as a whole, provides object recognition [19, 22]. However, they are often able to maintain their object preferences across changes such as size and shape, a property called neural tolerance [19, 20]. This suggests that single neurons together decode the object space and identity, including position, size, context, and other variables, to achieve a powerful representation and avoid binding this information at successive stages [19, 24, 25, 26]. By analogy with how single neurons in the brain behave in visual recognition tasks, artificial neural networks can adopt this as a design concept, developing populations of single-neuron or small-sized networks for object recognition.
The design concept introduced here has proven effective for many classifiers and configurations, including SVMs and neural networks [27, 28, 29, 30]. The reason it is often successful for classification is that it is easier to build classifiers that separate two classes than classifiers that separate multiple classes [28]. The approach is usually referred to as decomposition [28] or modularization [30] and has been addressed in different ways. The first decomposition strategy is called “one-vs-one” (OVO); it divides the problem into as many binary classification problems as there are unique pairs of classes [28, 29]. The second strategy is called “one-vs-all” (OVA); it trains a classifier to discriminate each class from all other classes [28, 29].
Modularization in neural networks was introduced by [29], who attempted to separate each class from all other classes and then to separate the classes pairwise using subnetworks. Furthermore, [30] used K networks to reduce a K-class problem into a set of K two-class problems; however, they focused more on the backpropagation algorithm and on enhancing its convergence speed. In this work, we used an OVA approach as in [30], and we focused on the size of the modular networks used for each binary subclassification problem and on the performance of the network.
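To make the difference between the two decomposition strategies concrete, the following minimal sketch enumerates the binary subproblems each strategy produces for a K-class task. The function names are illustrative and do not come from the cited works.

```python
from itertools import combinations


def ovo_problems(classes):
    """One-vs-one: one binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))


def ova_problems(classes):
    """One-vs-all: one binary problem per class (that class vs. the rest)."""
    return [(c, [other for other in classes if other != c]) for c in classes]


if __name__ == "__main__":
    digits = list(range(10))
    # OVO grows as K * (K - 1) / 2, OVA grows as K.
    print("OVO problems:", len(ovo_problems(digits)))
    print("OVA problems:", len(ova_problems(digits)))
```

For a 10-class task such as digit recognition, OVO requires one classifier per pair of classes, while OVA requires only one classifier per class, which is the setting studied in this work.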
This paper addresses the feasibility of a neuron-population-based design approach by answering three main questions: 1. How does model performance change with the size of the hidden layer in the network? 2. Can single-neuron or small-sized networks achieve good performance on the binary subclassification problems of high-dimensional multiclass data? 3. Can binary classification networks with a low number of hidden neurons (especially single-neuron networks) be utilized to perform multiclass recognition, and how efficient are these systems compared to dense multiclass systems? To answer these questions, we experimented with multiple architectures and different levels of abstraction in recognition tasks on three datasets.
II Methods
II-A Datasets
For the experiments in this work, we used three main datasets for all trials. The first dataset is composed of 3-dimensional data points drawn from a normal distribution with different values of mean and standard deviation. In particular, we used a single value for the mean (0) and 3 values for the standard deviation (0.1, 0.5, 1). Multiple numbers of data points were also used for each set, which gives a total of nine sets (3 sets of different lengths drawn from each distribution). Two categories were created inside each generated set by positively and negatively biasing samples of the data by the same amount. Fig. 1(a-d) shows a sample of the generated sets with a mean of 0 and a standard deviation of 1. The categories for this set were created by adding a bias value to half of the set and subtracting the same value from the other half.
The second dataset is the well-known MNIST [2], widely used for benchmarking handwritten digit recognition. MNIST consists of two sets of images: a training set of 60,000 examples and a test set of 10,000 examples. All digits are size-normalized and centered in grayscale images (Fig. 1(e)). The MNIST dataset does not require formatting or preprocessing, which makes it well suited for testing learning techniques and new recognition algorithms [2]. The third dataset is CIFAR-10 [31], which consists of 60,000 32×32 color images in 10 classes (Fig. 1(f)). The dataset is divided into two sets, one for training (50,000 images) and another for testing (10,000 images).
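The construction of the synthetic sets can be sketched as follows. This is only an illustration under stated assumptions: the bias value and the set lengths are not given in the text, so `bias` and `n_points` are hypothetical parameters, and the bias is assumed to be applied to every coordinate.

```python
import random


def make_biased_set(n_points, std, bias, mean=0.0, dims=3, seed=0):
    """Draw n_points Gaussian points and split them into two biased classes.

    Assumptions (not stated in the paper): the same bias is added to every
    coordinate, and the first half of the set forms the positive class.
    """
    rng = random.Random(seed)
    points, labels = [], []
    for i in range(n_points):
        sign = 1.0 if i < n_points // 2 else -1.0
        point = [rng.gauss(mean, std) + sign * bias for _ in range(dims)]
        points.append(point)
        labels.append(1 if sign > 0 else 0)
    return points, labels


if __name__ == "__main__":
    # Hypothetical values: 1000 points, standard deviation 1, bias 1.
    points, labels = make_biased_set(n_points=1000, std=1.0, bias=1.0)
    print(len(points), sum(labels))
```

A larger standard deviation relative to the bias makes the two categories overlap more, which is what makes the binary task non-trivial.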
II-B CNN Design
The main CNN architecture used for all multiclass recognition experiments in this analysis included two convolutional layers, two max pooling layers (one after each convolutional layer), a single hidden dense layer (ReLU activated), and an output layer with a number of units equal to the number of classes in the target dataset. The first convolutional layer applies 32 filters with ReLU activation, followed by max pooling with a stride of 2 (non-overlapping pooled regions). The second convolutional layer applies 64 filters with ReLU activation, followed by max pooling with a stride of 2. For the hidden layer, we used different numbers of units in the different experiments performed. All experiments developed for this analysis were implemented using TensorFlow™ and MATLAB, and tested on an NVIDIA Tesla M40 GPU.
II-C Effect of Layer Size
The number of hidden neurons used in different network architectures is assessed here by running multiple trials over the 3 datasets described above. First, we trained three feedforward networks for a simple binary classification task over all of the randomly generated sets. Each set was divided into 80% for training and 20% for testing. The three networks each have one hidden layer: the first with a single neuron, the second with 10 neurons, and the third with 100 neurons. Results were averaged over multiple iterations (100 to 1000) to reduce the effect of randomness in the experiment [32].
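The binary networks are compared in terms of accuracy, sensitivity, and specificity. The sketch below computes these three quantities from true and predicted labels using their standard confusion-matrix definitions; it is assumed, not confirmed by the text, that the reported values follow these definitions.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    return accuracy, sensitivity, specificity


if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 0, 1, 1]
    print(binary_metrics(y_true, y_pred))
```

In a repeated-trial setting, each metric would be averaged over the trials before being reported.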
Second, to see the effect of the number of hidden neurons on more complex recognition problems, we trained one convolutional neural network (CNN) for MNIST handwritten digit recognition and another for CIFAR-10 content recognition. Both networks have two convolutional layers, two pooling layers (one after each convolutional layer), and a single hidden dense layer (ReLU activated). The first convolutional layer applies 32 filters with ReLU activation, followed by max pooling with a stride of 2 (non-overlapping pooled regions). The second convolutional layer applies 64 filters with ReLU activation, followed by max pooling with a stride of 2. For the hidden layer, we varied the number of units in order to examine the effect of layer size on the recognition task in both datasets.
II-D Effect of Tearing down the Recognition Problem
As mentioned before, dense and deep networks may be needed to model complex recognition problems. However, this comes at a cost: overfitting and expensive training. Here, we therefore test the ability to tear down multiclass recognition problems into a series of binary classifiers. The MNIST and CIFAR-10 datasets were also used for this experiment, but each was converted into 10 different datasets from the labels' point of view. In other words, the labels of each dataset were altered 10 times to classify only one of the 10 categories against all the others. For the MNIST dataset we have 10 classification problems: the first is to classify the digit 0 against all other digits, the second is to classify the digit 1 against all other digits, and so on. The same is done for the CIFAR-10 dataset, using its own categories. All generated datasets were balanced in terms of classes before being used in training.
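The relabeling and balancing steps can be illustrated with the short sketch below. The text does not say how the binary datasets were balanced; random undersampling of the larger class is assumed here purely for illustration, and both function names are hypothetical.

```python
import random


def one_vs_all_labels(labels, target):
    """Relabel a multiclass label list: 1 for the target class, 0 otherwise."""
    return [1 if y == target else 0 for y in labels]


def balanced_indices(binary_labels, seed=0):
    """Pick an equal number of positive and negative sample indices.

    Assumption: balancing is done by randomly undersampling the larger
    class; the paper does not specify the balancing method.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(binary_labels) if y == 1]
    neg = [i for i, y in enumerate(binary_labels) if y == 0]
    n = min(len(pos), len(neg))
    keep = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(keep)
    return keep


if __name__ == "__main__":
    labels = [i % 10 for i in range(100)]  # toy 10-class label list
    binary = one_vs_all_labels(labels, target=3)
    keep = balanced_indices(binary)
    print(len(keep), sum(binary[i] for i in keep))
```

Repeating this for each of the 10 target classes yields the 10 binary datasets per benchmark.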
We used the same convolutional network described above and again tried different sizes for the hidden layer. Ten networks were trained for each dataset, and the results were compared with small-sized full classification networks tested on the same datasets. To use these parallel networks for 10-class recognition, the input image is presented to each of the 10 networks, and the resulting class is given by the network that produces a positive response, as in Fig. 2. Because the 10 networks are independent, there might be redundancy in the output (i.e., an image identified by more than one of the 10 networks). This can be resolved by priority encoding the results (taking the first positive result and ignoring the remaining networks) or by treating it as a misclassification. For this study, we treated redundancy as misclassification, which we saw as a more convenient basis for comparing the performance of different architectures.
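The two ways of resolving the outputs of the parallel binary networks can be sketched as follows. Each network is assumed to emit a boolean positive/negative decision; the function names are illustrative.

```python
def combine_strict(binary_outputs):
    """Return the class index only if exactly one network responds positively.

    Zero or multiple positive responses yield None, which is counted as a
    misclassification (the rule adopted in this study).
    """
    positives = [k for k, out in enumerate(binary_outputs) if out]
    return positives[0] if len(positives) == 1 else None


def combine_priority(binary_outputs):
    """Priority encoding: return the first positive network, ignore the rest."""
    for k, out in enumerate(binary_outputs):
        if out:
            return k
    return None  # no network responded positively


if __name__ == "__main__":
    outputs = [False] * 10
    outputs[3] = True
    outputs[7] = True  # redundant positive response
    print(combine_strict(outputs), combine_priority(outputs))
```

The strict rule gives a more conservative accuracy estimate, because any image claimed by more than one network is counted against the system.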
Table I
Network Size  Single Neuron Net      10 Neurons Net         100 Neurons Net
Sample Size
Accuracy      0.693  0.727   0.703   0.771  0.784  0.782    0.753  0.778  0.782
Sensitivity   0.695  0.727   0.704   0.775  0.783  0.73     0.758  0.779  0.781
Specificity   0.691  0.728   0.702   0.768  0.786  0.782    0.750  0.777  0.782

Accuracy      0.706  0.698   0.700   0.772  0.781  0.782    0.760  0.778  0.782
Sensitivity   0.694  0.698   0.701   0.760  0.782  0.783    0.756  0.778  0.781
Specificity   0.720  0.698   0.700   0.786  0.782  0.780    0.765  0.778  0.782

Accuracy      0.701  0.6936  0.700   0.776  0.780  0.781    0.756  0.779  0.781
Sensitivity   0.699  0.693   0.700   0.775  0.785  0.781    0.767  0.777  0.781
Specificity   0.703  0.693   0.699   0.778  0.775  0.781    0.747  0.780  0.782
III Results
Changing the hidden layer size of a neural network affects its performance, but the question is whether the change in performance is worth the cost. Testing networks with different hidden layer sizes on the randomly generated data gave the results shown in Table I. On the other hand, testing over a wider range of hidden layer sizes on complex multiclass data showed that increasing the number of hidden neurons has little effect beyond a certain point. The accuracy of the 10-class recognition task for both MNIST and CIFAR-10 remains nearly constant beyond a hidden layer size of 128 neurons, as shown in Fig. 1(g). This size will probably differ from one dataset to another, and even from one recognition task to another; however, it shows that the same performance can be achieved with much smaller networks. No changes in the total loss pattern were observed beyond this layer size either, as shown in Fig. 1(h) and (i). In addition, small-sized networks were found to suffer from high loss despite faster convergence, which can be clearly seen in the CIFAR-10 total loss in Fig. 1(i).
Table II
MNIST/Single Neuron Net  Accuracy  CIFAR-10/128 Neurons Net  Accuracy
Zero against all         0.982     Airplane against all      0.959
One against all          0.985     Automobile against all    0.976
Two against all          0.978     Bird against all          0.952
Three against all        0.960     Cat against all           0.948
Four against all         0.997     Deer against all          0.955
Five against all         0.985     Dog against all           0.944
Six against all          0.985     Frog against all          0.960
Seven against all        0.975     Horse against all         0.963
Eight against all        0.956     Ship against all          0.977
Nine against all         0.932     Truck against all         0.961
Since reducing network size alone leads to poor performance in multiclass recognition problems, as Fig. 1(g) shows, simplifying the recognition problem might be the solution. Given the performance of single-neuron networks for binary classification in Table I, we tested the ability to tear down multiclass recognition problems into a set of binary classifiers using small-sized networks, as in Fig. 2. Table II shows a sample of the results of tearing down the MNIST and CIFAR-10 datasets into 10 different binary classification problems. Using a small hidden layer on a binary problem appears to give better results than the best performance achieved using dense hidden layers on the multiclass recognition problem over all classes. For the MNIST problem, good results were achieved using a single neuron in the hidden layer. For CIFAR-10, on the other hand, classification accuracy did not exceed 80% until 64 neurons were used in the hidden layer.
The 10 networks trained on the binary classification problems for MNIST were combined to form a 10-class recognizer, as in Fig. 2. The new system correctly identified 84.21% of the test data and produced multiple classification results for 12.79% of the test data (with at most two positive results per image); the remaining 3% of the test data were misclassified. For the CIFAR-10 dataset, achieving results as good as those in the MNIST experiment required 10 networks with at least 128 neurons in the hidden layer of each, which is more complex and less efficient than using a full classification network with the same number of neurons in the hidden layer.
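The complexity argument for CIFAR-10 can be made concrete with a rough parameter count of the dense part of the network. The sketch below is a back-of-the-envelope estimate only: the number of flattened features entering the hidden layer (`N_FEATURES`) is a hypothetical value, each binary network is assumed to have a single output unit, and the convolutional layers, which would also be replicated in every binary network, are ignored.

```python
def dense_head_params(n_features, n_hidden, n_outputs):
    """Trainable parameters of one hidden dense layer plus the output layer."""
    hidden = n_features * n_hidden + n_hidden  # weights + biases
    output = n_hidden * n_outputs + n_outputs  # weights + biases
    return hidden + output


if __name__ == "__main__":
    N_FEATURES = 3136  # hypothetical flattened feature count, not from the paper
    HIDDEN = 128

    # One 10-class network versus a population of ten binary networks.
    full = dense_head_params(N_FEATURES, HIDDEN, n_outputs=10)
    population = 10 * dense_head_params(N_FEATURES, HIDDEN, n_outputs=1)
    print(full, population, round(population / full, 2))
```

Under these assumptions the population of ten 128-neuron binary networks carries roughly ten times as many dense-layer parameters as a single 10-class network with the same hidden layer size, whereas a population of single-neuron networks remains small.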
IV Discussion
The experimental results of this work showed that neural networks with dense hidden layers might not be of great help in modeling the recognition/classification task. For a binary classification task, increasing the hidden layer size did not add much to any aspect of system performance, as shown in Table I. Even in complex multiclass recognition tasks such as digit and object identification, performance becomes nearly constant beyond a certain layer size. This characteristic layer size will probably depend on the level of abstraction of the problem; however, we can clearly see that good performance can be achieved using fairly sparse networks.
Acceptable performance can be easily achieved in simple classification tasks using small-sized networks, but this becomes harder in high-dimensional tasks. This can be seen in the differences between training networks for the binary classification task and for the multiclass tasks (Table I and Fig. 1(g)). Classification accuracy is nearly stable at low network density for the binary random data, but more hidden neurons are needed to reach the same stable performance for the MNIST and CIFAR-10 tasks. This suggested tearing down multiclass recognition tasks into multiple binary classification tasks, for which fast convergence, simpler architectures, and acceptable performance can be reached more easily.
Using the binary classification scheme for both the MNIST and CIFAR-10 datasets gave superior performance, even with a single-neuron hidden layer. Compared to the highest accuracy achieved for 10-class recognition, the binary scheme achieved higher classification accuracy for all components. This supports our claim that populations of binary systems can represent higher-dimensional datasets with better performance and cheaper training.
The high accuracy achieved by the binary classification networks motivated building multiclass recognition on top of these networks. Parallel binary networks with sparse hidden layers, one per desired class, were used to build multiclass classifiers with higher accuracy than a single multiclass network with the same number of hidden neurons used for the same task. A single network with 16 neurons in the hidden layer reached a classification accuracy of 82% (Fig. 1(g)) on the MNIST dataset, while 10 single-neuron binary networks achieved around 84% accuracy. The low rate of contradicting results among the 10 combined networks comes from the fact that the accuracy of each single network is so high that there is only a small chance that an image will be identified by more than one network.
The experiments performed in this study showed that robust results can be achieved using a small number of hidden neurons, and these results were confirmed by multiple trials on different datasets. The experiments also showed that using neuron populations in artificial object recognition can achieve a performance pattern similar to that anticipated from biological models. However, it must be taken into consideration that we tested the design concept for multi-object recognition and not for attributes of the same object such as shape, size, color, and rotation. This may indicate that increasing the level of representation favors recognition performance. In other words, adding a neuron to the population to tolerate more visual changes of objects may help achieve better recognition performance between objects.
To conclude, we assessed the effect of reducing the hidden layer size in neural networks on the performance of different recognition tasks, including binary random data and 10-class recognition problems. The results showed that high performance can be achieved using fairly small networks in terms of the number of hidden neurons. Moreover, we assessed the use of a population of small-sized binary networks in building multiclass recognition systems, and it showed superior results compared to multiclass systems with the same hidden layer size. There is more to build on these preliminary findings; therefore, we intend to test more visual aspects of object recognition with neuron populations for better generalization of the proposed design concept.
References
 [1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
 [2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov 1998.
 [3] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, Aug 2013.

 [4] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, Y. W. Teh and M. Titterington, Eds., vol. 9. Chia Laguna Resort, Sardinia, Italy: PMLR, 13–15 May 2010, pp. 249–256.
 [5] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the 25th International Conference on Machine Learning, ser. ICML ’08. New York, NY, USA: ACM, 2008, pp. 1096–1103.
 [6] M. Chen, W. Dai, S. Y. Sun, D. Jonasch, C. Y. He, M. F. Schmid, W. Chiu, and S. J. Ludtke, “Convolutional neural networks for automated annotation of cellular cryo-electron tomograms,” Nature Methods, vol. 14, no. 10, p. 983, 2017.
 [7] J. Ma, M. K. Yu, S. Fong, K. Ono, E. Sage, B. Demchak, R. Sharan, and T. Ideker, “Using deep learning to model the hierarchical structure and function of a cell,” Nature Methods, vol. 15, no. 4, p. 290, 2018.
 [8] Y. Bengio and Y. Lecun, Scaling learning algorithms towards AI. MIT Press, 2007.
 [9] Y. Bengio, “Learning deep architectures for AI,” Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1–127, Jan. 2009.
 [10] B. Zhu, J. Z. Liu, S. F. Cauley, B. R. Rosen, and M. S. Rosen, “Image reconstruction by domain-transform manifold learning,” Nature, vol. 555, no. 7697, p. 487, 2018.
 [11] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
 [12] J. Ba and R. Caruana, “Do deep nets really need to be deep?” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 2654–2662.
 [13] A. Choromanska, M. Henaff, M. Mathieu, G. Arous, and Y. LeCun, “The loss surfaces of multilayer networks,” Journal of Machine Learning Research, vol. 38, pp. 192–204, 2015.
 [14] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
 [15] J. V. Gorp, J. Schoukens, and R. Pintelon, “Learning neural networks with noisy inputs using the errors-in-variables approach,” IEEE Transactions on Neural Networks, vol. 11, no. 2, pp. 402–414, Mar 2000.
 [16] Y. LeCun, L. Bottou, G. B. Orr, and K. R. Müller, Efficient BackProp. Berlin, Heidelberg: Springer Berlin Heidelberg, 1998, pp. 9–50.
 [17] A. M. Saxe, J. L. McClelland, and S. Ganguli, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks,” arXiv preprint arXiv:1312.6120, 2013.
 [18] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization,” in Advances in Neural Information Processing Systems, 2014, pp. 2933–2941.
 [19] J. DiCarlo, D. Zoccolan, and N. Rust, “How does the brain solve visual object recognition?” Neuron, vol. 73, no. 3, pp. 415–434, 2012.
 [20] S. L. Brincat and C. E. Connor, “Underlying principles of visual shape selectivity in posterior inferotemporal cortex,” Nature neuroscience, vol. 7, no. 8, p. 880, 2004.
 [21] Y. Yamane, E. T. Carlson, K. C. Bowman, Z. Wang, and C. E. Connor, “A neural code for three-dimensional object shape in macaque inferotemporal cortex,” Nature neuroscience, vol. 11, no. 11, p. 1352, 2008.
 [22] N. C. Rust and J. J. DiCarlo, “Selectivity and tolerance (“invariance”) both increase as visual information propagates from cortical area V4 to IT,” Journal of Neuroscience, vol. 30, no. 39, pp. 12978–12995, 2010.
 [23] R. Desimone, T. Albright, C. Gross, and C. Bruce, “Stimulus-selective properties of inferior temporal neurons in the macaque,” Journal of Neuroscience, vol. 4, no. 8, pp. 2051–2062, 1984.
 [24] J. J. DiCarlo and D. D. Cox, “Untangling invariant object recognition,” Trends in cognitive sciences, vol. 11, no. 8, pp. 333–341, 2007.
 [25] S. Edelman, Representation and recognition in vision. MIT press, 1999.
 [26] C. Koch and I. Segev, “The role of single neurons in information processing,” Nature neuroscience, vol. 3, no. 11s, p. 1171, 2000.
 [27] A. C. Lorena, A. C. P. L. F. de Carvalho, and J. M. P. Gama, “A review on the combination of binary classifiers in multiclass problems,” Artificial Intelligence Review, vol. 30, no. 1, p. 19, Aug 2009.
 [28] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, and F. Herrera, “An overview of ensemble methods for binary classifiers in multiclass problems: Experimental study on one-vs-one and one-vs-all schemes,” Pattern Recognition, vol. 44, no. 8, pp. 1761–1776, 2011.
 [29] S. Knerr, L. Personnaz, and G. Dreyfus, “Single-layer learning revisited: A stepwise procedure for building and training a neural network,” Neurocomputing: Algorithms, Architectures and Applications. NATO ASI Series, vol. F68, pp. 41–50, 1990.
 [30] R. Anand, K. Mehrotra, C. K. Mohan, and S. Ranka, “Efficient classification for multiclass problems using modular neural networks,” IEEE Transactions on Neural Networks, vol. 6, no. 1, pp. 117–124, Jan 1995.
 [31] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Tech. Rep., 2009.
 [32] P. P. Boyle, “Options: A Monte Carlo approach,” Journal of Financial Economics, vol. 4, no. 3, pp. 323–338, 1977.