Cybernetics emerged as a recognized field of research in the 1940s and 1950s. Its raison d'être at the time was to answer the 'why' and the 'how' of the remarkable performance of biological brains, which were metaphorically regarded as performing computations and other logical operations. This era coincides historically with the maturity and endowment of abstract mathematics (foundations of mathematics, set theory, logic) and its entanglement with theoretical physics (quantum theory, the nature of space-time-matter) and theoretical biology (life as a physical phenomenon). In the 1960s and 1970s, Artificial Intelligence attracted much attention with the ambitious promise of programming computers to perform human-like pattern recognition and other perceptual-cognitive tasks, such as those of human vision. Fundamental theoretical obstacles to this extraordinary claim affirm the broader superiority of human cognitive performance in 'natural tasks' over similar attempts by computing machinery. Theoretical understanding and rigorous mathematical delineation of the limitations of machine learning are at the very heart of AI, and a prerequisite for the discovery of alternative technological solutions to overcome such obstructions.
Foundational contributions of Vladimir Vapnik [1, 2] and others revealed the advantages of including 'empirical' concepts, and opened the way for machine learning to take advantage of the human factor beyond conventional ANN architectures and frequency-based and Bayesian statistics. In a related direction and independently, the neuroscientist H. Barlow and others made seminal contributions on the role of sparse coding and information in the brain's performance. W. Bialek [24, 25], T. Poggio, T. Sejnowski and others bridged the gap between the biological-behavioral and the computational-mathematical models of brain function, utilizing any or all of the above-mentioned concepts. In the last decade of the 20th century, a different breakthrough in brain research emerged, and to this date it continues to receive much attention. D. Field, B. Olshausen and others [11, 12] took a dramatic turn in the experimental approach by studying the brain while it performs tasks in 'natural scenes'. This was a breakthrough in thinking about intelligence in its natural setting, outside the stringent conditions of conventional labs. There are myriad ideas, and no shortage of improvements on and syntheses of the many fruitful accomplishments of biological and computational learning. The unique medium offered by broadband, ubiquitous digital communication and the ensuing cyberspace, however, requires a fresh approach to two centuries of milestones in studying intelligence and intelligent behavior. This article takes the historical viewpoint that each of the above-mentioned conceptual breakthroughs (the biological, physical and mathematical structures) has been developing towards the following common regimes: (a) The nature of intelligence and intelligent behavior is dynamic in the sense of physics; the patterns that are observable in a physical approach to biological intelligence are mathematically associated with Complex Systems, and are thus continually subject to indeterminacy, reasoning under uncertainty and modeling in a probabilistic framework. (b) Such systems have hierarchical organizations at multiple scales, and observations of different levels of the hierarchy require multi-resolution accuracy. (c) The adjacent levels of the hierarchy are brought together by non-linear interactions (i.e., they require more than the 'superposition and scaling' that are the hallmarks of any linear theory). (d) Quantifiable communication schemes most probably govern the rules of interaction within and among the (sub-)elements of the hierarchy. (e) The physical units of quantitative communication for each scale and level of the hierarchy could be 'reconstructed' from a sufficient number of observations of the smaller-scale dynamics and behaviors. The unit in one level of the hierarchy is often the Gestalt of an ensemble of physically smaller-scale entities of the hierarchy.
With these preliminaries out of the way, we proceed to explore computationally a realization of the concept appropriate for the current state of machine learning, namely, the intelligent behavior emerging from a collection of intelligent agents that are not constrained to be exclusively biological or machine-like. This is referred to as a heterogeneous ensemble in view of the fundamental differences between human and machine intelligence, as outlined above. Moreover, the intelligent behavior is expected to follow the Field-Olshausen approach [11, 12] and take on 'natural scenes and natural stimuli' to probe the nature of intelligent performance. Accordingly, we study 'natural pattern recognition' tasks of the type initiated by one or more biological intelligent agents. We also emphasize the physical medium for communication as an equally important factor in the exploratory research below. Finally, the physical currency of communication is what we refer to as 'biological information', while the dynamic process of manipulating it and successfully reaching a stopping criterion (a solution) is referred to as 'biological computation'. The progress report on biological communication and computation is relegated to other forthcoming articles. The reader, however, can trace the germ of the development of these ideas in the modest preliminary account below. In the next section we illustrate a practical example of the Collective Cognitive System, and the results achieved with this example are described in the "Experimental Results" section.
Artificial Neural Networks (ANNs) are inspired by the structure and performance of the brains of higher animals. The human brain, with its sophisticated topology, can determine and extract significant features of objects, so it is intuitively reasonable to simulate a similar structure for the feature extraction problem mentioned in the Introduction section. The HHML methodology is an example of the Collective Cognitive System as described above.
The HHML method is built on a Super Structure Artificial Neural Network (S2AN2), which is trained like an ordinary ANN for classification. Training is performed on labeled data sets with at least two different classes, and the process adjusts the weights of the network towards the training purpose (from now on, the purpose of the training process is taken to be training a classifier for a sample classification problem). As proved in the literature, the weights of a trained ANN represent the amount of transitory impact that the corresponding nodes have on the training purpose. Based on these values, the weights form a ranking of the S2AN2 nodes, including the nodes in the input layer. This ranking can in turn be used to order the features associated with the input nodes, and it represents the effectiveness of those features with respect to the training goal. The key idea of the HHML is to study this effectiveness and to refine or reduce the set of features. Re-examining the S2AN2 with the reduced set of features and considering the precision of its results gives a good evaluation of the correctness of the method, which is considered in the "Experimental Results" section.
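The ranking idea can be illustrated with a minimal sketch. It assumes, purely for illustration, that the effectiveness of a feature is scored by the summed absolute weights on the edges leaving its input node; the text does not state the exact scoring rule, and the function name `rank_features` and the toy weight matrix are hypothetical.

```python
import numpy as np

def rank_features(first_layer_weights):
    """Rank input features by the total absolute weight leaving each input node.

    first_layer_weights: array of shape (n_features, n_hidden).
    Returns (order, scores); order lists feature indices from the most to the
    least influential under this assumed scoring rule.
    """
    w = np.asarray(first_layer_weights, dtype=float)
    scores = np.abs(w).sum(axis=1)
    order = np.argsort(-scores, kind="stable")
    return order, scores

# Toy weights for 4 features feeding 3 hidden nodes (made-up numbers).
toy_weights = np.array([
    [0.10, -0.20, 0.05],
    [1.50, -0.70, 0.90],
    [0.00, 0.01, -0.02],
    [0.60, 0.40, -0.50],
])
order, scores = rank_features(toy_weights)
print(order.tolist())  # [1, 3, 0, 2]
```

Under this scoring, the reduced feature set would simply be a prefix of `order`; how long that prefix should be is decided by re-examining the classifier on the reduced features, as described above.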
The S2AN2 uses the Back Propagation algorithm and is composed of two hyper-layers. The first hyper-layer is designed to study the effectiveness of features. In this layer there is, for each class, a Unit Back Propagation ANN (UBP) that receives all the inputs (features); by the end of the training process, each of these UBPs shows how decisive every feature has been for learning its associated class. For a training data set with K classes, the second hyper-layer then consists of a UBP with K inputs that shows object IDs in its output layer. For each UBP, the number of hidden layers and their nodes is in direct relationship with the original number of features. The activation function employed is the common sigmoid function. A template of a UBP is depicted in Figure 1.
As mentioned, there are K UBPs in the first hyper-layer, and all the features are fed to all of them. In this layer each UBP has one output node that returns a real number in [-1, 1], corresponding to the amount of collaboration of that UBP in calculating the class ID of the input. The network can therefore be trained so that each UBP is assigned to one class. The output node of a UBP takes the value 1 when the object belongs to the class associated with that UBP, and the value moves towards -1 as the uncertainty about the class membership grows. It is a floating-point number that reflects the similarity of classes. Output nodes in this hyper-layer are connected to the input nodes of the second hyper-layer by constant edges (with a constant weight of 1). These edges are assigned a constant value so that they remain neutral in the course of the learning process while connecting the two hyper-layers. During the training process, the UBP in the second hyper-layer analyzes the outputs of the K previous UBPs and determines class IDs based on them. The UBP in the second hyper-layer calculates its error, updates its weights and back-propagates the errors to the first hyper-layer. Based on these errors, each UBP in the first hyper-layer calculates its local error and propagates it to update its own weights. The topology of the S2AN2 is given in Figure 2.
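The two hyper-layers described above can be sketched as a forward pass. This is a minimal illustration under stated assumptions, not the authors' implementation: each UBP is given a single hidden layer, a sigmoid rescaled to (-1, 1) stands in for the activation so that outputs match the stated range, the weights are random and untrained, and all function names and toy sizes are hypothetical. Back-propagation through both hyper-layers is omitted.

```python
import numpy as np

def bipolar_sigmoid(x):
    # Sigmoid rescaled to the open interval (-1, 1) (assumed activation).
    return 2.0 / (1.0 + np.exp(-x)) - 1.0

def init_ubp(n_in, n_hidden, n_out, rng):
    # One hidden layer per UBP is an assumption made for brevity.
    return {
        "w1": rng.normal(0.0, 0.5, size=(n_in, n_hidden)),
        "w2": rng.normal(0.0, 0.5, size=(n_hidden, n_out)),
    }

def ubp_forward(ubp, x):
    hidden = bipolar_sigmoid(x @ ubp["w1"])
    return bipolar_sigmoid(hidden @ ubp["w2"])

def s2an2_forward(class_ubps, integrator, x):
    # First hyper-layer: each class UBP receives all features, emits one score.
    scores = np.concatenate([ubp_forward(u, x) for u in class_ubps], axis=1)
    # Weight-1 edges: the scores enter the second hyper-layer unchanged.
    class_code = ubp_forward(integrator, scores)
    return scores, class_code

rng = np.random.default_rng(0)
n_features, n_classes, n_id_nodes = 6, 3, 2  # toy sizes
class_ubps = [init_ubp(n_features, n_features, 1, rng) for _ in range(n_classes)]
integrator = init_ubp(n_classes, n_classes, n_id_nodes, rng)
batch = rng.normal(size=(5, n_features))
scores, class_code = s2an2_forward(class_ubps, integrator, batch)
print(scores.shape, class_code.shape)  # (5, 3) (5, 2)
```

Because the connecting edges have a fixed weight of 1, they appear in the sketch only as the direct hand-off of `scores` to the integrator; there is nothing to learn on them.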
Our method can be categorized as a zero-order method of model-dependent feature selection, which uses the network parameters only. This means that the important features of an input are selected by considering only the weights of our specific structure. This gestalt is a batch process in which we design a fully connected network for each class and train each network to handle specifically the objects of its own class. The results of these networks are then processed by another network (in the second hyper-layer) to determine the class ID. The whole structure can be viewed as a homogeneous ensemble that uses ANNs both as building blocks and as the final integrator. This structural and computational homogeneity is a very useful advantage: it makes the method suitable for parallel hardware design and implementation, which in turn could yield very fast special-purpose feature selection hardware.
3 Experimental Results
For a data set of objects inside a predetermined feature space, a feature extraction algorithm is supposed to choose a subset of features that captures all (or as much as possible) of the objects' information. The HHML algorithm is examined on the classification problem for two different data sets, one from astronomy and another from plant biology, which experts labeled (provided the class IDs) using a priori knowledge. The former comes from light curves of the SMC stars from the OGLE mission. The latter is a data set of root growth movies and consists of two classes of roots, namely wild type and mutant root seedlings. In what follows, the process of dimensionality reduction via the HHML method is illustrated on these data sets. In both cases the precision values are calculated with the common formula for precision.
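As a hedged illustration of how such precision values can be computed, the sketch below assumes that the formula meant is the standard definition of precision, i.e. true positives divided by all objects predicted as a given class (TP / (TP + FP)). The function name and the toy labels are hypothetical.

```python
import numpy as np

def per_class_precision(y_true, y_pred, n_classes):
    """Precision per class: true positives / all objects predicted as that class.

    Classes that are never predicted get NaN, since the ratio is undefined.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    precision = np.full(n_classes, np.nan)
    for c in range(n_classes):
        predicted_c = y_pred == c
        if predicted_c.any():
            precision[c] = np.mean(y_true[predicted_c] == c)
    return precision

# Toy labels for three classes (made-up values).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
precision = per_class_precision(y_true, y_pred, 3)
print(precision)
```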
3.1 The HHML dealing with an astronomy data set
The astronomy data set is a set of information extracted by processing the light curves of stars. It includes all the information that can be extracted from the light curves; these features capture all classes of stars and have previously been employed for distinguishing between different types of stars (SMC). Since the volume of such data is massive and growing every second, determining the significant features among the extracted features becomes a vital problem to solve. In this experiment, the HHML method is trained on 10,000 objects (stars) from the OGLE mission, which are represented by 13 features and belong to 10 classes. Based on this information, the S2AN2 structure consists of 13 input nodes for the 13 features, connected to 10 UBPs in the first hyper-layer. The 10 outputs of the first hyper-layer are fed into another UBP in the second hyper-layer. The UBP in the second hyper-layer has 4 output nodes, corresponding to a 4-digit representation of the 10 class IDs. The topology of the S2AN2 is shown in Figure 3 and details of the UBPs are tabulated in Table 1.
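The 4-digit representation can be illustrated as follows, assuming that it is a plain binary code (4 bits suffice for 10 classes, since 2^4 = 16) and that class IDs run from 1 to 10. Both points are assumptions, as are the function names and the 0.5 decoding threshold.

```python
import numpy as np

def encode_class_ids(class_ids, n_bits=4):
    """Map integer class IDs to fixed-width binary target vectors (MSB first)."""
    ids = np.asarray(list(class_ids), dtype=int)
    shifts = np.arange(n_bits - 1, -1, -1)
    return (ids[:, None] >> shifts) & 1

def decode_outputs(outputs, threshold=0.5):
    """Threshold output-node activations and read them back as integer IDs."""
    bits = (np.asarray(outputs) > threshold).astype(int)
    place_values = 2 ** np.arange(bits.shape[1] - 1, -1, -1)
    return bits @ place_values

targets = encode_class_ids(range(1, 11))
decoded = decode_outputs(targets)
print(targets[9].tolist())  # class 10 -> [1, 0, 1, 0]
print(decoded.tolist())     # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```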
Observing the amount of collaboration of the features, based on the weights of the edges in the trained network (Figure 4 and Table 2), shows that the first 8 features are sufficient to capture all the classes. The refined network was applied to another SMC test data set with 40,000 stars, and the classification results demonstrate the precision of the reduced set of features. The classification results on the test data set are presented in Table 3.
|over original DB||Class6||Class7||Class8||Class9||Class10|
3.2 The HHML performance using a plant biology data set
This data set includes 500 movies of the growth of Arabidopsis thaliana seedlings. We chose an equal number of movies from wild type and mutant roots; 400 of them are used for the training process, and evaluation is performed on the remaining 100 movies. Each movie is composed of 10 frames, which are used to distinguish between the classes. (The prediction of genotypic modification of quantitative phenotypic traits is a well-known concept, called Phenotype to Genotype mapping.) Hence, these frames are taken as the proposed features of the objects in each class, and the HHML method determines which subset of the features is representative of all the classes. A sample movie is shown in Figure 5.
Each frame is a 740x740 matrix of a root shape, and a vector representation of this matrix is fed into the network. In this case, therefore, each feature is an array rather than a single value, and the S2AN2 uses an extra UBP for each feature (placed before the UBPs of the classes) in the first hyper-layer. To handle this type of feature, the topology applied for the S2AN2 is as follows:
One UBP handles each feature array (10 features, so 10 UBPs, denoted UBP(1), in the first hyper-layer).
All UBP(1) networks feed their outputs to two UBPs corresponding to the two classes (2 classes, so 2 further UBPs, denoted UBP(2), in the first hyper-layer).
The results of the UBP(2) networks feed one UBP, denoted UBP(3), in the second hyper-layer, which calculates the class IDs.
In this structure all UBPs use a sigmoid function, and the structure of each UBP, determined by the feature information, is given in Table 4.
The error of the second hyper-layer is back-propagated to update the weights of the edges in both hyper-layers. Since a large portion of each frame is blank, the UBP(1) networks are employed to determine which part of the input vector is more significant. A visualization of the insignificant part can be seen in Figure 6, where the vectors of six frames are projected into a two-dimensional space. In this problem the S2AN2 thus performs two dimension reduction steps: it reduces the dimension of each feature (frame) and, at the same time, extracts the significant features from the 10 proposed features. Both reduction processes rely on the HHML concept and use the weights of the trained S2AN2 to refine the database.
When the training process finishes, the weights of the edges in all UBP(1) networks show which part of the input vector (the image) is meaningful, so the vector dimension can be reduced accordingly. Summing all the weights over all the UBP(1) networks and converting the resulting vector to matrix form shows the significant part of the images. This matrix gives the level of importance of the image pixels, and it shows that the frames can be reduced to 417x397 matrices that include all the significant pixels of all frames. The mapping of a frame into this matrix is represented in Figure 7. Moreover, our experiments showed that, of the 10 proposed features, only 4 frames (5, 6, 9, 10) are needed to classify the movies. We reconstructed the S2AN2 for the reduced features and examined the network on the test part of the data set (50 movies for each class). The mean and variance of the edge weights give a measure of the distribution of the weights. Our cutoff value is equal to Mean - 0.3*Variance; weights below it are changed to zero (the associated nodes and edges are removed from the S2AN2). Table 5 shows the accuracy of the network on the test data set, and Table 6 shows the ANN's resource usage when it was run on the original and on the refined databases. The resource usage and accuracy comparisons demonstrate the practicality of our algorithm for analyzing biological databases: on the reduced feature space, fast algorithms can be run using a small amount of RAM.
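The weight-based reduction of the frames can be sketched on a toy example. The sketch makes several assumptions that the text does not settle: pixel importance is taken as the sum of absolute input-layer weights over all UBP(1) networks, the Mean - 0.3*Variance cutoff is applied to that importance map rather than to individual edges, and the reduced frame is the bounding box of the surviving pixels. All names, shapes and numbers are hypothetical.

```python
import numpy as np

def importance_map(ubp1_input_weights, frame_shape):
    """Sum absolute input weights over all UBP(1) networks, reshaped to image form.

    ubp1_input_weights: list of arrays, each of shape (n_pixels, n_hidden).
    """
    total = sum(np.abs(w).sum(axis=1) for w in ubp1_input_weights)
    return total.reshape(frame_shape)

def prune_by_cutoff(values, factor=0.3):
    """Zero every entry below mean - factor * variance (the stated cutoff)."""
    cutoff = values.mean() - factor * values.var()
    return np.where(values < cutoff, 0.0, values), cutoff

def bounding_box(mask):
    """Smallest rectangle (row0, row1, col0, col1) containing all True pixels."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return int(rows[0]), int(rows[-1]) + 1, int(cols[0]), int(cols[-1]) + 1

# Toy setup: 6x6 frames, two UBP(1) networks with 2 hidden nodes each.
frame_shape = (6, 6)
significant = np.zeros(frame_shape, dtype=bool)
significant[1:4, 2:4] = True  # pretend the root occupies this region
weights = np.where(significant.reshape(-1, 1), 0.25, 0.005) * np.ones((36, 2))
ubp1_weights = [weights, weights.copy()]

pixel_importance = importance_map(ubp1_weights, frame_shape)
pruned, cutoff = prune_by_cutoff(pixel_importance)
box = bounding_box(pruned > 0)
print(box)  # (1, 4, 2, 4): the frame can be cropped to 3x2
```

In the experiment above the same kind of crop is what reduces the 740x740 frames to 417x397; the toy numbers here only demonstrate the mechanics.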
|Accuracy refined over original DB||100 Movies (50 Training, 50 Testing) from the training DB||100 Movies (50 Training, 50 Testing) from the testing DB|
In this research, a new viewpoint towards tackling complex feature extraction problems is proposed. The inspiration comes from the historical advances in understanding and modeling intelligent behavior, which includes feature extraction as a cornerstone. This preliminary progress report has focused on the results of computational and algorithmic design for modeling and realizing the conceptual framework. The theoretical considerations are relegated to an upcoming companion article. The computations above are based on the prevalent BP-ANN training architectures and their well-established learning abilities. The training process is modeled after the concept of a 'pipeline', in which inputs are processed with the additional provision of determining the role of each input in calculating the results of the output nodes. We evaluated our method using two different databases: movies from biological systems and astronomical observations. The assessment indicates that the architecture extracts the biologically significant parts of the frames and provides a novel method for 'dimensionality and size reduction' by orders of magnitude for the movies in the plant biology data set. To address an entirely different pattern recognition task, an astronomy data set was also studied using our model, and its significant features were likewise extracted. The latter outcomes are in line with the scientific objective of providing helpful software tools for future cosmology missions. We are confident that our method is applicable to other domains and classes of similar problems. Our work in progress includes the development of an improved version that could exploit the full power of parallel processing platforms. The project is also continuing, with positive progress, towards practical applications of on-demand dimensionality reduction for astronomy data sets as they become available from the GAIA mission of the European Space Agency.
The authors wish to thank Professor Luis S. Baro for his valuable help in understanding the astronomical problem and for providing the data sets. The authors also wish to thank Nathan Miller and Tessa Durham (Department of Botany, University of Wisconsin-Madison) for samples of the root growth movies.
[1] V. N. Vapnik, The Nature of Statistical Learning Theory, 1998.
[2] V. N. Vapnik, Statistical Learning Theory, 1998.
[3] E. Schneidman, et al., Network information and connected correlations, Phys Rev Lett, vol. 91, 2003.
[4] L. C. Osborne, et al., Time course of information about motion direction in visual area MT of macaque monkeys, J Neurosci, vol. 24, 2004.
[5] J. Bouvrie, et al., On Invariance in Hierarchical Models, Advances in Neural Information Processing Systems, vol. 22, 2009.
[6] S. Chikkerur, et al., A Bayesian inference theory of attention: neuroscience and algorithms, MIT-CSAIL-TR-, 2009.
[7] Tomaso Poggio and S. Smale, The Mathematics of Learning: Dealing with data, Notices of the AMS, vol. 50, 2003.
[8] J.-M. Fellous, et al., Discovering spike patterns in neuronal responses, Journal of Neuroscience, 2004.
[9] S. B. Laughlin and T. J. Sejnowski, Communication in neuronal networks, Science, vol. 301, 2003.
[10] A. P. Prescott, et al., interaction between shunting and adaptation controls a switch between integration and coincidence detection in pyramidal neurons, Neuroscience, vol. 26, 2006.
[11] B. A. Olshausen and D. J. Field, Vision and the Coding of Natural Images, American Scientist, vol. 88, pp. 238-24, 2000.
[12] B. A. Olshausen and D. J. Field, Sparse Coding of Sensory Inputs, Current Opinion in Neurobiology, vol. 14, pp. 481-487, 2004.
[13] G. P. Zhang, Neural Networks for Classification: A Survey, IEEE Transactions on Systems, Man, and Cybernetics, vol. 30, 2000.
[14] M. J. Healy and T. P. Caudell, A Categorical Semantic Analysis of ART Architectures, presented at IJCNN'01: International Joint Conference on Neural Networks, 2001.
[15] L. K. Stefano Rubele, Leo Girardi, The star formation history of the SMC star cluster NGC419, Solar and Stellar Astrophysics, 2009.
[16] OGLE. Available: http://ogle.astrouw.edu.pl/
[17] J. D. L. M. Sarro, C. Aerts, M. López, Comparative clustering analysis of variable stars in the Hipparcos, OGLE Large Magellanic Cloud and CoRoT exoplanet databases, Astronomy and Astrophysics, 2009.
[18] M. V. Rockman, Reverse engineering the genotype phenotype map with natural genetic variation, Nature, 2008.
[19] GAIA. Available: http://sci.esa.int/science-e/www/area/index.cfm?fareaid=26
[20] ESA. Available: http://www.esa.int/esaCP/index.html
[21] H. B. Barlow, Pattern recognition and the responses of sensory neurons, Annals of the New York Academy of Sciences, vol. 156, pp. 872-881, 1969.
[22] H. B. Barlow, Cerebral cortex as a model builder, John Wiley, 1985.
[23] H. B. Barlow, Matters of Intelligence, John Wiley, 1987.
[24] M. DeWeese and W. Bialek, Information flow in sensory neurons, Il Nuovo Cimento D, vol. 17, pp. 733-741, 1995.
[25] D. Warland, et al., Reading between the spikes in the cricket cercal afferent system, Analysis and Modeling of Neural Systems, pp. 327-333, 1991.
[26] T. Poggio, A theory of how the brain might work, presented at the Cold Spring Harbor Symposia on Quantitative Biology, 1990.
[27] A. J. Bell and T. J. Sejnowski, An Information-Maximization Approach to Blind Separation and Blind Deconvolution, Neural Computation, vol. 7, pp. 1129-1159, 1995.