The unreasonable effectiveness of small neural ensembles in high-dimensional brain

Despite the widespread consensus on the complexity of the brain, sprouts of the single-neuron revolution emerged in neuroscience in the 1970s. They brought many unexpected discoveries, including grandmother or concept cells and sparse coding of information in the brain. In machine learning, the famous curse of dimensionality seemed for a long time to be an unsolvable problem. Nevertheless, the idea of the blessing of dimensionality has gradually become more and more popular. Ensembles of non-interacting or weakly interacting simple units prove to be an effective tool for solving essentially multidimensional problems. This approach is especially useful for one-shot (non-iterative) correction of errors in large legacy artificial intelligence systems. These simplicity revolutions in the era of complexity have deep fundamental reasons grounded in the geometry of multidimensional data spaces. To explore and understand these reasons we revisit the background ideas of statistical physics. In the course of the 20th century they were developed into the concentration of measure theory. New stochastic separation theorems reveal the fine structure of the data clouds. We review and analyse biological, physical, and mathematical problems at the core of the fundamental question: how can a high-dimensional brain organise reliable and fast learning in a high-dimensional world of data using simple tools? Two critical applications are reviewed to exemplify the approach: one-shot correction of errors in intellectual systems and emergence of static and associative memories in ensembles of single neurons.


1 Introduction

Physics aims at describing the widest realms of reality with as few basic or fundamental principles as possible. Over centuries, from Newton to Einstein, the main striving was to discover the ‘laws of Nature’. These laws have to be beautiful and simple, and the explained portion of reality should be as large as possible. Nevertheless, according to Feynman, ‘We never are definitely right, we can only be sure we are wrong. … In other words we are trying to prove ourselves wrong as quickly as possible, because only in that way can we find progress’ FeynmanCharacter .

Once we proved our current laws wrong, new laws are needed. Discovering new laws, according to Einstein, can be viewed as a ‘flight from miracle’ Einstein : ‘The development of this world of thought is in a certain sense a continuous flight from ‘miracle’.’ What does it mean? Let us imagine: we have laws, beautiful and simple (the Newtonian mechanics, for example). Then we find a phenomenon that these laws cannot explain. This is a miracle, a phenomenon that contradicts the basic laws. However, we like these laws and try to use them again and again to describe the miracle. If we fail then we have to use another way. We like our laws but we like rationality more, and therefore we fly away from the miracle by inventing new laws, which are beautiful, simple and, at the same time, explain the phenomenon. After that, the miracle disappears and we have new laws, beautiful and simple (Fig. 1 a) GorbanMainstream2006 .

Figure 1: (a) The flight from miracle: Einstein’s road; (b) The struggle with complexity: the life battle of the model engineers.

The cycle of creation of new laws, their analysis followed by deduction of their consequences and drawing predictions requires an effective machinery. Mathematics has long been recognized as such. Wigner found this effectiveness mysterious and unreasonable Wigner1960 : ‘The miracle of the appropriateness of the language of mathematics for the formulation of the laws of physics is a wonderful gift which we neither understand nor deserve.’ There were many attempts to explain this miracle. The main idea was: Mathematics develops as a gradual ascent on stairs of abstractions. It consists of tools to solve problems, that have arisen in the development of tools to solve problems, that have arisen…, … in developing tools to solve real life problems. It is hence not surprising that such a tool for tools, a beautiful metatool or megatool, is indeed effective downstairs.

The face of science is changing gradually. The phrase of the great physicist Stephen Hawking, ‘I think the next century will be the century of complexity,’ in his ‘millennium’ interview on January 23, 2000 (San Jose Mercury News) became a widely cited prophecy. The idea of ‘complexity’ requires perhaps even more elaborate clarification than Einstein’s ‘flight from miracle’. Large nonlinear systems and ‘emerging’ phenomena have been studied for a long time. More than two thousand years ago Aristotle wrote that ‘the whole is something besides the parts’ (Metaphysics, Book 8, Chapter 6). Since then and to date, complexity is traditionally attributed to the objects and phenomena that are being studied. However, as has been advocated in GorbanYabl2013 , to truly understand what it is that constitutes complexity, instead of working with an object or phenomenon itself, analysing human behavior and activities (i.e. the struggle with complexity) arising whilst studying the phenomenon could sometimes be more productive. This leads to a simple yet fundamental conclusion that complexity is the gap between the laws and the phenomena GorbanYabl2013 .

We can imagine a ‘detailed’ model for a phenomenon but due to complexity, we cannot work with this detailed model. For example, we can write the Schrödinger equation for nuclei and electrons, but we cannot use it directly for modeling materials or large molecules. The result of the struggle with complexity is a model that works. This is reminiscent of engineering: a model is a device, and this device must be functional. Different models are needed for different purposes. One may combine the first principles, empirical data, and even conduct dedicated active experiments to create suitably detailed models. These imaginary detailed models are then combined into ‘possible worlds’ satisfying the first-principled laws completely. A crucial question arises: is an observed phenomenon possible in such a possible world?

In the context of mind and brain, examples of such possible worlds as well as the corresponding open questions are well-known (see the review Perlovsky2006 ). The variety of proposed answers is impressive. Minsky asked Minsky1988 : ‘What magical trick makes us intelligent?’ and answered: ‘The trick is that there is no trick’ – no new principles are necessary. Quite the opposite, Penrose stated that the known physical principles are insufficient for the explanation of intelligence and mind Penrose1994 . But even for the somewhat simpler fluid dynamics models in the atomistic world the situation is not fully clear yet: the famous Hilbert problem ‘of developing mathematically the limiting processes, … which lead from the atomistic view to the laws of motion of continua’ is not solved; moreover, the possible solution may be negative Slemrod2018 . These problems form an essential part of Hilbert’s 6th problem GorbanHilbert2018 .

Many toolboxes have been developed for model reduction, especially in such areas as physical and chemical kinetics, where the variety of models is huge TomlinEtAl1997 ; GorbanModRed2018 . Modern mathematical modelling is impossible without model reduction, and the model should be simplified during development GorbanModRed2018 .

Model reduction is necessary in the analysis of both natural and artificial neural networks too. For example, dimensionality reduction is imperative for defining the ‘genuine dimension’ of the processed signals and images and for the development of lower-dimensional models. But surprisingly, the simplicity of some basic models in neuroscience is not a result of model reduction. It does not come from fundamental laws either. It does not follow the fundamental schemes from Figs. 1 a,b. We consider this simplicity as a riddle. Further in this review we detail this riddle, step by step, and discuss its possible solution.

In neuroscience, like in physics, some simple models work reasonably well. Recall, for instance, classical Pavlovian conditioning and Hebb’s rule. But as we go deeper into details of neuronal mechanisms of perception and memory, we expect complexity of models to increase, and the structural-functional relationships between brain structures and observed phenomena to become more sophisticated. Surprisingly, simple models of neuronal mechanisms of perception and memory demonstrate remarkable effectiveness.

Figure 2: Idea of the grandmother cell, a neuron that reacts selectively to a pattern: Jennifer Aniston cell, Dodecahedron cell, and Grandmother cell cell.

One basic example, grandmother cells, seemed to be so simplistic that the term was introduced by Lettvin around 1969 in a jocular story about ‘a great if unknown neurosurgeon’, Akakhi Akakhievitch, who deleted concepts from a patient’s memory by ablating the corresponding cells Gross2002 . More seriously, this is a hypothesis that there are neurons that react selectively to specific concepts and images: there are grandmother cells, Jennifer Aniston cells, Dodecahedron cells, or even Grandmother cell cells (the cells that react selectively to the pattern or even the idea of the grandmother cell) (see Fig. 2). Two years before Lettvin’s joke, the very similar concept of gnostic cells was introduced by Konorski in his research monograph Konorski1967 .

There are several attempts to give a more formal definition of grandmother (or concept) cell. These definitions differ in their relation to real brain and these differences stimulate intensive discussions Bowers2009 ; Bowers2010 ; Plaut2010 ; QuianQuirogaKreiman2010 ; QuianQuirogaKreiman2010a .

Figure 3: Schematic representation of the idea behind idealised ‘concept cells’: a high-dimensional input signal (dimension $D$), after preprocessing (dimension $n$), arrives at an ensemble of non-interacting concept cells. The number of concept cells is much larger than $n$ (and may be even larger than $D$). Each concept cell reacts to a single concept in various presentations: visual, verbal, or even in imagination. A source of sparsity could be due to restrictions on the number of inputs of a single concept cell and on the number of links from one component of the preprocessed signal to different concept cells.

The idea about one-to-one correspondence ‘one concept – one cell’ told in Lettvin’s jocular story about Akakhi Akakhievitch and grandmother cell is hard (if at all possible) to verify in an experiment. And yet, the main hypothetical properties of the idealised concept cells are not radically unusual if their broad interpretations are allowed (Fig. 3):

  • Selectivity – the cell reacts to a single specific concept. Of course, it is impossible to exhaustively test all concepts, and the precise meaning and interpretation of what constitutes the cell’s reaction is to be defined. Moreover, association of concepts is also possible, and the formal notion of a single concept in brain study is not strictly defined either. Nevertheless, psychologists and brain scientists successfully work with this concept of a concept.

  • Invariance – the cell reacts to various representations of the same concept, including different visual images, verbal representation, and even imagination.

  • Different concept cells receive components of the same vector of preprocessed signal.

  • The number of different concepts can be much larger than the dimension of the preprocessed signal (and even larger than the dimension of the inputs).

The concept of a grandmother cell did not emerge as an outcome of any model reduction procedure applied to a detailed model. The brain modelling does not follow the model engineering ‘struggle with complexity’ scheme (Fig. 1 b): there is no complex model that follows the basic laws but does not work. On the contrary, the grandmother cell is a simple phenomenological construction that has been extracted from the discussion of the empirical world and used as a basic brick in modeling. Everyone understands the limitations of this simple model and further development adds more specific details.

This methodology seems to be different from modern physics:

  • In physics, we can imagine a ‘detailed model’ that follows the basic laws. We consider a ‘possible world’, that is the Universe, which follows the basic laws in full detail. The notion of possible worlds Hintikka1967 was introduced by Leibniz for the semantics of science. Of course, even here we do not have a complete self-consistent system of laws, but there is a web of standard models GorbanHilbert2018 with some contradictions that could be more or less hidden. Therefore, strictly speaking, even in physics we cannot speak of possible worlds, but rather of ‘impossible possible worlds’ Hintikka1975 ; GorbanHilbert2018 . However, contradictions, such as the quantum gravity problem, seem to be far from most real life problems, and a possible world of physics can be presented along with the ideal ‘detailed model’ for further simplification.

  • In neuroscience, a detailed physical model of the brain is not conceivable if we want it to be a model of the brain and do not want to consider a model of trivial generality, such as some unknown ensemble of molecules, reactions and flows. Instead of a detailed model of the brain, a collection of elementary components like the grandmother cell is used, and a system of models is created. This system is developed further in various directions: to the upper levels, with the necessary specification and addition of new details to describe and explain the integrative activity of the brain, and to more physical and technical levels, to decipher the physical and chemical mechanisms of elementary processes in the brain.

Simple models are at the centre of the developing system of models, and many of these simple models are grandmother cells or very simple neural ensembles with a certain semantic function. These simple models were not produced by reduction of more detailed models. They also do not represent basic fundamental laws. This simplicity without simplification was guessed from observations and scientific speculations about brain functioning and subsequently confirmed by various experiments that required, at the same time, modifications of the original models. Such an approach, without systematic appeal to basic laws and detailed models, is also widely used in physics when the basic laws are largely unknown (see the classification of models in physics proposed by Peierls Peierls1980 ).

The idea that ‘each concept – each person or thing in our everyday experience – may have a set of corresponding neurons assigned to it’ was supported by experimental evidence and became very popular QuianQuirogaSciAm2013 . Of course, the ‘pure grandmother cells’ are considered now as the ultimate abstraction. If two concepts appear often together, then the cell will react to both. Moreover, this association can be remembered from a single exposure IsonQQ2015 . For example, neuroscientists will associate Jennifer Aniston with the concept of the grandmother cell after the famous series of experiments which used pictures of this actress QuianQuirogaNature2005 . If somebody sees Jennifer Aniston with a dodecahedron in her hands then he (or she) will associate these images for a long time, and the corresponding Jennifer Aniston cell could be, at the same time, the Dodecahedron cell. Interaction between concepts can also be negative; for example, it was reported that in some experiments the Jennifer Aniston unit did not respond to pictures of Jennifer Aniston together with Brad Pitt: one image can outshine the other in perception QuianQuirogaNature2005 .

Deciphering the concepts hidden in the activity of populations of neurons is a fascinating task. It requires a combination of statistical methods and information theory PougetEtAl2000 ; QuianQuirogPanzeri2009 .

These further complications cannot overshadow the simplicity of the main construction: the idea of the grandmother cell is unexpectedly efficient. Moreover, it is very close to experiment, because the study of single-cell responses to various stimuli or behaviours is the basis of experimental neuroscience, despite the obvious expectation that the brain makes decisions by processing the activity of large neuronal populations QuianQuirogPanzeri2009 . After decades of development of the initial idea, the neural coding of concepts is considered as sparse, but it is not a single-cell coding QuianQuirogaTrends2008 .

The brain processes an intensive flow of high-dimensional information. For example, the simulation of realistic-scale brain models IzhikevichEdelman2008 reported in 2005 involved $10^{11}$ neurons and almost $10^{15}$ synapses. One second of simulation took 50 days on a Beowulf cluster of 27 processors, 3 GHz each.

A natural question arises: is there a fundamental reason for the emergence of single grandmother cells or small neuronal ensembles with a certain semantic function in such a multidimensional brain system and information flow? We aim to answer this question and to demonstrate that the answer is likely to be ‘yes’: There are very deep reasons for the appearance of such elementary structures in a high-dimensional brain. This is a manifestation of a general blessing of dimensionality phenomenon.

The blessing of dimensionality is based on the theory of measure concentration phenomena Bal1997 ; GianMilman2000 ; Gromov2003 ; Ledoux2005 ; Talagrand1995 and the stochastic separation theorems GorbanGrechukTykin2018 ; GorbTyu2017 ; GorbanTyuRom2016 . These results form the basis of the ‘blessing of dimensionality’ in machine learning AndersonEtAl2014 ; Donoho2000 ; Donoho and Tanner (2009); GorbanGrechukTykin2018 ; GorbanTyuRom2016 ; Kainen1997 .

Measure concentration phenomena were discovered in statistical physics. The Gibbs theorem about equivalence of ensembles is one of the first results in this direction: the microcanonical ensemble (equidistribution on an isoenergetic surface in an invariant measure, that is, the phase volume in an infinitesimally thin layer) is equivalent to the canonical distribution that maximizes the entropy Gibbs1902 under some assumptions about regularity of the interaction potential. The simplest geometric particular case of this theorem is: the equidistribution in a ball is equivalent both to the equidistribution on the sphere (Fig. 4) and to the Gaussian distribution with the same expectation of $\sum_i x_i^2$, where $x_i$ are the coordinates.

Figure 4: The volume of a high-dimensional ball is concentrated near its border (sphere). Here, $V_n(r)$ is the volume of the $n$-dimensional ball of radius $r$.
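A quick numerical illustration of this concentration (a minimal sketch; the dimensions and the shell width are arbitrary illustrative choices): since $V_n(r)$ is proportional to $r^n$, the fraction of the volume of the unit $n$-ball lying within a thin shell of width $\varepsilon$ under its surface is $1-(1-\varepsilon)^n$, which tends to one as $n$ grows.

```python
# Fraction of the unit n-ball volume lying within distance eps of its surface.
# V_n(r) is proportional to r**n, so this fraction equals 1 - (1 - eps)**n.
eps = 0.01
for n in (3, 10, 100, 1_000, 10_000):
    shell_fraction = 1 - (1 - eps) ** n
    print(f"n = {n:6d}: volume fraction within {eps} of the sphere = {shell_fraction:.4f}")
```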

Maxwell used the concentration of measure phenomenon to prove the famous Maxwell distribution. Consider a rotationally symmetric probability distribution on the $n$-dimensional unit sphere. Then its orthogonal projection on a line is close to the Gaussian distribution with small variance $1/n$ (for large $n$, with high accuracy). This is exactly the Maxwellian distribution for one degree of freedom in a gas. The distribution on the unit sphere is here the microcanonical distribution of the kinetic energy of a gas, when the potential energy is negligible. Geometrically, this means that the ‘observed diameter’ in one-dimensional projections of the unit sphere is small. It is of the order of $1/\sqrt{n}$.
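This projection property is easy to check numerically (a minimal sketch, with an arbitrary dimension and sample size): coordinates of points drawn uniformly from the unit sphere in $\mathbb{R}^n$ have variance close to $1/n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1_000, 10_000                                   # dimension and number of sample points
x = rng.normal(size=(m, n))
x /= np.linalg.norm(x, axis=1, keepdims=True)          # uniform points on the unit sphere
proj = x[:, 0]                                         # orthogonal projection on a coordinate line
print(f"empirical variance: {proj.var():.6f}, theoretical 1/n: {1 / n:.6f}")
```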

Lévy Levy1951 took this phenomenon very seriously. He was the first mathematician who recognised it as a very important property of the geometry of multidimensional spaces. He noticed that instead of orthogonal projections on a straight line we can use any $1$-Lipschitz function $f$ (with $|f(x)-f(y)| \le \|x-y\|$). Let points $x$ be distributed on the unit $n$-dimensional sphere with a rotationally symmetric probability distribution. Then the values of $f(x)$ will be distributed ‘not more widely’ than a normal distribution around the median value $M_f$ of $f$: for all $\varepsilon > 0$

$$\mathbf{P}\{|f(x) - M_f| > \varepsilon\} \le 2\exp\left(-c\, n\, \varepsilon^2\right),$$

where $c > 0$ is a constant. From the statistical mechanics point of view, this Lévy Lemma describes the upper limit of fluctuations in a gas for an arbitrary observable quantity $f$. The only condition is the sufficient regularity of $f$ (the Lipschitz property).

Khinchin created mathematical foundations of statistical mechanics based on the central limit theorem and concentration inequalities Khinchin1949 . This is one of several great ‘Mathematical Foundations’ aimed to answer the challenge of Hilbert’s 6th problem Gorban and Tyukin (2018).

‘Blessing of dimensionality’ is the recently coined term for situations where high-dimensional complexity makes computation easier Kainen1997 . It is opposite to the famous ‘curse of dimensionality’. The curse and blessing of dimensionality are two sides of the same coin.

Two unit random vectors in a high-dimensional space are almost orthogonal with high probability. This is a simple manifestation of the so-called waist concentration Gromov2003 . Nearly all the area of a high-dimensional sphere is concentrated near its equator. This is obvious: just project the sphere onto a hyperplane and use the concentration argument for a ball on the hyperplane (with a simple trigonometric factor). This seems to be highly non-trivial if we ask: near which equator? The answer is obvious but counter-intuitive: near each equator. There are exponentially large (in dimension $n$) sets of almost orthogonal vectors (with small values of inner products) in $n$-dimensional Euclidean space Kurkova1993 . This ‘quasiorthogonality’ was used for indexing in high-dimensional databases Hecht-Nielsen1994 . Applications of quasiorthogonality to data separation problems are presented in Section ‘Quasiorthogonal sets and Fisher separability of not i.i.d. data’ of GorbanGolubGrechTyu2018 .

Moreover, exponentially large numbers of randomly and independently chosen vectors from equidistribution on a sphere (and from many other distributions) are almost orthogonal with probability close to one GorbTyuProSof2016 . The probabilistic approach to quasiorthogonality is proven to be useful for construction of dimensionality reduction by random mapping Kaski1998 ; RitterKohonen1989 .
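A small Monte Carlo sketch of this quasiorthogonality (dimensions and sample sizes are arbitrary illustrative choices): pairwise cosines between independent random unit vectors concentrate around zero, with a typical scale of order $1/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(1)
for n in (10, 100, 1_000, 10_000):
    u = rng.normal(size=(200, n))
    u /= np.linalg.norm(u, axis=1, keepdims=True)      # 200 independent random unit vectors
    cos = u @ u.T                                      # matrix of pairwise cosines
    off_diag = np.abs(cos[~np.eye(len(cos), dtype=bool)])
    print(f"n = {n:6d}: max |cos| over all pairs = {off_diag.max():.3f} "
          f"(typical scale 1/sqrt(n) = {n ** -0.5:.3f})")
```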

The classical concentration of measure theorems state that independent identically distributed (i.i.d.) random points are concentrated in a thin layer near a surface. This layer can be a sphere or an equator of a sphere, an average or median level set of energy or another Lipschitz function, etc.

The novel stochastic separation theorems describe the fine structure of these thin layers GorbTyu2017 : each point $x$ from a finite random set $Y$ is linearly separable from the rest of the set $Y \setminus \{x\}$ with high probability, even for exponentially large random sets $Y$. The linear functionals for separation of points can be selected in the form of the simplest linear Fisher’s discriminant. For example, points sampled independently from the equidistribution in the 100-dimensional unit ball have, with probability close to one, the following property: for each pair of distinct sample points $x$ and $y$,

$$(x, y) < \alpha\, (x, x),$$

where $(\cdot\,,\cdot)$ is the standard inner product, $\|x\| = \sqrt{(x,x)}$ is the Euclidean norm, and $\alpha < 1$ is a separation threshold.
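A Monte Carlo sketch of this property (the sample size and threshold below are illustrative choices, not the constants of the theorems): sample points uniformly from the 100-dimensional unit ball and count the points for which the inequality above is violated by some other sample point.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, alpha = 100, 3_000, 0.8                  # dimension, sample size, threshold (illustrative)

g = rng.normal(size=(N, n))
directions = g / np.linalg.norm(g, axis=1, keepdims=True)
radii = rng.random(N) ** (1.0 / n)             # gives the equidistribution in the unit ball
x = directions * radii[:, None]

gram = x @ x.T                                 # inner products (x_i, x_j)
sq = np.diag(gram)                             # squared norms (x_i, x_i)
viol = gram > alpha * sq[:, None]              # (x_i, x_j) > alpha (x_i, x_i)
np.fill_diagonal(viol, False)
print("points not Fisher-separable from some other point:", viol.any(axis=1).sum(), "of", N)
```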

Stochastic separation theorems hold for a much wider class of probability distributions than just equidistributions in a ball or Gaussian distributions. The requirement of i.i.d. samples can be significantly relaxed GorbanGrechukTykin2018 . Other simple decision rules can be used instead of Fisher’s linear discriminants. For example, simple radial basis functions and Gaussian mixtures can also be easily implemented for separation of points in high dimensions AndersonEtAl2014 . In most practical situations we have encountered, linear separation seems to be more useful than radial basis functions, since linear discriminants combine the ability of stochastic separation with a reasonable ability of generalization.

The main requirement for a data point distribution is that it must be significantly multidimensional. Some effort was needed to answer the question: which distributions are essentially multidimensional, so that the stochastic separation properties are guaranteed? A special case of this question was formulated in 2009 by Donoho and Tanner as an open problem Donoho and Tanner (2009). The answer GorbanGrechukTykin2018 is based on the idea that the sets of relatively small volume should not have relatively high probability, with detailed specification of these ‘relatively small’ and ‘relatively high’ in the conditions of the theorems.

Thus, ‘high-dimensional data’ does not simply mean ‘data with many coordinates’. For example, data located on a straight line, a low-dimensional plane, or a smooth curve with bounded (and not very large) curvature in a high-dimensional space are not high-dimensional data. Then, a special preprocessing is needed to prepare the data for application of simple separating elements, like threshold elements or single neurons. In this preprocessing, the data should be compressed to their genuine dimension. Stochastic separation is possible if the genuine dimension is sufficiently high. This preprocessing and the evaluation of the genuine dimension require special techniques. In machine learning it is a combination of various linear and nonlinear dimensionality reduction methods GorbanKegl2008 ; GorZin2010 , compressed sensing Donoho2006 ; Eldar2012 , autoencoders Kramer1991 ; VincentBengio2008 , and other approaches. To the best of our knowledge, this preprocessing cannot be performed by ensembles of simple noninteracting elements.

Thus, in a high-dimensional data flow, many concepts can be separated by simple elements (or neurons). This result gives grounds for the convergent evolution of artificial and natural intelligence despite the very different timescales of their changes: they use simple elements in a similar way.

One can expect that a system of perception and recognition consists of two subsystems. The first subsystem preprocesses data, compresses them and extracts features. The second subsystem recognises patterns and returns the relevant concepts. The preprocessing system is expected to be strongly connected, while the recognition system can consist of simple noninteracting elements (grandmother cells). These elements receive preprocessed data and give the relevant response. The blessing of dimensionality phenomena are utilised by the second system, whose elements are simple, do not interact, and resemble the concept cells AndersonEtAl2014 .

A more general architecture was proposed recently for AI systems. The layer of noninteracting ‘concept cells’ or the cascade of such layers grows as a system for correction of errors of a legacy intellectual system. Any AI system makes errors. The successful functioning of a system in realistic operational conditions dictates that the errors should be detected and corrected immediately. Real-time correction of mistakes by re-training is not always viable due to the resources involved. Moreover, the re-training could introduce new mistakes and damage existing skills. Correction of errors is becoming an increasingly important and challenging problem in AI technology.

Figure 5: Corrector of AI errors. The legacy AI system is represented as a network of elementary units. Inputs for this corrector may include input signals, and any internal or output signal of the AI system (marked by circles).

The recently proposed architecture of such an elementary corrector GorbanTyuRom2016 consists of two ideal devices (Fig. 5); a minimal code sketch is given after the list:

  • A binary classifier for separation of the situations with possible mistakes from the situations with correct functioning (or, more advanced, separation of the situations with high risk of mistake from the situations with low risk of mistake);

  • A new decision rule for the situations with possible mistakes (or with high risk of mistakes).
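A minimal sketch of this two-part construction (the class, the feature vector, and the decision rule below are hypothetical stand-ins, not the implementation of GorbanTyuRom2016): a linear, Fisher-type discriminant built from the recorded mistakes flags suspicious inputs, and only for those inputs a corrected label replaces the legacy answer.

```python
import numpy as np

class ElementaryCorrector:
    """Wraps a legacy AI system; a linear discriminant built from (whitened) feature
    vectors of recorded mistakes decides when to override the legacy output."""

    def __init__(self, legacy_predict, mistake_features, corrected_label, alpha=0.8):
        self.legacy_predict = legacy_predict
        self.w = np.asarray(mistake_features).mean(axis=0)   # centroid of the mistake cluster
        self.threshold = alpha * np.dot(self.w, self.w)      # Fisher-type threshold
        self.corrected_label = corrected_label

    def predict(self, features, raw_input):
        # Binary classifier: is the current situation similar to the recorded mistakes?
        if np.dot(self.w, features) > self.threshold:
            return self.corrected_label                      # new decision rule for flagged cases
        return self.legacy_predict(raw_input)                # otherwise keep the legacy answer
```

Several such correctors, each built from its own cluster of errors, can work side by side with a low probability of conflict, in line with the cascade idea discussed next.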

One corrector can fix several errors (it is useful to cluster them before corrections). Several correctors can work independently with low probability of conflict. Cascades of correctors are employed for further correction of more errors Gorban and Tyukin (2018): the AI system with the first corrector is a new legacy AI system and can be corrected further (Fig. 6).

Figure 6: Cascade of AI correctors. In this diagram, the original legacy AI system (shown as Legacy AI System 1) is supplied with a corrector altering its responses. The combined new AI system can in turn be augmented by another corrector, leading to a cascade of AI correctors.

Fascinating speculations and hypotheses about correctors in the evolution of the brain can be suggested, but here we avoid this topic and try not to deviate significantly from strictly proven theorems or results supported by experiments or computational experiments. The effectiveness of one-trial correctors of AI errors is supported both by the theorems and by computational experiments. Existence and selectivity of concept cells have been proven experimentally. Thus, we observe a simplicity revolution in neuroscience that started with the invention of grandmother cells and has recently approached the theory and technology of one-shot learning of AI correctors.

In Sec. 2 we review the results about grandmother cells, concept cells and sparse coding. Blessing of dimensionality and stochastic separation theorems are presented in Sec. 3. Computational experiments with correctors of legacy AI systems, knowledge transfer between AI systems using correctors, and cascades of correctors are discussed in Sec. 3.5. In Sec. 4 we return to brain dynamics and present a model of the appearance of codifying memory in stratified brain structures such as the hippocampus, following TyukinBrain2017 . We show how single neurons can selectively detect and learn arbitrary information items, given that they operate in high dimensions. These results constitute a basis for the organization of complex memories in ensembles of single neurons.

2 Grandmother cells, concept cells, and specific coding

Three years after the term ‘grandmother cell’ was proposed, Barlow published an essay about the single neuron revolution in sensory psychology Barlow1972 . His central proposition was: ‘our perceptions are caused by the activity of a rather small number of neurons selected from a very large population of predominantly silent cells.’ It was described in more detail in the ‘five dogmas’ of single-cell perception:

  1. To understand nervous function one needs to look at interactions at a cellular level, rather than either a more macroscopic or microscopic level, because behaviour depends upon the organized pattern of these intercellular interactions.

  2. The sensory system is organized to achieve as complete a representation of the sensory stimulus as possible with the minimum number of active neurons.

  3. Trigger features of sensory neurons are matched to redundant patterns of stimulation by experience as well as by developmental processes.

  4. Perception corresponds to the activity of a small selection from the very numerous high-level neurons, each of which corresponds to a pattern of external events of the order of complexity of the events symbolized by a word.

  5. High impulse frequency in such neurons corresponds to high certainty that the trigger feature is present.

He presented a collection of examples of single-cell perception found experimentally. In all these examples neurons reacted selectively to the key patterns (called ’trigger features’) and this reaction was invariant to changes of conditions like variations in light intensity or the presence of noise. Such sensory neurons can be located in various places, for example, in the retina or in the mammalian cerebral cortex. The trigger features of the retinal ganglion cells are the key patterns of the visual information. They are more elementary than the objects or concepts (like ‘grandmother’ or ‘Jedi’). The higher order concepts are combined from these trigger features. Therefore, the representation of objects and concepts by the retina can be considered as distributed or implicit. They are coded explicitly at a higher level of the visual information processing stream QuianQuirogaKreiman2010 .

The status of these five dogmas was defined as a simple set of hypotheses that were compatible with known facts. Barlow also criticised the concept of the grandmother cell because it neglects the connection between one perception and others. He suggested that unique events are not isolated in our perception; they overlap with each other, and the concept of the grandmother cell makes this continuity impossible.

Discussion about the role of single cells in perception has a long history. James in his book James1890 (Chapter VI, p. 179) considered the possibility to revive Leibniz’s multiple monadism (or polyzoism): ‘Every brain-cell has its own individual consciousness, which no other cell knows anything about, all individual consciousness being ’ejective’ to each other. There is, however, among the cells one central or pontifical one to which our consciousness is attached’ (following the term ‘grandmother cell’, we can call the pontifical cell the ‘I-cell’). Sherrington criticised the idea of the pontifical cell and proposed the ‘million-fold democracy whose each unit is a cell’ Sherrington1941 . He stated that the perception of any specific object is performed by the joint activity of many millions of neurons.

Barlow proposed the concept of ‘cardinal cells’. This is the idea of sparse coding: ‘Among the many cardinals only a few speak at once’ Barlow1972 . He also attracted the attention of readers to the fact that the number of cortical neurons in area 17 is orders of magnitude greater than the number of incoming fibres (area 17, or V1, or striate cortex, is the end organ of the afferent visual system and is situated in the occipital lobe). This can be considered as a sign that there are many more neurons for codifying memory and perception than the dimension of the visual inputs into this area. The hierarchy of sensory neurons is very different from the church hierarchy: there are many more cardinal cells than ordinary ‘church members’.

A series of experiments demonstrated that neurons in the human medial temporal lobe (MTL) fire selectively to images of faces, animals, and other objects or scenes Fried1997 ; Kreiman2000 ; QuianQuirogaNature2005 . It was demonstrated that the firing of MTL cells was sparse because most of them did not respond to the great majority of images used in the experiment. These cells have low baseline activity and their response is highly selective QuianQuirogaNature2005 . ‘Jennifer Aniston’ cells responded to pictures of Jennifer Aniston but rarely or very weakly to pictures of other persons. An important observation was that these neurons also respond to the printed name of the person. Therefore, they react not only to the visual likeness but to the concept. Moreover, voluntary control of these neurons is possible via imagination: it is sufficient to imagine the concept or ‘continuously think of the concept’ Cerf2010 .

These discoveries clearly demonstrate fantastic selectivity of the neuronal response. At the same time, the ‘ideal’ grandmother cells with one-to-one relations between the concept cells and the concepts were not found:

  • There are several concept cells for each concept. A single unique cell would be practically impossible to find; therefore, the fact that these cells were found implies that there are sufficiently many ‘Jennifer Aniston’ cells, ‘Brad Pitt’ cells, etc.

  • One cell can fire for different concepts – this is association between concepts. For example, the ‘Jennifer Aniston’ cells also fired to Lisa Kudrow, a costar in the TV series Friends, and the ‘Luke Skywalker’ cell also fired to Yoda QuianQuirogaSciAm2013 . This means that the concepts extracted by these units are wider than just a single name or character. Nevertheless, the presence of the individual can, in most cases, be reliably decoded from a small number of neurons.

  • Redistribution of attention between different parts of the image can have an effect opposite to association. For example, the ‘Jennifer Aniston’ cell did not react to the picture in which Jennifer Aniston appears together with Brad Pitt: it is recognised rather as Brad Pitt.

One can modify the concepts behind the images and insist that there are proper ‘grandmother cells’ (maybe several for each concept) and we should just properly define the concept of ‘grandmother’. For example, the units which react both to Luke Skywalker and to Yoda can be called the ‘Famous Jedi’ cells, whereas the units which fire both to Jennifer Aniston and to Lisa Kudrow are ‘The stars of the TV series Friends’ cells. There was an intensive discussion about the definition of the proper grandmother cells Bowers2009 ; Bowers2010 ; Plaut2010 ; QuianQuirogaKreiman2010 ; Roy2013 , which ended with an analysis of the notion itself; this analysis was necessary because ‘a typical problem in any discussion about grandmother cells is that there is not a general consensus about what should be called as such’ QuianQuirogaKreiman2010a .

A ‘neuron-centered’ idea of concepts is possible: a concept is a pattern of the input information that evokes a specific selective response of a group of cells. In this sense, the grandmother cells exist. The conventional notion of ‘concepts’ is different: these ‘concepts’ are used in communication between people and as such are compromises between neuron-centered concepts in different individuals. Convergence of these different worlds of concepts is also possible because the concept cells themselves can learn rapidly. The timescale of this learning is compatible with that of episodic memory formation. Classical experiments revealing such learning in hippocampal neurons were presented in 2015 IsonQQ2015 . A cell was found that actively responded to presentation of the image of the patient’s family and did not respond to the image of the Eiffel tower before learning. In the single-trial learning the composite picture, the family member at the Eiffel tower, was exposed once. After that, the firing rate of the cell in response to the Eiffel tower increased significantly, the response to the image of the family member did not change significantly, and it remained similar to the response to the composite image. This plasticity and rapid learning of associations do not fully correspond to the ‘monadic’ idea of the grandmother cell or to the idea of a ‘concept cell’ with rigorously defined concepts. The discussion about terminology and interpretation could continue further.

Nevertheless, without any doubt and without re-definition of the concepts, the experiments with single-cell perception show that there are (relatively) small sets of neurons that react actively and selectively to the input images associated with a small group of interrelated concepts (and do not react actively to other images). That is, the small neural ensembles separate the images associated with a group of associated concepts from other images.

There are several other paradigms of coding of sensory information in the brain. Distributed coding, as opposed to local coding, is used for coding and prediction of motion Eichenbaum1993 ; Wilson1993 . Correlations between different neurons are important for information decoding at the cell population level even if the pair correlations are rather low Averbeck2006 . The balance between single-cell and distributed coding remains a non-trivial problem in neuroscience Valdez2015 . The intermediate paradigm of sparse coding suggests that sensory information is encoded using a small number of neurons active at any given time Olshausen2004 (only a few cardinals speak at once). Sparse coding decreases ‘collisions’ (intersections) between patterns. It is also beneficial for learning associations. According to Barlow Barlow1972 , the nervous system tends to form neural coding with higher degrees of specificity, and sparse coding is just an intermediate step on the way to achieving maximal specificity.

In the next section we demonstrate why highly specific coding by simple elements (concept cells) is possible and beneficial for a high-dimensional brain in a high-dimensional world, and explain what ‘high-dimensional’ means in this context.

3 Blessing of dimensionality: concentration of measure and stochastic separation theorems

In the sequel, we use the following notations: $\mathbb{R}^n$ is the $n$-dimensional real linear vector space; $x$ and $y$ denote elements of $\mathbb{R}^n$; $(x,y)$ is the standard inner product of $x$ and $y$; and $\|x\| = \sqrt{(x,x)}$ is the Euclidean norm in $\mathbb{R}^n$. The symbol $\mathbb{B}_n$ stands for the unit ball in $\mathbb{R}^n$ centered at the origin, $\mathbb{B}_n = \{x \in \mathbb{R}^n : \|x\| \le 1\}$. $V_n$ is the $n$-dimensional Lebesgue measure, and $V_n(\mathbb{B}_n)$ is the volume of the unit ball. $S^{n-1}$ is the unit sphere in $\mathbb{R}^n$. For a finite set $Y$, the number of points in $Y$ is $|Y|$.

3.1 Fisher’s separability of a point from a finite set

Separation of a point from a finite set is a basic task in data analysis. For example, for correction of a mistake of an AI system we have to separate the situation with mistake from a relatively large set of situations, where the system works properly.

In high dimensions, the simple linear Fisher discriminant demonstrates a surprisingly high ability to separate points. After data whitening, Fisher’s discriminant can be defined by a simple linear inequality:

Definition 1.

A point $x \in \mathbb{R}^n$ is Fisher-separable from a finite set $Y \subset \mathbb{R}^n$ with a threshold $\alpha$ ($0 < \alpha < 1$) if

$$(x, y) \le \alpha\, (x, x) \quad\quad (1)$$

for all $y \in Y$.

Whitening is a special change of coordinates that transforms the empiric covariance matrix into the identity matrix. It can be represented in four steps:

  1. Centralise the data cloud (subtract the mean from all data vectors), normalise the coordinates to unit variance and calculate the empiric correlation matrix.

  2. Apply principal component analysis (i.e. calculate eigenvalues and eigenvectors of empiric correlation matrix).

  3. Delete minor components, which correspond to the small eigenvalues of empiric correlation matrix.

  4. In the remained principal component basis normalise coordinates to unit variance.

Deletion of minor components is needed to avoid multicollinearity, i.e. strong linear dependence between coordinates. Multicollinearity makes the models sensitive to fluctuations in data and unstable. The condition number of the correlation matrix is the standard measure of multicollinearity, that is, the ratio $\kappa = \lambda_{\max}/\lambda_{\min}$, where $\lambda_{\max}$ and $\lambda_{\min}$ are the maximal and the minimal eigenvalues of this matrix. Collinearity is strong if $\kappa$ is large. If $\kappa$ does not exceed the recommended border then collinearity is considered as modest and most of the standard regression and classification methods work reliably Dormann2013 .
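A compact sketch of the whole whitening procedure (function and variable names are illustrative; kappa_max is an assumed collinearity border, a tunable choice rather than a prescribed value):

```python
import numpy as np

def whiten(X, kappa_max=1_000.0):
    """Rough whitening of a data cloud X (rows are data points), following the four steps
    above: centre and normalise, PCA of the empirical correlation matrix, drop minor
    components by a condition-number criterion, rescale the retained components."""
    Xc = X - X.mean(axis=0)                          # 1. centralise
    Xc = Xc / Xc.std(axis=0)                         #    normalise coordinates to unit variance
    C = np.corrcoef(Xc, rowvar=False)                #    empirical correlation matrix
    lam, V = np.linalg.eigh(C)                       # 2. eigenvalues and eigenvectors (PCA)
    order = np.argsort(lam)[::-1]
    lam, V = lam[order], V[:, order]
    keep = lam > lam[0] / kappa_max                  # 3. delete minor components
    return (Xc @ V[:, keep]) / np.sqrt(lam[keep])    # 4. unit variance in the retained basis

# toy usage: 1000 points in dimension 30 with strongly unequal coordinate scales
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 30)) * np.linspace(1.0, 1e-3, 30)
Z = whiten(X)
print(Z.shape, np.allclose(np.cov(Z, rowvar=False), np.eye(Z.shape[1]), atol=1e-2))
```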

Whitening has linear complexity in the number of datapoints. Moreover, for approximate whitening we do not need to use all data and sampling makes complexity of whitening essentially sublinear.

It could be difficult to provide the precise whitening for intensive real life data streams. Nevertheless, a rough approximation to this transformation creates useful discriminants (1) as well. With some extension of meaning, we call ‘Fisher’s discriminants’ all the linear discriminants created non-iteratively by inner products (1).

Definition 2.

A finite set $Y \subset \mathbb{R}^n$ is called Fisher-separable with threshold $\alpha$ if inequality (1) holds for all $x, y \in Y$ such that $x \ne y$. The set $Y$ is called Fisher-separable if there exists some $\alpha$ ($0 < \alpha < 1$) such that $Y$ is Fisher-separable with threshold $\alpha$.

Using elementary geometry, we can show that inequality (1) holds for vectors $x$ and $y$ if and only if $x$ does not belong to the ball (Fig. 7)

$$\left\| x - \frac{y}{2\alpha} \right\| < \frac{\| y \|}{2\alpha}. \quad\quad (2)$$

The volume of ball (2), where the discriminant inequality (1) does not hold, can be relatively small.
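The ‘elementary geometry’ here is a completion of the square (with the notation introduced above):

$$
(x, y) > \alpha\,(x, x)
\;\Longleftrightarrow\;
(x, x) - \left(x, \frac{y}{\alpha}\right) < 0
\;\Longleftrightarrow\;
\left\| x - \frac{y}{2\alpha} \right\|^2 < \left\| \frac{y}{2\alpha} \right\|^2 ,
$$

so inequality (1) fails exactly when $x$ lies in the open ball with the segment $[0, y/\alpha]$ as a diameter, in agreement with (2) and Fig. 7.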

Figure 7: The diameter of the filled ball (excluded volume) is the segment $[0, y/\alpha]$. Point $x$ should not belong to the excluded volume to be separable from $y$ by the linear discriminant (1) with threshold $\alpha$. Here, $0$ is the origin (the centre of the unit ball), and $l$ is the hyperplane such that $(x, z) = \alpha (x, x)$ for $z \in l$. A point $x$ should not belong to the union of such balls for all $y \in Y$ for separability from a set $Y$.

For example, if $Y$ is a subset of the unit ball $\mathbb{B}_n$, then the volume of each ball (2) does not exceed $V_n(\mathbb{B}_n)/(2\alpha)^n$. To use this simple estimate further, assume that $\alpha > 1/2$. A point $x$ is separable from a set $Y$ by Fisher’s linear discriminant with threshold $\alpha$ if it does not belong to the union of these excluded balls. The volume of this union does not exceed

$$ V_n(\mathbb{B}_n)\, \frac{|Y|}{(2\alpha)^n}. \quad\quad (3)$$

If $|Y| < b^n$ with $1 < b < 2\alpha$ then the fraction of excluded volume in the unit ball decreases exponentially with dimension $n$ as $\left(\frac{b}{2\alpha}\right)^n$.

Proposition 1.

Let $1/2 < \alpha < 1$, let $Y \subset \mathbb{B}_n$ be a finite set with $|Y| < b^n$ ($1 < b < 2\alpha$), and let $x$ be a randomly chosen point from the equidistribution in the unit ball. Then, with probability $p > 1 - \left(\frac{b}{2\alpha}\right)^n$, the point $x$ is Fisher-separable from $Y$ with threshold $\alpha$ (1).

This proposition is valid for any finite set $Y$ and without any hypotheses about its statistical nature. No underlying probability distribution is assumed for the points of $Y$. For $x$, Proposition 1 assumes equidistribution in a ball. Of course, the class of distributions satisfying this property is much larger. It is required that the probability to find $x$ in the small excluded volume (3) is small and vanishes when dimension $n \to \infty$. Assume that the number of points $|Y|$ can grow with dimension $n$ but is bounded by a geometric progression, $|Y| < b^n$, $b > 1$. The simplest condition on the distribution of $x$ can be formulated for the probability density. Let the probability distribution in the unit ball have a continuous density $\rho(x)$, and let $\rho_{\max} = \max_x \rho(x)$. Then the probability to find a random point in the excluded volume (3) does not exceed

$$\rho_{\max}\, V_n(\mathbb{B}_n)\, \frac{|Y|}{(2\alpha)^n}.$$

For the equidistribution in a ball, $\rho_{\max} = 1/V_n(\mathbb{B}_n)$. Consider distributions that can deviate significantly from the equidistribution, with deviations that can grow with dimension $n$ but not faster than a geometric progression with common ratio $c \ge 1$:

$$\rho_{\max} \le \frac{c^n}{V_n(\mathbb{B}_n)}; \quad\quad (4)$$

here $c$ does not depend on $n$.

For such a distribution in the unit ball, the probability to find a random point in the excluded volume (3) tends to 0 when $n \to \infty$ as a geometric progression with the common ratio $\frac{bc}{2\alpha}$, provided that $bc < 2\alpha$. We now formulate this important fact as a theorem.

Theorem 1.

Assume that a probability distribution in the unit ball has a density $\rho$ with maximal value $\rho_{\max}$ that satisfies inequality (4). Let $Y \subset \mathbb{B}_n$ be a finite set with $|Y| < b^n$, and let $bc < 2\alpha < 2$. Then the probability $p$ that a random point $x$ is Fisher-separable from the finite set $Y$ with threshold $\alpha$ satisfies $p > 1 - \vartheta^n$, where

$$\vartheta = \frac{bc}{2\alpha} < 1.$$

Two restrictions on the distribution of $x$ were used: the absence of large deviations (the distribution has bounded support contained in the unit ball) and the boundedness of the density (4). The first restriction allowed us to consider balls of excluded volume (Fig. 7) with bounded diameter $\|y\|/\alpha$. The second restriction guarantees that the probability to find a point in such a ball is small. The essence of this restriction is: the sets of relatively small volume have relatively small probability. For the whitened data this means that the distribution is essentially $n$-dimensional and is not concentrated near a low-dimensional ‘principal object’, linear or nonlinear GorbanKegl2008 ; GorZin2010 ; ZinMir2013 .

For the analysis of Fisher-separability of a general distribution, an auxiliary random variable is useful. This is the probability $p_y$ that a randomly chosen point $x$ is not Fisher-separable with threshold $\alpha$ from a given data point $y$ by the discriminant (1) (i.e. the inequality (1) is false):

$$ p_y = \mathbf{P}\left\{ (x, y) > \alpha\, (x, x) \right\} = \int_{\left\| x - \frac{y}{2\alpha} \right\| < \frac{\|y\|}{2\alpha}} \rho(x)\, dx, \quad\quad (5) $$

where $\rho$ is the probability density for $x$. For example, for the equidistribution in a ball and an arbitrary $y$ from the ball, $p_y \le \frac{1}{(2\alpha)^n}$. The probability to select a point $x$ inside the union of the ‘excluded balls’ is less than $\frac{N}{(2\alpha)^n}$ for any sampling of $N$ points $y$ in $\mathbb{B}_n$.

If $y$ is a random point (not necessarily with the same distribution as $x$) then $p_y$ is also a random variable. For a finite set $Y$ of data points, the probability that the data point $x$ is not Fisher-separable with threshold $\alpha$ from the set $Y$ can be evaluated by the sum of $p_y$ for $y \in Y$:

$$ \mathbf{P}\{x \text{ is not Fisher-separable from } Y\} \le \sum_{y \in Y} p_y. \quad\quad (6) $$

Evaluation of sums of random variables (dependent or independent, identically distributed or not) is a classical area of probability theory Zolotarev2011 . The inequality (6) in combination with the known results provides many versions and generalizations of Theorem 1. For this purpose, it is necessary to evaluate the distribution of $p_y$ and its moments.

Evaluation of the distribution of $p_y$ and its moments does not require knowledge of the detailed multidimensional distribution of $x$. It can be performed over an empirical dataset as well. Comparison of the empirical distribution of $p_y$ with the distribution evaluated for the high-dimensional sphere provides interesting information about the ‘genuine’ dimension of the data. For the equidistribution on the sphere, the probability $p_y$ is the same for all $y$ and exponentially decreases for large $n$.

A simple estimate gives, for the rotationally invariant equidistribution on the unit sphere GorbanGolubGrechTyu2018 :

$$ p_y \asymp \frac{(1-\alpha^2)^{\frac{n-1}{2}}}{\alpha\sqrt{2\pi n}}. \quad\quad (7) $$

Here $f \asymp g$ means that $f/g \to 1$ when $n \to \infty$ (note that the functions here are strictly positive).
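A sketch of how this comparison can be used in practice (names and the toy data below are hypothetical; data stands for an array of preprocessed feature vectors normalised to unit length): estimate $p_y$ empirically by counting violations of inequality (1), then match the mean against the right-hand side of (7) for different $n$ to read off an effective dimension.

```python
import numpy as np

def mean_p_y(data, alpha):
    """Empirical mean of p_y: for each point y, the fraction of other points x
    with (x, y) > alpha * (x, x), i.e. violating inequality (1)."""
    gram = data @ data.T
    sq = np.diag(gram)
    viol = gram > alpha * sq[:, None]          # viol[i, j]: x_i is not separable from y_j
    np.fill_diagonal(viol, False)
    return viol.sum() / (len(data) * (len(data) - 1))

def p_sphere(alpha, n):
    """Asymptotic estimate (7) for the equidistribution on the unit sphere."""
    return (1 - alpha ** 2) ** ((n - 1) / 2) / (alpha * np.sqrt(2 * np.pi * n))

# toy data instead of real features: 2000 random points on a 16-dimensional sphere
rng = np.random.default_rng(3)
data = rng.normal(size=(2000, 16))
data /= np.linalg.norm(data, axis=1, keepdims=True)

alpha = 0.8
empirical = mean_p_y(data, alpha)
dims = np.arange(5, 40)
effective_n = dims[np.argmin(np.abs(np.log(p_sphere(alpha, dims)) - np.log(empirical + 1e-300)))]
print(f"empirical mean p_y = {empirical:.2e}, formula (7) for n=16: {p_sphere(alpha, 16):.2e}, "
      f"estimated effective dimension ~ {effective_n}")
```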

3.2 A LFW case study

In this Section, we evaluate the moments of $p_y$ for the well-known benchmark database LFW (Labeled Faces in the Wild). It is a set of images of famous people: politicians, actors, and singers LFWSurvey2016 . There are 13,233 photos of 5,749 people in LFW, 10,258 images of men and 2,975 images of women. The data are available online LFW .

In preprocessing, photos were cropped and aligned. Then, the neural network system FaceNet was used to calculate 128 features for each image. Facenet was specially trained to map face images into 128-dimensional feature space FaceNet2015 . The training was organised to decrease distances between the feature vectors for images of the same person and increase the distances between the feature vectors for images of different persons. After this step of preprocessing, the 128-dimensional feature vectors were used for analysis instead of the initial images. In this analysis we follow GorbanGolubGrechTyu2018 .

The standard Principal Component Analysis (PCA) provided an essential decrease of dimension. The dependence of the eigenvalues of the correlation matrix on the number of the principal component is presented in Fig. 8.

Figure 8: Eigenvalues of the correlation matrix for the LFW benchmark as a function of the number of the principal component. The vertical line separates the first 57 eigenvalues.

There exist various heuristics for selection of the principal components to retain Cangelosi2007 . The plot in Fig. 8 shows that not more than 63 principal components explain practically 100% of the data variance. We applied multicollinearity control for the selection of the main components. The recommended border of modest collinearity is given in terms of the condition number $\kappa = \lambda_{\max}/\lambda_{\min}$, where $\lambda_{\max}$ and $\lambda_{\min}$ are the maximal and the minimal eigenvalues of the correlation matrix Dormann2013 . Thus, we retained the 57 principal components whose eigenvalues of the covariance matrix satisfy this criterion.

Whitening was applied after projection of the feature vectors onto these 57 principal components. After that, the covariance matrix became the identity matrix. In the basis of principal components, whitening is the coordinate transformation $a_k \mapsto a_k/\sqrt{\lambda_k}$, where $a_k$ is the projection of the data vector on the $k$th principal component, and $\lambda_k$ is the eigenvalue of the covariance matrix corresponding to the $k$th principal component.

Optionally, an additional preprocessing operation, the projection on the unit sphere, was applied after whitening. This is the normalization of each data vector $x$ to unit length (just take $x/\|x\|$ instead of $x$).
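The separability statistics reported below can be computed directly from the preprocessed vectors; a sketch follows (features and labels are hypothetical stand-ins for the whitened 57-dimensional feature vectors and the person identifiers; for a data set of this size the Gram matrix should in practice be processed in blocks):

```python
import numpy as np

def separability_stats(features, labels, alpha):
    """Count points that are not Fisher-separable (inequality (1) violated) from the rest
    of the data set and from points of other classes only, and the mean of p_y."""
    gram = features @ features.T
    sq = np.diag(gram)
    viol = gram > alpha * sq[:, None]                  # point i not separable from point j
    np.fill_diagonal(viol, False)
    insep_all = viol.any(axis=1)                       # inseparable from some other point
    other = labels[:, None] != labels[None, :]         # pairs of different persons
    insep_other = (viol & other).any(axis=1)           # inseparable from a different class
    mean_p = viol.sum() / (len(labels) * (len(labels) - 1))
    return insep_all.sum(), insep_other.sum(), mean_p

# hypothetical usage with arrays `features` (whitened vectors) and `labels` (person ids):
# for alpha in (0.8, 0.9, 0.95, 0.98, 0.99):
#     N_a, N_star, mean_p = separability_stats(features, labels, alpha)
#     print(alpha, N_a, N_a / len(labels), N_star, mean_p)
```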

The results of the computation are presented in Table 1 GorbanGolubGrechTyu2018 . Here, $\alpha$ is the threshold in Fisher’s discriminant (1), $N$ is the number of points which cannot be separated from the rest of the data set by Fisher’s discriminant (1) with the given $\alpha$, $\nu$ is the fraction of these points in the data set, and $\bar{p}_y$ is the mean value of $p_y$.

$\alpha$            0.8        0.9        0.95       0.98       0.99
Separability from all data
$N$                 4058       751        123        26         10
$\nu$               0.3067     0.0568     0.0093     0.0020     0.0008
$\bar{p}_y$         9.13E-05   7.61E-06   8.91E-07   1.48E-07   5.71E-08
Separability from points of different classes
$N^*$               55         13         6          5          5
$\nu^*$             0.0042     0.0010     0.0005     0.0004     0.0004
$\bar{p}_y^*$       3.71E-07   7.42E-08   3.43E-08   2.86E-08   2.86E-08
Separability from all data on the unit sphere
$N$                 3826       475        64         12         4
$\nu$               0.2891     0.0359     0.0048     0.0009     0.0003
$\bar{p}_y$         7.58E-05   4.08E-06   3.66E-07   6.85E-08   2.28E-08
Separability from points of different classes on the unit sphere
$N^*$               37         12         8          4          4
$\nu^*$             0.0028     0.0009     0.0006     0.0003     0.0003
$\bar{p}_y^*$       2.28E-07   6.85E-08   4.57E-08   2.28E-08   2.28E-08
Table 1: Separability of data points by Fisher’s linear discriminant in the preprocessed LFW data set.

We cannot expect an i.i.d. data sampling from a ‘good’ distribution to be an appropriate model for the LFW data set. Images of the same person are expected to have more similarity than images of different persons. It is expected that the set of data in FaceNet coordinates will be even further from an i.i.d. sampling of data, because it is prepared to group images of the same person together and to ‘repel’ images of different persons. Consider the property of an image to be ‘separable from images of other persons’. Statistics of this ‘separability from all points of different classes’ by Fisher’s discriminant are also presented in Table 1. We call a point $x$ inseparable from points of different classes if inequality (1) is violated for at least one point $y$ of a different class (an image of a different person). We use stars in Table 1 for the statistical data related to separability from points of different classes (i.e., we use the notations $N^*$, $\nu^*$, and $\bar{p}_y^*$). When $\alpha$ approaches 1, both $\nu$ and $\bar{p}_y$ decay, and the separability of a point from all other points becomes better (see Table 1).

It is not surprising that this separability between classes is much more efficient, with far fewer inseparable points. Projection on the unit sphere also improves separability.

It is useful to introduce some baselines for comparison: which values of $\bar{p}_y$ should be considered as small or large? Two levels are obvious. For the experiments where all data points are counted in the excluded volume, we consider the level $1/K$ as the critical one, where $K$ is the number of different persons in the database. For the experiments where only images of different persons are counted in the excluded volume, the value $1/L$, where $L$ is the total number of images, seems to be a good candidate to separate the ‘large’ values of $\bar{p}_y$ from the ‘small’ ones. For the LFW data set, $1/K \approx$ 1.739E-04 and $1/L \approx$ 7.557E-05. Both levels have been achieved already for $\alpha = 0.8$. The parameter $\nu^*$ can be considered, for the experiments with separability from points of different classes, as the separability error. This error is extremely small: already for $\alpha = 0.8$ it is less than 0.5% without projection on the unit sphere and less than 0.3% with this projection. The ratio $(N - N^*)/N^*$ can be used to evaluate the generalisation ability of the Fisher’s linear classifier. The numerator is the number of data points (images) inseparable from some points of the same class (the same person) but separated from the points of all other classes. For these images we can say that Fisher’s discriminant makes some generalizations. The denominator is the number of data points inseparable from some points of other classes (persons). According to the proposed indicator, the generalization ability is impressively high for $\alpha = 0.8$: it is 72.8 for preprocessing without projection onto the unit sphere and 102.4 for preprocessing with such projection. For $\alpha = 0.9$ it is smaller (56.8 and 38.6, correspondingly) and decays fast with further growth of $\alpha$.

Figure 9: Dependence of $\bar{p}_y$ and $\bar{p}_y^*$ on $\alpha$. Symbols correspond to the preprocessed LFW data set projected on the unit sphere (data from Table 1): pluses are used for $\bar{p}_y$ and skew crosses are used for $\bar{p}_y^*$. Solid lines show the dependence of $p_y$ on $\alpha$ for the equidistribution on the unit sphere (7) in different dimensions $n$ (from the top down, $n$ increases). (a) Original database, with saturation of $\bar{p}_y^*$ at large $\alpha$; (b) The database after fixing two labeling mistakes. Pluses and crosses on the bottom borderline correspond to zero values.

In the projection on the unit sphere, $p_y \to 0$ when $\alpha \to 1$. This is a trivial separability on the border of a strongly convex body, where each point is an extreme one. The rate of this convergence to 0 and the separability for $\alpha < 1$ depend on the dimension and on the intrinsic structures in the data (multicluster structure, etc., Fig. 9). Compare the observed behaviour of $\bar{p}_y$ and $\bar{p}_y^*$ for the LFW data set projected on the unit sphere to $p_y$ for the equidistribution on the unit sphere (7). We can see in Fig. 9 (a) that the values of $\bar{p}_y$ for the preprocessed LFW database projected on the unit sphere correspond approximately to the equidistribution on the sphere in dimension 16, and for larger $\alpha$ this effective dimension decreases to 14. For $\bar{p}_y^*$ we observe higher effective dimensions: approximately 27 for the smaller values of $\alpha$, decreasing to about 19 as $\alpha$ grows. There is a qualitative difference between the behaviour of $\bar{p}_y^*$ for the empirical database and for the equidistribution. For the equidistribution on the sphere, $p_y$ decreases, approaching the point $\alpha = 1$ from below, like $(1-\alpha^2)^{(n-1)/2}$. In logarithmic coordinates it should look like $\frac{n-1}{2}\log(1-\alpha^2) + \mathrm{const}$, exactly as we can see in Fig. 9. The behaviour of $\bar{p}_y^*$ for the LFW database is different and demonstrates some saturation rather than decay to zero. Human inspection of the inseparability at the largest values of $\alpha$ shows that there are two pairs of images which completely coincide but are labeled differently. This is an error in labeling. After fixing this error, the saturation vanished (Fig. 9 (b)), but the obvious difference between the curves for the empirical data and the data for the equidistribution on the sphere remained. We suppose that this difference is caused by the multicluster structure of the preprocessed LFW database.

3.3 Fisher’s separability for different probability distributions

The first stochastic separation theorems about Fisher's separability were proved for points randomly and independently sampled from an equidistribution in a ball GorbanTyuRom2016. These results were later extended to product distributions in a cube GorbTyu2017.

Theorem 1 and two following theorems GorbTyu2017 demonstrate that Fisher’s discriminants are powerful in high dimensions.

Theorem 2 (Equidistribution in ).

Let  be a set of i.i.d. random points sampled from the equidistribution in the unit ball . Let  and . Then

(8)
(9)
(10)

Inequalities (8), (9), and (10) are also closely related to Proposition 1. According to Theorem 2, the probability that a single element from the sample is linearly separated from the set by the hyperplane is at least

This probability estimate depends on both  and the dimensionality . An interesting consequence of the theorem is that if one picks a probability value, say , then the maximal possible value of  for which the set remains linearly separable with probability no less than  grows at least exponentially with . In particular, the following corollary holds.

Corollary 1.

Let  be a set of i.i.d. random points from the equidistribution in the unit ball . Let , and . If

(11)

then If

(12)

then

In particular, if inequality (12) holds then the set is Fisher-separable with probability .

Note that (10) implies that elements of the set  are pairwise almost or -orthogonal, i.e.  for  and all , , with probability larger than or equal to . Similar to Corollary 1, one can conclude that the cardinality of samples with such properties grows at least exponentially with . The existence of the phenomenon has been demonstrated in Kurkova1993. Theorem 2, Eq. (10), shows that the phenomenon is typical in some sense (cf. GorbTyuProSof2016, Kůrková and Sanguineti (2017)).
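A quick numerical illustration of this quasi-orthogonality can be obtained by sampling random points, projecting them onto the unit sphere and inspecting the pairwise cosines (the dimension, sample size and tolerance below are illustrative and not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 200, 1000                                  # illustrative dimension and sample size
X = rng.standard_normal((M, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # project the sample onto the unit sphere
cosines = np.abs((X @ X.T)[np.triu_indices(M, k=1)])
eps = 0.2                                         # illustrative orthogonality tolerance
print("largest |cos| over all pairs:", cosines.max())
print("fraction of eps-orthogonal pairs:", np.mean(cosines < eps))
```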

The linear separability property of finite but exponentially large samples of i.i.d. random elements is not restricted to equidistributions in a ball . As has been noted in Gorban, Romanenko et al. (2016), it holds for equidistributions in ellipsoids as well as for Gaussian distributions. Moreover, it was generalised to product distributions in a unit cube. Consider, e.g., the case when the coordinates of the vectors  in the set are independent random variables  with expectations  and variances . Let  for all . The following analogue of Theorem 2 can now be stated.

Theorem 3 (Product distribution in a cube GorbTyu2017 ).

Let be i.i.d. random points from the product distribution in a unit cube. Let

Assume that the data are centralised and . Then

(13)
(14)

In particular, under the conditions of Theorem 3, the set  is Fisher-separable with probability , provided that , where  and  are constants depending only on  and .

The proof of Theorem 3 is based on concentration inequalities in product spaces Talagrand1995 . Numerous generalisations of Theorems 2 and 3 are possible for different classes of distributions, for example, for weakly dependent variables, etc.

We can see from Theorem 3 that the discriminant (1) works without precise whitening. Only the absence of strong degeneration is required: the support of the distribution is contained in the unit cube (that is, it is bounded from above) and, at the same time, the variance of each coordinate is bounded from below by .

Various generalisations of these theorems were proved in GorbanGolubGrechTyu2018, for example, for log-concave distributions. A distribution with density  is log-concave if the set on which the density is positive is convex and the negative logarithm of the density is a convex function on this set. It is strongly log-concave if there exists a constant  such that

(15)

The distribution is called isotropic if it is centralised and its covariance matrix is the identity matrix.

Theorem 4.

Let  be a set of i.i.d. random points sampled from an isotropic log-concave distribution in . Then the set  is Fisher-separable with probability greater than , , provided that

where and are constants, depending only on .

For strongly log-concave distributions (15) we return to the exponential estimates of .

Theorem 5.

Let  be a set of i.i.d. random points sampled from an isotropic strongly log-concave distribution in . Then the set  is Fisher-separable with probability greater than , , provided that

where the constants  and  depend only on  from the definition of strong log-concavity (15).

Remark 1.

The isotropic Gaussian distribution is strongly log-concave with . Therefore, Theorem 5 can be applied to the Gaussian distribution too.

Many other versions of the stochastic separation theorems can be proven. Qualitatively, all these results sound similar: exponentially large samples can be separated with high probability by Fisher's discriminants for essentially high-dimensional distributions without extremely heavy tails at infinity.
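The following sketch illustrates this qualitative picture numerically (assuming the Fisher-separability test ⟨x_j, x_i⟩ < α⟨x_i, x_i⟩ for centred, whitened data; the sample size, dimensions and α are illustrative): for a fixed sample size, the fraction of points that are not Fisher-separable from the rest of the sample drops quickly as the dimension grows.

```python
import numpy as np

def fisher_inseparable_fraction(X, alpha=0.8):
    """Fraction of points x_i for which some other sample point x_j violates
    <x_j, x_i> < alpha <x_i, x_i>."""
    G = X @ X.T
    diag = np.diag(G).copy()
    np.fill_diagonal(G, -np.inf)                  # exclude comparison of a point with itself
    return float(np.mean(G.max(axis=1) >= alpha * diag))

rng = np.random.default_rng(1)
M = 2000                                          # sample size (illustrative)
for n in (10, 50, 200):                           # dimensions (illustrative)
    X = rng.standard_normal((M, n))               # isotropic Gaussian sample
    print(f"n = {n:3d}: inseparable fraction = {fisher_inseparable_fraction(X):.3f}")
```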

3.4 Linear separation and SmAC distributions

Linear separation is a more general property than separation by Fisher's discriminant: it requires only the existence of a linear functional that separates a point from a set. Two approaches to constructing linear separation are standard in AI applications: the historically first Rosenblatt perceptron algorithm and the Support Vector Machine (SVM). Of course, Fisher's discriminant is robust and much simpler to compute, but one can expect more separation ability from the SVM.

General linear separability was analysed by Donoho and Tanner Donoho and Tanner (2009). They studied -element i.i.d. samples of the standard normal distribution in dimension  and noticed that if the ratio  is fixed and the dimension is large then all of the points are on the boundary of the convex hull (of course, this is true even for exponentially large sample sizes and for the stronger property, Fisher's separability, see our Theorem 5 and Remark 1). This observation contradicts intuition. They studied various distributions numerically, observed the same effects for several highly non-i.i.d. ensembles too, and formulated the open problem: 'Characterize the universality class containing the standard Gaussian: i.e. the class of matrix ensembles leading to phase transitions matching those for Gaussian polytopes.'

The standard approach assumes that the random set consists of independent identically distributed (i.i.d.) random vectors. The stochastic separation theorem presented below does not assume that the points are identically distributed. This can be very important: in real practice the new data points are not necessarily taken from the same distribution as the previous points. In this sense, the typical situation with real data flows is far from an i.i.d. sample (we are grateful to G. Hinton for this important remark). The new Theorem 6 (below) also gives an answer to the open problem of Donoho and Tanner (2009): it provides a general characterisation of the wide class of distributions for which stochastic separation theorems hold (the SmAC condition below). Roughly speaking, this class consists of distributions without sharp peaks in sets with exponentially small volume (the precise formulation is below). We call this property “SMeared Absolute Continuity” (or SmAC for short) with respect to the Lebesgue measure: absolute continuity means that sets of zero measure have zero probability, and the SmAC condition below requires that sets with exponentially small volume should not have high probability.

Consider a family of distributions, one for each pair of positive integers and . The general SmAC condition is

Definition 3.

The joint distribution of

has the SmAC property if there exist constants , , and  such that for every positive integer , any convex set  such that

any index , and any points in , we have

(16)

We remark that

  • We do not require the SmAC condition to hold for all , just for some . However, the constants , , and  should be independent of  and .

  • We do not require that are independent. If they are, (16) simplifies to

  • We do not require that are identically distributed.

  • The unit ball in the SmAC condition can be replaced by an arbitrary ball, due to rescaling.

  • We do not require the distribution to have bounded support: points are allowed to be outside the ball, but with exponentially small probability.

The following proposition establishes a sufficient condition for the SmAC condition to hold.

Proposition 2.

Assume that are continuously distributed in with conditional density satisfying

(17)

for any , any index , and any points in , where and are some constants. Then SmAC condition holds with the same , any , and .

If are independent with having density in -dimensional unit ball , then (17) simplifies to

(18)

where and are some constants.

With , (18) implies that the SmAC condition holds for probability distributions whose density is bounded by a constant times the density of the uniform distribution in the unit ball. With arbitrary , (18) implies that the SmAC condition holds whenever the ratio  grows at most exponentially in . This condition is general enough to hold for many distributions of practical interest.

Example 1.

(Unit ball) If  are i.i.d. random points from the equidistribution in the unit ball, then (18) holds with .

Example 2.

(Randomly perturbed data) Fix a parameter  (the random perturbation parameter). Let  be a set of arbitrary (not necessarily random) points inside the ball with radius  in . They might be clustered in an arbitrary way or belong to a subspace of low dimension, etc. For each , let  be a point selected uniformly at random from a ball with center  and radius . We consider  as a “perturbed” version of . Then (18) holds with , .

Example 3.

(Uniform distribution in a cube) Let  be i.i.d. random points from the equidistribution in the unit cube. Without loss of generality, we can scale the cube to have side length . Then (18) holds with .

Remark 2.

For the uniform distribution in a cube

where means Stirling’s approximation for gamma function .

Theorem 6.

Let  be a set of i.i.d. random points in  from a distribution satisfying the SmAC condition. Then  is linearly separable with probability greater than , , provided that

where the constant  depends on , , and , and  depends on  and .

For more technical details, proofs, and estimates of the constants we refer to GorbTyu2017; GorbTyuProSof2016; GorbanGrechukTykin2018; GorbanGolubGrechTyu2018.

3.5 Correctors of AI legacy systems: implementation and testing

Examples illustrating the theory have been reported in several publications and case studies Gorban, Romanenko et al. (2016); TyuGor2017knowlege. We discuss some of those results concerning the problem of correcting legacy AI systems, including Convolutional Neural Networks (CNN) trained to detect objects in images. Networks of the latter class show remarkable performance on benchmarks and, reportedly, may outperform humans on popular classification tasks such as ILSVRC He:2016. Nevertheless, as experience confirms, even the most recent state-of-the-art CNNs eventually produce inconsistent behaviour. Success in testing of AIs in the laboratory does not always lead to appropriate functioning in realistic operational conditions.

One of the most pronounced examples of such malfunctioning was published recently Foxx (2018). British police's facial recognition matches ('positives') were reported to be highly inaccurate (more than 90% false positives). Experts in statistics noted in the ensuing discussion that 'figures showing inaccuracies of 98% and 91% are likely to be a misunderstanding of the statistics and are not verifiable' face2018. Whatever the details of this discussion, the large number of false positive recognitions of 'criminals' leads to serious concerns about the security of AI use.

Re-training of large legacy AI systems requires significant computational and informational resources. On the other hand, as the theory in the previous sections suggests, these errors can be mitigated by constructing cascades of separating linear hyperplanes. Importantly, construction of these hyperplanes can be achieved with mere Fisher linear discriminants, and hence the computational complexity scales linearly with the number of samples in the data, as opposed to, e.g., Support Vector Machines, whose worst-case computational complexity scales as the cube of the sample size.

3.5.1 Single-neuron error correction of a legacy AI system

In Gorban, Romanenko et al. (2016) we considered a legacy AI system that was designed to detect images of pedestrians in a video frame. A central element of this system was a VGG-11 convolutional network Simonyan:2015. Such networks have a simple homogeneous architecture and at the same time exhibit good classification capabilities in ImageNet competitions Russ:2015. The network was successfully trained on a set of  RGB images comprising pedestrian ('positive' class) and non-pedestrian ('negative' class) images, re-sized to  pixels.

The network was then used to detect pedestrians in the NOTTINGHAM video Nottingham, consisting of  frames taken with an action camera. The video contains  images of pedestrians (identified by human inspection), and none of these images was included in the training set. A multi-scale sliding window approach was used on each video frame to provide proposals that could be run through the trained network. These proposals were re-sized to  and passed through the network for classification. Non-maximum suppression was then applied to the positive proposals. Bounding boxes were drawn, and we compared the results to the ground truth and identified false positives.

The network was able to detect of the total images of pedestrians but it also returned false positives. The task was then to check if filtering out some or all of these errors in the legacy AI’s output could be achieved with the help of a single linear functional separating these errors from correct detections.

To construct this functional we had to define a suitable linear space and the corresponding representations of the legacy AI's state in this space. In this case, the AI's state was represented by the feature vectors taken from the second to last fully connected layer of our VGG-11 network. These extracted feature vectors had  real-valued components, suggesting that the data could be embedded in . For the purposes of constructing these single-functional AI correctors we generated feature vectors () from the positive images in the AI's own training set. These feature vectors were combined into the set . In addition to the set  we generated the set  comprised of the feature vectors corresponding to the  identified false positives.

Elements of the sets and have been centered and projected onto the first principal components of . This produced pre-processed sets and . We then generated varying-size subsets of and constructed Fisher linear discriminants with the weight and threshold for , . The threshold was chosen so that .

Once the functional had been constructed and the threshold determined, the corresponding linear model was placed at the end of our detection pipeline. For any proposal given a positive score by the legacy AI network, we extracted the feature vector and then ran it through the linear model (after subtraction of the mean of  and projection onto the first principal components of the set ). Any detection that gave a non-negative score was then removed by turning its detection score negative.
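A minimal sketch of such a single-functional corrector is given below. It follows the pipeline described above (centring, projection onto the leading principal components, Fisher-type linear discriminant, thresholding), but the PCA dimension, the regularisation and the threshold rule are illustrative assumptions rather than the exact settings of the original experiments.

```python
import numpy as np

class FisherCorrector:
    """Single-neuron corrector sketch: centre on the true-positive feature set,
    project onto leading principal components, and separate false positives
    with a Fisher-type linear discriminant."""

    def __init__(self, n_components=50, reg=1e-6):
        self.n_components, self.reg = n_components, reg

    def fit(self, X_true, X_false):
        self.mean_ = X_true.mean(axis=0)
        _, _, Vt = np.linalg.svd(X_true - self.mean_, full_matrices=False)
        self.P_ = Vt[: self.n_components].T                 # PCA projection matrix
        T = (X_true - self.mean_) @ self.P_                  # projected true positives
        F = (X_false - self.mean_) @ self.P_                 # projected false positives
        # Fisher-type discriminant; the covariance of the true-positive set is
        # used here (one possible choice), regularised for numerical stability
        S = np.cov(T, rowvar=False) + self.reg * np.eye(T.shape[1])
        self.w_ = np.linalg.solve(S, F.mean(axis=0) - T.mean(axis=0))
        # threshold chosen so that no training true positive is flagged (one possible rule)
        self.theta_ = float((T @ self.w_).max())
        return self

    def flags_error(self, x):
        """True if the corrector labels a detection (feature vector x) as a false positive."""
        score = float(((x - self.mean_) @ self.P_) @ self.w_)
        return score > self.theta_
```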

Figure 10 shows the typical performance of the legacy AI system with the corrector.

Figure 10: Performance of one-neuron corrector built using Fisher linear discriminant. Left panel: the number of false positives removed as a function of the number of false positives the model was built on. Stars indicate the number of false positives used for building the model. Squares correspond to the number of false positives removed by the model. Right panel: the number of true positives removed as a function of the number of false positives removed. The actual measurements are shown as squares. Solid line is the least-square fit of a quadratic.

Single false positives were removed without any detriment to the true positive rate. Increasing the number of false positives that the single-neuron model is to filter resulted in a gradual deterioration of the true positive rate.

In addition to the linear model based on the Fisher discriminant, we also used soft-margin Support Vector Machines as a benchmark (Fig. 11). The true positives were taken from the CNN training data set.

Figure 11: Performance of one-neuron corrector constructed using the linear Support Vector Machine model. Left panel: the number of false positives removed as a function of the number of false positives the model was trained on. Stars indicate the number of false positives used for training the model. Circles correspond to the number of false positives removed by the model. Right panel: the number of true positives removed as a function of the number of false positives removed. The actual measurements are presented as circles. Solid line is the least-square fit of a quadratic.

The SVM model successfully removed false positives without affecting the true positive rate. Its performance deteriorated at a much slower rate than that of the Fisher model. This, however, is balanced by the fact that the Fisher model removed significantly more false positives than it was trained on.

From the results shown in Figs. 10, 11 it is evident that simple single-neuron correctors are capable of correcting mistakes of legacy AI systems. Filtering large numbers of errors comes at some cost to the detection of true positives, as expected. As we increase the number of false positives that we train our linear model on, the number of removed true positives starts to rise too. Our tests illustrate, albeit empirically, that the separation theorems are indeed viable in applications. It is also evident that in this particular test the generalisation capacity of the Fisher discriminant was significantly higher than that of the Support Vector Machines. As can be seen from the left panels of Figs. 10 and 11, Fisher discriminants are more efficient in 'cutting off' larger numbers of mistakes than they have been trained on. In the operational domain near the origin in Fig. 10, they do so without doing much damage to the legacy AI system. The Support Vector Machines performed better for larger numbers of false positives. This is not surprising, as their training is iterative and accounts for much more information about the data than just the sample covariance and mean.
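For completeness, a sketch of the SVM baseline used in this comparison is given below (hyperparameters are illustrative; the exact soft-margin settings of the original experiments are not reproduced here). The commented lines indicate how removed false positives and lost true positives can be counted on held-out data, as in Figs. 10 and 11.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_svm_corrector(T, F, C=1.0):
    """Train a soft-margin linear SVM on projected true positives T and false
    positives F; return a predicate that flags a feature vector as an error."""
    X = np.vstack([T, F])
    y = np.r_[np.zeros(len(T)), np.ones(len(F))]      # 1 marks a false positive
    svm = LinearSVC(C=C).fit(X, y)
    return lambda x: svm.decision_function(x.reshape(1, -1))[0] > 0

# usage sketch, with held-out data:
# corrector = fit_svm_corrector(T_train, F_train)
# removed = sum(corrector(f) for f in F_heldout)      # false positives filtered out
# lost    = sum(corrector(t) for t in T_heldout)      # true positives mistakenly removed
```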

3.5.2 Knowledge transfer between AI systems

In TyuGor2017knowlege we made a step forward from one-neuron error correction and showed that the theoretical framework of Stochastic Separation Theorems can be employed to produce simple and computationally efficient algorithms for automated knowledge spreading between AI systems. Spreading of knowledge was achieved via small add-on networks constructed from one-neuron correctors and supplementing decision-making circuits.

To illustrate the concept we considered two AI systems, a teacher and a student. These two systems were developed for the same task: to detect pedestrians in live video streams. The teacher AI was modelled by an adapted SqueezeNet iandola2016squeezenet with about K trainable parameters. The network was trained on a 'teacher' dataset comprising K non-pedestrian (negative) and K pedestrian (positive) images. The pedestrian images were then subjected to standard augmentation accounting for various geometric and colour perturbations. The student AI was modelled by a linear classifier with Histograms of Oriented Gradients features Dalal:2005 and  trainable parameters. The values of these parameters were the result of training the student AI on a “student” dataset, a sub-sample of the “teacher” dataset comprising K positives (K after augmentation) and K negatives, respectively. The choice of student and teacher AI systems enabled us to emulate the interaction between low-power edge-based AIs and their more powerful counterparts that could be deployed on a higher-spec embedded system or, possibly, on a server or in a computational cloud.

The approach was tested on several video sequences, including the NOTTINGHAM video (Nottingham, ) we used before. For each frame, the student AI system decided whether it contained an image of a pedestrian. These proposals were then evaluated by the teacher AI. If a pedestrian shape was confirmed by the teacher AI in the reported proposal, then no action was taken. Otherwise an error (false positive) was reported, and the corresponding proposal, along with its feature vector, was retained for the purposes of uptraining and testing the student AI. The set of errors was then used to produce new training and testing sets for the student AI. This was followed by the construction of small networks correcting the errors of the student AI (see TyuGor2017knowlege for further details). It is worthwhile to note that positives from the video were not included in the training set used to produce these networks. Performance of the student AI with and without correcting networks on the NOTTINGHAM video is illustrated in Fig. 12. As we can see from the figure, new knowledge (corrections of erroneous detections) has been successfully spread to the student AI. Moreover, the uptrained student system showed some degree of generalisation of the new skills, which can be explained by correlations between the feature vectors corresponding to false positives.

Figure 12: Performance of the student AI on the training set (left panel) and on the testing set (right panel). Red circles correspond to student AI without correcting networks. Blue squares capture performance of the student AI with a correcting network comprised of nodes, and green triangles correspond to the student AI supplemented with a -node correcting network.

4 Encoding and rapid learning of memories by single neurons

4.1 The problems with codifying memories

Some of the memory functions are performed by stratified brain structures, e.g., the hippocampus. The CA1 region of the hippocampus is constituted by a monolayer of morphologically similar pyramidal cells oriented with their main axis in parallel (Fig. 13a). One of the major excitatory inputs to these neurons comes from the CA3 region through Schaffer collaterals Amaral1989; Ishizuka1990; Wittner2007, which can be considered as a hub routing information among various brain structures. Each CA3 pyramidal neuron sends an axon that bifurcates and leaves multiple collaterals in the CA1 with dominant parallel orientation (Fig. 13b). This structural organisation allows multiple parallel axons conveying multidimensional 'spatial' information from one area (CA3) to simultaneously leave synaptic contacts on multiple neurons in another area (CA1). Thus, we have simultaneous convergence and divergence of the information content (Fig. 13b, right).

Figure 13: Organisation of encoding memories by single neurons in laminar structures: a) Laminar organization of the CA3 and CA1 areas in the hippocampus facilitates multiple parallel synaptic contacts between neurons in these areas by means of Schaffer collaterals; b) Axons from CA3 pyramidal neurons bifurcate and pass through the CA1 area in parallel (left panel) giving rise to the convergence-divergence of the information content (right panel). Multiple CA1 neurons receive multiple synaptic contacts from CA3 neurons; c) Schematic representation of three memory encoding schemes: (1) Selectivity. A neuron (shown in green) receives inputs from multiple presynaptic cells that code different information items. It detects (responds to) only one stimulus (purple trace), while rejecting the others; (2) Clustering. Similar to 1, but now a neuron (shown in blue) detects a group of stimuli (purple and blue traces) and ignores the others; (3) Acquiring memories. A neuron (shown in red) learns dynamically a new memory item (blue trace) by associating it with a known one (purple trace).

Experimental findings suggest that there exist mechanisms for an efficient increase of the selectivity of individual synaptic contacts and, in turn, for codifying memories:

  • The multiple CA1 pyramidal cells distributed in the rostro-caudal direction are activated near-synchronously by assemblies of simultaneously firing CA3 pyramidal cells Ishizuka1990 ; Li1994 ; Benito2014 .

  • Thus, an ensemble of single neurons in the CA1 can receive simultaneously the same synaptic input (Fig. 13b, left).

  • These neurons have different topology and functional connectivity Finnerty1983; therefore, their responses to the same input can be different.

  • Experimental in-vivo results show that long term potentiation can significantly increase the spike transfer rate in the CA3-CA1 pathway Fernandez2012 .

We can now formulate the fundamental questions about information encoding and the formation of memories by single neurons and their ensembles in laminated brain structures TyukinBrain2017:

  1. Selectivity: Detection of one stimulus from a set (Fig. 13c.1). Pick an arbitrary stimulus from a reasonably large set such that a single neuron from a neuronal ensemble detects this stimulus, i.e. generates a response. What is the probability that this neuron is stimulus-specific, i.e., it rejects all the other stimuli from the set?

  2. Clustering: Detection of a group of stimuli from a set (Fig. 13c.2). Within a set of stimuli we select a smaller subset, i.e., a group of stimuli. What is the probability that a neuron detecting all stimuli from this subset stays silent for all remaining stimuli in the set? Solution of this problem is expected to depend on the similarity of the stimuli inside the group and their dissimilarity from the remainder.

  3. Acquiring memories: Learning new stimulus by associating it with one already known (Fig. 13c.3). Let us consider two different stimuli and such that for they do not overlap in time and a neuron detects , but not . In the next interval , the stimuli start to overlap in time (i.e., they stimulate the neuron together). For the neuron receives only stimulus . What is the probability that for some the neuron detects ?

It may seem that for large sets of stimuli the probabilities in the stated questions should be small. Nevertheless, this is not the case, and we aim to show that they can be close to one even for exponentially large (with respect to dimension) sets of different stimuli.

Of course, evaluation of these probabilities depends on the 'Outer World Models' and, therefore, cannot be exact. The standard machine learning assumptions Vapnik (2000); cucker2002mathematical are never correct but can bring some clarity before more general and realistic models are employed. Following this standard, we assume that the stimuli are generated in accordance with some distribution. Then, all stimuli that a neuron may receive are i.i.d. samples from this distribution. Once a sample is generated, a stimuli sub-sample is independently selected for testing purposes. Relaxing these assumptions is possible (see the theorems above). Nevertheless, they remain an important and useful first step of the analysis.

The outer world model should be supplemented by a model of a neuron and its adaptation. In this section we show how the three fundamental questions formulated above can be addressed within a simple classical modelling framework, where a neuron is represented by a perceptron equipped with Hebbian-type learning. The efficiency of the mixture of learnable single neurons increases with the dimension of the data, as expected in the light of the stochastic separation theorems.

4.2 The model: signals, neurons and learning

We follow a conventional and rather general functional representation of signalling in the neuronal pathways. We assume that upon receiving an input, a neuron can either generate a response or remain silent. Forms of the neuronal responses as well as the definitions of synaptic inputs vary from one model to another. Therefore, here we adopt a rather general functional approach. By a stimulus we understand a number of excitations simultaneously (or within a short time window) arriving at a neuron through several axons (Fig. 13b) and thus transmitting some 'spatially coded' information items Benito2016. If the neuron responds to a stimulus (e.g., generates output spikes or increases its firing rate), we then say that the neuron detects the informational content of the given stimulus.

Consider a neuronal aggregate consisting of independent neurons. Each neuron receives a sequence of external stimuli through synaptic contacts. In what follows we will refer to as the ‘neuronal dimension’. Assume that there are different stimuli ( can be large but finite), the th individual stimulus is modeled by a function

(19)

where is the -dimensional vector of the stimulus content, its coordinates codify the information transmitted by individual axons, are the time moments of the th stimulus presentation, the function defines the time window of the stimulus context. For simplicity, we use a rectangular window:

(20)

where is the window width.

We assume that the frames of the presentation of the same stimulus do not overlap: for all , .

The overall input is the sum of the individual stimuli (19):

(21)

High dimension of the outer world is the crucial assumption: we analyse the high-dimensional brain in a high-dimensional world. Formally, this means that the distribution of the vectors of the information content of stimuli is essentially high-dimensional. Various types of this assumption were discussed in Sec. 3: uniform distributions in -dimensional balls or cubes, Gaussian distributions, log-concave distributions, etc.

We use the simple classical threshold model of neurons. Each neuron has  synaptic inputs. The -dimensional vector of synaptic weights is . The membrane potential (the summarised synaptic input) of the neuron is the inner product . The neuron generates a response , where  is the firing threshold and  is the transfer function;  for  and  for . In computational experiments we used the 'ramp' activation function . Given a choice of the activation function , the neuron is completely determined by the pair .
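In code, this model reduces to a few lines (a sketch assuming the response has the form y = f(⟨w, s⟩ − θ) with the ramp transfer function):

```python
import numpy as np

def neuron_response(w, s, theta):
    """Threshold neuron: the membrane potential is the inner product (w, s);
    the ramp transfer function returns 0 below the firing threshold theta."""
    u = np.dot(w, s) - theta
    return max(u, 0.0)
```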

This threshold model disregards the specific dynamics of the neuronal response but even this simple model is already sufficient to clarify the fundamental problems of encoding memories TyukinBrain2017 .

Synaptic plasticity is necessary for the appearance of encoded memories. The dynamics of the synaptic weights depends on the stimulus , the summarised synaptic input , and the neuronal response . According to a modified classical Oja rule Oja1982, it is constructed as a pursuit of the vector  by the synaptic weights vector :

(22)

with and nonzero initial conditions

The difference from the classical Oja model is in the factor : the rate of plasticity is proportional to the neuronal response. It vanishes for zero response: the pursuit of the stimulus content by the synaptic weights is controlled by the neuronal response.

Solutions of the proposed model (22) remain bounded in forward time. Indeed, after multiplication of (22) by we obtain

(23)

where , , and by construction. Therefore, if then and does not grow.
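A discrete-time sketch of this plasticity is shown below. The exact form of (22) is given in TyukinBrain2017 and is not reproduced in this excerpt; the sketch assumes a pursuit rule of the form dw/dt = αy(s − w), i.e. the synaptic weights chase the stimulus content only while the neuron responds, which is consistent with the description above (the rate of plasticity is proportional to the response and vanishes when the neuron is silent).

```python
import numpy as np

def simulate_plasticity(stimuli, w0, theta, alpha=0.05, dt=1.0):
    """Assumed pursuit-type plasticity gated by the neuronal response
    (an illustrative stand-in for the modified Oja rule (22))."""
    w = np.array(w0, dtype=float)
    for s in stimuli:                              # s: stimulus content vector at this step
        y = max(np.dot(w, s) - theta, 0.0)         # ramp response of the neuron
        w += dt * alpha * y * (s - w)              # weights pursue s only while y > 0
    return w
```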

4.3 Selectivity 1: detection of one stimulus from a set

All the stochastic separation theorems are, in their essence, theorems about selective detection of one stimulus from a large set (with rejection of all other stimuli from this set). For example, Proposition 1 gives the following selectivity theorem (cf. TyukinBrain2017 ).

Theorem 7.

Let -dimensional vectors of the stimuli content belong to a ball in with center (mean stimulus) and radius . Let , , and . Assume that a new stimulus content is sampled randomly from the equidistribution in this ball. Then with probability the neuron with

(24)

detects selectively the stimulus with content and rejects all the stimuli with content ().

Expression for in (24) connects the neural threshold with the threshold in Fisher’s discriminant (1).

Robustness of separation is an important question: how wide is the interval of possible values of the synaptic weights for separation of the stimuli? The following extension of Theorem 7 gives the answer to this question.

Theorem 8.

Let the conditions of Theorem 7 hold, , and . Then with probability every neuron with

(25)

detects selectively the stimulus with content and rejects all the stimuli with content ().

To prove this theorem, it is sufficient to add a spherical layer of thickness  to each ball of excluded volume (the coloured ball in Fig. 7) and repeat the proof of Proposition 1.

The parameter  can be called the relative robustness. It is the ratio of the possible allowed deviation to the radius of the ball that includes the stimuli content vectors. According to Theorem 8, the allowed interval of  is . The upper bound on the number of stimuli decreases if  grows, for a given probability bound of wrong detection , Fisher's threshold , and dimension of the input signals .

Other estimates of the level of robustness for equidistribution of all stimuli content vectors in a ball were presented in TyukinBrain2017 .

The neuronal threshold  in (24), (25) is a sum of two terms. The first term is just the right-hand side of the definition of Fisher's discriminant (1). The plasticity dynamics (22) produces normalised vectors of synaptic weights if the neuron is not silent on average (i.e., the average value of the coefficient  in (23) does not vanish for large times). For normalised vectors, the first term in the expression for  coincides with Fisher's discriminant threshold .

The second term,  appears because the stimuli content vectors are not assumed to be centralised. Of course, for batch algorithms of machine learning, the input vectors are known in advance and centralisation is a standard routine operation. This approach is also convenient for theoretical analysis. The situation is different for on-line learning: the stimuli are not known in advance and, moreover, there may be a 'concept drift', and the sequence of stimuli may be very far from a stationary random process. Precise centralisation with a constant centre is impossible in such cases, but various moving average centralisation algorithms are applicable. In these methods,  depends on time and is evaluated as an average over a time frame, an average with an exponential kernel, or with a linear combination of exponential kernels with different exponents, etc. Such methods are very popular for the analysis of non-stationary time series, from financial analysis and econometrics to biometrics (see, for example, Brock1992).
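A minimal sketch of such on-line centralisation with an exponential moving average is given below (the decay rate is illustrative); the neuron then receives the 'effective stimulus', i.e. the current stimulus minus the running estimate of the mean.

```python
import numpy as np

def ema_centred_stream(stimuli, lam=0.01):
    """On-line centralisation: x_mean tracks the drifting mean of the stimulus
    stream with an exponential moving average; yields the effective stimuli."""
    x_mean = None
    for s in stimuli:
        s = np.asarray(s, dtype=float)
        x_mean = s.copy() if x_mean is None else (1.0 - lam) * x_mean + lam * s
        yield s - x_mean                           # effective stimulus passed to the neuron
```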

Recently, it was demonstrated in vivo that the ‘threshold adaptation’ with various moving averages is an important part of adaptation of neurons. Slow voltage fluctuations do not contribute to spiking because they are filtered by threshold adaptation and the relevant time-dependent variable is not the membrane potential, but rather its distance to a dynamic threshold, which is called the ‘effective signal’ Fontaine2014 . The spike threshold depends on the history of the membrane potential on many temporal scales and the threshold adaptation allows neurons to compute specifically and robustly and better detect well-correlated inputs Huang2016 .

These ideas lead to a cascade of adaptation models. Instead of one model (22), we have several pursuit models. For example, for two time scales (adaptation of the average and adaptation of the neurons) we can write two equations:

(26)

where the rate of adaptation , is the vector of ‘effective stimulus’, is the ‘effective signal’ Fontaine2014 , and is the neuronal response.

There is empirical evidence that the rate of average threshold adaptation depends on the neuronal response of a group of neurons and tends to zero if they remain silent Huang2016. For example, we can assume , where  are the neuronal responses of the neurons from the group.

Moreover, there may exist multiscale adaptation processes with different 'averages' : the long-term average, the average deviation from the long-term average during a shorter time, etc. Slower processes with smaller  can correspond to larger groups of neurons. The construction of such a cascade of equations is straightforward. The last equation in this cascade describes the adaptation of an individual neuron, as written in (26).

4.4 Selectivity 2 (clustering): Detection of a group of stimuli from a set

For many models of the stimuli distribution, the possibility of selective detection of stimuli from a cluster, with rejection of all other stimuli, is determined by the separability of this cluster from the average stimuli content . Consider a set of 'relevant' stimuli content vectors that should be separated from all 'irrelevant' stimuli. Assume that the stimuli content vectors are uniformly distributed in the -dimensional ball with centre  and radius , and that there exist a normalised weight vector  and  such that

(27)

for all .

The cluster is contained in a half of the ball of radius