During the last two decades, the curse of dimensionality in data analysis was complemented by the blessing of dimensionality: if a dataset is essentially high-dimensional then, surprisingly, some problems get easier and can be solved by simple and robust old methods. The curse and the blessing of dimensionality are closely related, like two sides of the same coin. The research landscape of these phenomena is gradually becoming more complex and rich. New theoretical achievements and applications provide a new context for old results. The single-cell revolution in neuroscience, phenomena of grandmother cells and sparse coding discovered in the human brain meet the new mathematical ‘blessing of dimensionality’ ideas. In this mini-review, we aim to provide a short guide to new results on the blessing of dimensionality and to highlight the path from the curse of dimensionality to the blessing of dimensionality. The selection of material and angle of view is based on our own experience. We are not trying to cover everything in the subject of review, but rather fill in the gaps in existing tutorials and surveys.
R. Bellman Bellman (1957), in the preface to his book, discussed the computational difficulties of multidimensional optimization and summarized them under the heading “curse of dimensionality”. He proposed to re-examine the situation, not as a mathematician, but as a “practical man” Bellman (1954), and concluded that the price of excessive dimensionality “arises from a demand for too much information”. Dynamic programming was considered a method of dimensionality reduction in the optimization of a multi-stage decision process. Bellman returned to the problem of dimensionality reduction many times in different contexts Bellman and Kalaba (1961). Now, dimensionality reduction is an essential element of the engineering (the “practical man”) approach to mathematical modeling Gorban et al. (2006). Many model reduction methods were developed and successfully implemented in applications, from various versions of principal component analysis to approximation by manifolds, graphs, and complexes Jolliffe (1993); Gorban et al. (2008); Gorban and Zinovyev (2010), and low-rank tensor network decompositions Cichocki et al. (2016, 2017).
Various reasons and forms of the curse of dimensionality were classified and studied, from the obvious combinatorial explosion (for example, for $n$ binary Boolean attributes, to check all the combinations of values we have to analyze $2^n$ cases) to more sophisticated distance concentration: in a high-dimensional space, the distances between randomly selected points tend to concentrate near their mean value, and the neighbor-based methods of data analysis become useless in their standard forms Beyer et al. (1999); Pestov (2013). Many “good” polynomial time algorithms become useless in high dimensions.
Surprisingly, however, and despite the expected challenges and difficulties, common-sense heuristics based on the simplest and most straightforward methods “can yield results which are almost surely optimal” for high-dimensional problems Kainen (1997). Following this observation, the term “blessing of dimensionality” was introduced Kainen (1997); Brown et al. (1997). It was clearly articulated as a basis of future data mining in the Donoho “Millennium manifesto” Donoho (2000). After that, the effects of the blessing of dimensionality were discovered in many applications, for example in face recognition Chen et al. (2013), in analysis and separation of mixed data that lie on a union of multiple subspaces from their corrupted observations Liu et al. (2016), in multidimensional cluster analysis Murtagh (2009), in learning large Gaussian mixtures Anderson et al. (2014), in correction of errors of multidimensional machine learning systems Gorban et al. (2016), in evaluation of statistical parameters Li et al. (2018), in the development of generalized principal component analysis that provides low-rank estimates of the natural parameters by projecting the saturated model parameters Landgraf and Lee (2019), in recovering a vector of signals from corrupted measurements Candes et al. (2005), and even in such specific problems as analysis and classification of EEG patterns for attention deficit hyperactivity disorder diagnosis Pereda et al. (2018).
There exist exponentially large sets of pairwise almost orthogonal vectors (‘quasiorthogonal’ bases, Kainen and Kůrková (1993)) in Euclidean space. It was noticed in the analysis of several $n$-dimensional random vectors drawn from the standard Gaussian distribution with zero mean and identity covariance matrix that all the rays from the origin to the data points have approximately equal length and are nearly orthogonal, and that the distances between data points are all about $\sqrt{2}$ times larger Hall et al. (2005). This observation holds even for exponentially large samples (of the size $\exp(an)$ for some $a > 0$, which depends on the degree of the approximate orthogonality) Gorban et al. (2016). Projection of a finite data set on random bases can reduce dimension with preservation of the ratios of distances (the Johnson–Lindenstrauss lemma Dasgupta and Gupta (2003)).
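The geometric picture described above is easy to probe numerically. The following is a minimal sketch, not taken from the cited works: it draws a standard Gaussian sample and reports the relative spread of the vector lengths, the largest absolute cosine between distinct rays, and the ratio of the mean pairwise distance to the mean length. The function name `gaussian_geometry`, the sample size and the seed are illustrative assumptions.

```python
import numpy as np


def gaussian_geometry(n_dim, n_points=200, seed=0):
    """Summarise lengths, pairwise cosines and pairwise distances of a Gaussian sample."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_points, n_dim))
    norms = np.linalg.norm(x, axis=1)

    # Cosines of the angles between all distinct pairs of rays from the origin.
    unit = x / norms[:, None]
    upper = np.triu_indices(n_points, k=1)
    cosines = (unit @ unit.T)[upper]

    # Pairwise distances via the Gram matrix: |a - b|^2 = |a|^2 + |b|^2 - 2(a, b).
    gram = x @ x.T
    squared = norms[:, None] ** 2 + norms[None, :] ** 2 - 2.0 * gram
    distances = np.sqrt(np.maximum(squared[upper], 0.0))

    return {
        "relative_norm_spread": float(norms.std() / norms.mean()),
        "max_abs_cosine": float(np.abs(cosines).max()),
        "distance_to_norm_ratio": float(distances.mean() / norms.mean()),
    }


if __name__ == "__main__":
    for n_dim in (3, 100, 10000):
        print(n_dim, gaussian_geometry(n_dim))
```

As the dimension grows, the first two quantities should shrink and the third should approach $\sqrt{2}$, in line with the observation quoted from Hall et al. (2005).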
Such an intensive flux of works assures us that we should not fear or avoid large dimensionality. We just have to use it properly. Each application requires a specific balance between the extraction of important low-dimensional structures (‘reduction’) and the use of the remarkable properties of high-dimensional geometry that underlie statistical physics and other fundamental results Gorban and Tyukin (2018); Vershynin (2018).
Both the curse and the blessing of dimensionality are the consequences of the measure concentration phenomena Giannopoulos and Milman (2000); Ledoux (2005); Gorban and Tyukin (2018); Vershynin (2018). These phenomena were discovered in the development of the statistical foundations of thermodynamics. Maxwell, Boltzmann, Gibbs, and Einstein found that for many particles the distribution functions have surprising properties. For example, the Gibbs theorem of ensemble equivalence Gibbs (1960) states that a physically natural microcanonical ensemble (with fixed energy) is statistically equivalent (provides the same averages of physical quantities in the thermodynamic limit) to a maximum entropy canonical ensemble (the Boltzmann distribution). A simple geometric example of similar equivalence is the ‘thin shell’ concentration for balls: the volume of a high-dimensional ball is concentrated near its surface. Moreover, a high-dimensional sphere is concentrated near any equator (waist concentration; the general theory of such phenomena was elaborated by M. Gromov Gromov (2003)). P. Lévy Lévy (1951) analysed these effects and proved the first general concentration theorem. Modern measure concentration theory is a mature mathematical discipline with many deep results, comprehensive reviews Giannopoulos and Milman (2000), books Ledoux (2005); Dubhashi and Panconesi (2009), advanced textbooks Vershynin (2018), and even elementary geometric introductions Ball (1997). Nevertheless, surprising counterintuitive results continue to appear and push new achievements in machine learning, Artificial Intelligence (AI), and neuroscience.
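The ‘thin shell’ effect mentioned above follows from the fact that the volume of a ball of radius $R$ scales as $R^n$. A small sketch of this arithmetic is given below; the function name `shell_volume_fraction` and the chosen shell width are illustrative assumptions, not part of the cited theory.

```python
def shell_volume_fraction(n_dim, eps):
    """Fraction of an n-dimensional ball's volume within distance eps*R of its surface."""
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must lie in [0, 1]")
    # Volume scales as R^n, so the inner ball of radius (1 - eps)R holds (1 - eps)^n of it.
    return 1.0 - (1.0 - eps) ** n_dim


if __name__ == "__main__":
    for n_dim in (2, 10, 100, 1000):
        print(n_dim, shell_volume_fraction(n_dim, 0.05))
```

For a fixed relative width of the shell, the fraction tends to one geometrically as the dimension increases.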
This mini-review focuses on several novel results: stochastic separation theorems and evaluation of goodness of clustering in high dimensions, and on their applications to corrections of AI errors. Several possible applications to the dynamics of selective memory in the real brain and ‘simplicity revolution in neuroscience’ are also briefly discussed.
2 Stochastic Separation Theorems
2.1 Blessing of Dimensionality Surprises and Correction of AI Mistakes
D. Donoho and J. Tanner Donoho and Tanner (2009) formulated several ‘blessing of dimensionality’ surprises. In most cases, they considered $M$ points sampled independently from a standard normal distribution in dimension $n$. Intuitively, we expect that some of the points will lie on the boundary of the convex hull of these points, and the others will be in the interior of the hull. However, for large $n$ and $M$, this expectation is wrong. This is the main surprise. With a high probability $p > 1 - \varepsilon$, all $M$ random points are vertices of their convex hull. It is sufficient that $M < b\exp(an)$ for some $a$ and $b$ that depend on $\varepsilon$ only Gorban et al. (2018, 2019). Moreover, with a high probability, each segment connecting a pair of vertices is also an edge of the convex hull, and any simplex with $k$ vertices from the sample is a $(k-1)$-dimensional face of the convex hull for some range of values of $k$. For uniform distributions in a ball, similar results were proved earlier by I. Bárány and Z. Füredi Bárány and Füredi (1988). According to these results, each point of a random sample can be separated from all other points by a linear functional, even if the set is exponentially large.
Such a separability is important for the solution of a technological problem of fast, robust and non-damaging correction of AI mistakes Gorban and Tyukin (2018); Gorban et al. (2018, 2019). AI systems make mistakes and will make mistakes in the future. If a mistake is detected, then it should be corrected. The complete re-training of the system requires too many resources and is rarely applicable to the correction of a single mistake. We proposed to use additional simple machine learning systems, correctors, for separation of the situations with higher risk of mistake from the situations with normal functioning Gorban et al. (2016); Gorban and Tyukin (2017) (Figure 1). The decision rules should be changed for situations with higher risk. Inputs for correctors are: the inputs of the original AI system, the outputs of this system and (some) internal signals of this system Gorban et al. (2018, 2019). The construction of correctors for AI systems is crucial in the development of the future AI ecosystems.
Correctors should Gorban and Tyukin (2018):
not damage the existing skills of the AI system;
allow fast non-iterative learning;
correct new mistakes without destroying the previous fixes.
Of course, if an AI system made too many mistakes then their correctors could conflict. In such a case, re-training is needed with the inclusion of new samples.
2.2 Fisher Separability
Linear separation of data points from datasets Bárány and Füredi (1988); Donoho and Tanner (2009) is a good candidate for the development of AI correctors. Nevertheless, from the ‘practical man’ point of view, one particular case, Fisher’s discriminant Fisher (1936), is preferable to the general case because it allows one-shot and explicit creation of the separating functional.
Consider a finite data set $Y \subset \mathbb{R}^n$ without any hypothesis about the probability distribution. Let $(\cdot\,, \cdot)$ be the standard inner product in $\mathbb{R}^n$. Let us define Fisher’s separability following Gorban et al. (2018). A point $x$ is Fisher-separable from a finite set $Y$ with a threshold $\alpha$ ($\alpha \in [0, 1)$) if
$$(x, y) \le \alpha\,(x, x) \quad \text{for all } y \in Y. \tag{1}$$
This definition coincides with the textbook definition of Fisher’s discriminant if the data set $Y$ is whitened, which means that the mean point is in the origin and the sample covariance matrix is the identity matrix. Whitening is often a simple by-product of principal component analysis (PCA) because, on the basis of principal components, whitening is just the normalization of coordinates to unit variance. Again, following the ‘practical’ approach, we stress that precise PCA and whitening are not necessary; rather, an a priori bounded condition number is needed: the ratio of the maximal and the minimal eigenvalues of the empirical covariance matrix after whitening should not exceed a given number, independently of the dimension.
A finite set $F$ is called Fisher-separable if each point is Fisher-separable from the rest of the set (Definition 3, Gorban et al. (2018)).
A finite set $F$ is called Fisher-separable with threshold $\alpha$ if inequality (1) holds for all $x, y \in F$ such that $x \ne y$. The set $F$ is called Fisher-separable if there exists some $\alpha$ ($0 \le \alpha < 1$) such that $F$ is Fisher-separable with threshold $\alpha$.
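The definitions above translate directly into a few lines of linear algebra. The sketch below assumes that inequality (1) has the form $(x, y) \le \alpha (x, x)$ and that the data are already centred and approximately whitened; the function names `is_fisher_separable` and `inseparable_fraction` and the default $\alpha = 0.8$ are illustrative choices, not notation from the cited papers.

```python
import numpy as np


def is_fisher_separable(x, others, alpha=0.8):
    """True if (x, y) <= alpha * (x, x) for every row y of `others`."""
    x = np.asarray(x, dtype=float)
    others = np.atleast_2d(np.asarray(others, dtype=float))
    return bool(np.all(others @ x <= alpha * (x @ x)))


def inseparable_fraction(points, alpha=0.8):
    """Fraction of points that are not Fisher-separable from the rest of the set."""
    points = np.asarray(points, dtype=float)
    gram = points @ points.T                 # gram[i, j] = (x_i, x_j)
    limit = alpha * np.diag(gram)[:, None]   # alpha * (x_i, x_i), one value per row
    violated = gram > limit
    np.fill_diagonal(violated, False)        # a point is not compared with itself
    return float(violated.any(axis=1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for n_dim in (2, 10, 50, 300):
        sample = rng.standard_normal((1000, n_dim))
        print(n_dim, inseparable_fraction(sample))
```

A set is Fisher-separable with the given threshold exactly when `inseparable_fraction` returns zero. The whole check costs one Gram matrix, which reflects the one-shot, non-iterative character of the Fisher discriminant stressed above.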
2.3 Stochastic Separation for Distributions with Bounded Support
Let us analyse the separability of a random point from a finite set in the $n$-dimensional unit ball $\mathbb{B}_n$. Consider the distributions that can deviate from the equidistribution, and these deviations can grow with dimension $n$ but not faster than the geometric progression with the common ratio $1/r > 1$, and, hence, the maximal density $\rho_{\max}$ satisfies:
$$\rho_{\max} < \frac{C}{r^n V_n(\mathbb{B}_n)}, \tag{3}$$
where the constant $C > 0$ does not depend on $n$ and $V_n(\mathbb{B}_n)$ is the volume of the unit ball.
For such a distribution in the unit ball, the probability to find a random point in the excluded volume (Figure 2) tends to 0 as a geometric progression with the common ratio $1/(2r\alpha)$ when $n \to \infty$.
(Theorem 1, Gorban et al. (2018)) Let $1 > \alpha > 1/2$, $1 > r > 1/(2\alpha)$, and let $Y \subset \mathbb{B}_n$ be a finite set. Assume that a probability distribution in the unit ball has a density with maximal value $\rho_{\max}$, which satisfies inequality (3). Then the probability $p$ that a random point from this distribution is Fisher-separable from $Y$ is $p = 1 - \psi$, where the probability of inseparability
$$\psi \le \frac{|Y|\,C}{(2r\alpha)^n}.$$
Let us evaluate the probability that a random set $Y$ is Fisher-separable. Assume that each point of $Y$ is randomly selected from a distribution that satisfies (3). These distributions could be different for different $y \in Y$.
(Theorem 2, Gorban et al. (2018)) Assume that a probability distribution in the unit ball has a density with maximal value $\rho_{\max}$, which satisfies inequality (3). Let $1 > \alpha > 1/2$ and $1 > r > 1/(2\alpha)$. Then the probability $p$ that $Y$ is Fisher-separable is $p = 1 - \psi$, where the probability of inseparability
$$\psi \le \frac{|Y|\,(|Y| - 1)\,C}{(2r\alpha)^n}.$$
The difference from Theorem 1 is that here $|Y| < \sqrt{\delta/C}\,(2r\alpha)^{n/2}$ guarantees $\psi < \delta$, whereas in Theorem 1 $|Y| < (\delta/C)\,(2r\alpha)^n$ suffices. Again, $|Y|$ can grow exponentially with the dimension as the geometric progression with the common factor $\sqrt{2r\alpha} > 1$, while $\rho_{\max}$ cannot grow faster than the geometric progression with the common factor $1/r > 1$.
For illustration, Gorban and Tyukin (2017) give a numerical example for an i.i.d. sample $Y$ from the uniform distribution in the 100-dimensional ball, with an explicit sample size and an explicit probability that this set is Fisher-separable.
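A small Monte Carlo experiment can make the separation effect tangible. The sketch below is an illustration under stated assumptions rather than a reproduction of the cited estimates: points are drawn uniformly from the unit ball, the inseparability condition is taken as $(x, y) > \alpha (x, x)$, and the empirical rate is compared with the simple volume-based estimate $|Y| / (2\alpha)^n$, which holds for the uniform distribution when $\alpha > 1/2$ because the excluded region for each $y$ is a ball of radius at most $1/(2\alpha)$. All function names are hypothetical.

```python
import numpy as np


def sample_uniform_ball(n_points, n_dim, rng):
    """Draw points uniformly from the unit ball in R^n_dim."""
    directions = rng.standard_normal((n_points, n_dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = rng.random(n_points) ** (1.0 / n_dim)   # radial CDF is r^n
    return directions * radii[:, None]


def point_inseparability_rate(n_dim, set_size, alpha=0.8, trials=200, seed=0):
    """Empirical probability that a random point is not Fisher-separable from a random set."""
    rng = np.random.default_rng(seed)
    failures = 0
    for _ in range(trials):
        y = sample_uniform_ball(set_size, n_dim, rng)
        x = sample_uniform_ball(1, n_dim, rng)[0]
        failures += bool(np.any(y @ x > alpha * (x @ x)))
    return failures / trials


def union_bound(n_dim, set_size, alpha=0.8):
    """Volume-based upper estimate |Y| / (2 alpha)^n for the uniform distribution."""
    return min(1.0, set_size / (2.0 * alpha) ** n_dim)


if __name__ == "__main__":
    for n_dim in (2, 5, 10, 20, 40):
        print(n_dim, point_inseparability_rate(n_dim, 100), union_bound(n_dim, 100))
```

The estimate is crude in low dimensions, where it saturates at one, but it decays geometrically with the dimension, which is the behaviour the theorems of this section describe.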
V. Kůrková Kůrková (2019) emphasized that many attractive measure concentration results are formulated for i.i.d. samples from very simple distributions (Gaussian, uniform, etc.), whereas the reality of big data is very different: the data are not i.i.d. samples from simple distributions. Indeed, the machine learning theory based on the i.i.d. assumption should be revised Gorban et al. (2019). In the theorems above two main restrictions were employed: the probability of a set occupying relatively small volume could not be large (3), and the support of the distribution is bounded. The requirement of identical distribution of different points is not needed. The independence of the data points can be relaxed Gorban et al. (2018). The boundedness of the support of distribution can be transformed to the ‘not-too-heavy-tail’ condition. The condition ‘sets of relatively small volume should not have large probability’ remains in most generalisations. It can be considered as ‘smeared absolute continuity’ because absolute continuity means that the sets of zero volume have zero probability. Theorems 1 and 2 have numerous generalisations Gorban et al. (2018, 2019); Grechuk (2019); Kůrková and Sanguineti (2019). Let us briefly list some of them:
Log-concave distributions (a distribution with density $\rho(x)$ is log-concave if the set $D = \{x \mid \rho(x) > 0\}$ is convex and $g(x) = -\log \rho(x)$ is a convex function on $D$). In this case, the possibility of an exponential (non-Gaussian) tail brings a surprise: the upper size bound of the random set $|Y|$, sufficient for Fisher-separability in high dimensions with high probability, grows with dimension $n$ as $\sim e^{a\sqrt{n}}$, i.e. slower than exponential (Theorem 5, Gorban et al. (2018)).
Strongly log-concave distributions. A log-concave distribution with density $\rho(x)$ is strongly log-concave if there exists a constant $c > 0$ such that, for $g(x) = -\log \rho(x)$ and all $x, y$ in the support of $\rho$,
$$\frac{g(x) + g(y)}{2} - g\!\left(\frac{x + y}{2}\right) \ge c\,\|x - y\|^2.$$
In this case, we return to the exponential estimation of the maximal allowed size of $|Y|$ (Corollary 4, Gorban et al. (2018)). The comparison theorems Gorban et al. (2018) allow us to combine different distributions, for example a distribution in a ball that satisfies the conditions of the theorems of Section 2.3 with the log-concave or strongly log-concave tail outside the ball.
The kernel versions of the stochastic separation theorem were found, proved and applied to some real-life problems Tyukin et al. (2019).
There are also various estimations beyond the standard i.i.d. hypothesis Gorban et al. (2018) but the general theory is yet to be developed.
2.5 Some Applications
The correction methods were tested on various AI applications for videostream processing: detection of faces for security applications and detection of pedestrians Gorban et al. (2018); Meshkinfamfard et al. (2018); Gorban et al. (2019), translation of Sign Language into text for communication between deaf-mute people Tyukin et al. (2019), knowledge transfer between AI systems Tyukin et al. (2018), medical image analysis, scanning and classifying archaeological artifacts Allison et al. (2018), etc., and even on some industrial systems with a relatively high level of errors Tyukin et al. (2019).
Application of the corrector technology to image processing was patented together with industrial partners Romanenko et al. (2019). A typical test of correctors’ performance is described below. For more detail of this test, we refer to Gorban et al. (2019). A convolutional neural network (CNN) was trained to detect pedestrians in images. A set of 114,000 positive pedestrian and 375,000 negative non-pedestrian RGB images, re-sized to a common size, was collected and used as a training set. The testing set comprised 10,000 positives and 10,000 negatives. The training and testing sets did not intersect.
We investigated in the computational experiments whether it is possible to take one of the cutting-edge CNNs and train a one-neuron corrector to eliminate all the false positives produced. We also looked at what effect this corrector had on true positive numbers.
For each positive and false positive we extracted the output of the second-to-last fully connected layer of the CNN. These extracted feature vectors have dimension 4096. We applied PCA to reduce the dimension and analyzed how the effectiveness of the correctors depends on the number of principal components retained. This number varied in our experiments from 50 to 2000. The 25 false positives, taken from the testing set, were chosen at random to model single mistakes of the legacy classifier. Several such samples were chosen. For data projected on more than the first 87 principal components, one neuron with weights selected by the Fisher linear discriminant formula corrected 25 errors without doing any damage to classification capabilities (original skills) of the legacy AI system on the training set. For 50 or fewer principal components this separation is not perfect.
Single false positives were corrected successfully without any increase of the true positive rates. We removed more than 10 false positives at no cost to true positive detections in the street video data (Nottingham) by the use of a single linear function. Further increasing the number of corrected false positives demonstrated that a single-neuron corrector could result in gradual deterioration of the true positive rates.
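To make the idea of a one-neuron corrector concrete, the following sketch builds one on synthetic data. It is a simplified illustration, not the pipeline used in the experiments above: the feature vectors are artificial Gaussian data, the errors are modelled as a tight cluster around one additional sample, and the decision rule is a Fisher-type inequality in PCA-whitened coordinates. The names `fit_corrector`, `flags_error` and `synthetic_demo`, as well as the values of `n_components` and `alpha`, are assumptions made for this example.

```python
import numpy as np


def fit_corrector(ok_features, error_features, n_components=20, alpha=0.8):
    """Build a one-neuron corrector that flags inputs resembling the given errors."""
    mean = ok_features.mean(axis=0)
    # PCA of the correctly processed samples via SVD of the centred data.
    _, sing, vt = np.linalg.svd(ok_features - mean, full_matrices=False)
    std = sing[:n_components] / np.sqrt(len(ok_features) - 1)
    basis = vt[:n_components].T / std            # projection followed by whitening
    centre = ((error_features - mean) @ basis).mean(axis=0)
    # Fisher-type rule in whitened coordinates: flag z when (z, c) > alpha * (c, c).
    weights = basis @ centre
    bias = -alpha * (centre @ centre) - mean @ weights
    return weights, bias


def flags_error(features, weights, bias):
    """True where the corrector fires, i.e. where the legacy decision is overridden."""
    return features @ weights + bias > 0.0


def synthetic_demo(seed=0, n_dim=200):
    """Return (fraction of errors caught, fraction of unseen correct samples flagged)."""
    rng = np.random.default_rng(seed)
    scales = np.linspace(0.5, 2.0, n_dim)
    ok_train = rng.standard_normal((5000, n_dim)) * scales
    ok_test = rng.standard_normal((5000, n_dim)) * scales
    # Errors form a tight cluster around one more sample from the same distribution.
    base = rng.standard_normal(n_dim) * scales
    errors = base + 0.1 * rng.standard_normal((25, n_dim)) * scales
    weights, bias = fit_corrector(ok_train, errors, n_components=100)
    hit_rate = flags_error(errors, weights, bias).mean()
    false_alarm_rate = flags_error(ok_test, weights, bias).mean()
    return float(hit_rate), float(false_alarm_rate)


if __name__ == "__main__":
    print(synthetic_demo())
```

The corrector is a single linear functional with a threshold, computed without iterations from the mean and principal components of the correctly processed data and the centre of the errors. Errors that are scattered rather than clustered would call for several such correctors or for the cluster-based approach discussed in the next section.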
3 Clustering in High Dimensions
Producing a special corrector for every single mistake seems to be a non-optimal approach, despite some successes. In practice, happily, one corrector often improves performance and prevents the system from making some new mistakes because they are correlated. Moreover, mistakes can be grouped in clusters and we can create correctors for the clusters of situations rather than for single mistakes. Here we meet another measure concentration ‘blessing’. In high dimensions, clusters are good (well-separated) even in the situations when one can expect their strong intersection. For example, consider two clusters and the distance-based clustering. Let $r_1$ and $r_2$ be the root mean squared Euclidean distances between the centroids of the clusters and their data points, and $\rho$ be the distance between the two centroids. The standard criteria of clusters’ quality Xu and Wunsch (2008) compare $\rho$ with $r_1 + r_2$ and assume that for ‘good’ clusters $\rho > r_1 + r_2$. Assume the opposite, $\rho < r_1 + r_2$, and evaluate the volume of the intersection of two balls with radii $r_1$, $r_2$. The intersection of the spheres (boundaries of the balls) is an $(n-2)$-dimensional sphere with the centre $c$ (Figure 3). Assume $\rho^2 > |r_1^2 - r_2^2|$, which means that $c$ is situated between the centers of the balls (otherwise, the biggest ball includes more than a half of the volume of the smallest one). The intersection of clusters belongs to a ball of radius $R$:
$$R^2 = r_1^2 - \frac{(\rho^2 + r_1^2 - r_2^2)^2}{4\rho^2},$$
and the fractions of the volume of the two initial balls in the intersection are less than $(R/r_1)^n$ and $(R/r_2)^n$, respectively. These fractions evaluate the probability to confuse points between the clusters (for uniform distributions; for the Gaussian distributions the estimates are similar). We can measure the goodness of high-dimensional clusters by the ratios
$$R/r_1 \quad \text{and} \quad R/r_2.$$
Note that $(R/r_{1,2})^n$ exponentially tends to zero as $n$ increases. Small $R/r_{1,2}$ means ‘good’ clustering.
If $(R/r_{1,2})^n \ll 1$ then the probability to find a data point in the intersection of the balls (the ‘area of confusion’ between clusters) is negligible for uniform distributions in balls, isotropic Gaussian distributions and always when small volume implies small probability. Therefore, the clustering of mistakes for correction of high-dimensional machine learning systems gives good results even if clusters are not very good in the standard measures, and correction of clustered mistakes requires much fewer correctors for the same or even better accuracy Tyukin et al. (2019).
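The geometric estimate for overlapping clusters can be evaluated with elementary arithmetic. The sketch below assumes two balls of radii `r1` and `r2` whose centres are at distance `rho`; it computes the radius $R$ of the sphere along which the boundaries intersect and the resulting upper bounds $(R/r_1)^n$ and $(R/r_2)^n$ on the fractions of volume in the intersection. The function names are hypothetical, and the formula for $R$ is the standard one for intersecting spheres.

```python
import math


def intersection_radius(r1, r2, rho):
    """Radius R of the sphere in which two spheres of radii r1, r2 at centre distance rho meet."""
    if not abs(r1 - r2) < rho < r1 + r2:
        raise ValueError("the spheres do not intersect")
    if rho ** 2 <= abs(r1 ** 2 - r2 ** 2):
        raise ValueError("the intersection centre is not between the ball centres")
    d1 = (rho ** 2 + r1 ** 2 - r2 ** 2) / (2.0 * rho)   # distance from centre 1 to the cut
    return math.sqrt(r1 ** 2 - d1 ** 2)


def overlap_bounds(r1, r2, rho, n_dim):
    """Upper bounds (R/r1)^n and (R/r2)^n on the volume fractions lying in the intersection."""
    big_r = intersection_radius(r1, r2, rho)
    return (big_r / r1) ** n_dim, (big_r / r2) ** n_dim


if __name__ == "__main__":
    # Two unit balls whose centres are one radius apart: strongly overlapping in 2D.
    for n_dim in (2, 10, 50, 200):
        print(n_dim, overlap_bounds(1.0, 1.0, 1.0, n_dim))
```

Even for two unit balls whose centres are only one radius apart, the bound on the shared volume fraction decays geometrically with the dimension, which is why clusters that look poorly separated by the standard criteria can still be practically disjoint in high dimensions.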
We implemented the correctors with separation of clustered false-positive mistakes from the set of true positives and tested them on the classical face detection task Tyukin et al. (2019). The legacy object detector was an OpenCV implementation of the Haar face detector. It has been applied to video footage capturing traffic and pedestrians on the streets of Montreal. The powerful MTCNN face detector was used to generate ground truth data. The total number of true positives was 21896, and the total number of false positives was 9372. The training set contained a randomly chosen 50% of positives and false positives. PCA was used for dimensionality reduction with 200 principal components retained. A single-cluster corrector allows one to filter 90% of all errors at the cost of missing 5% of true positives. In dimension 200, a cluster of errors is sufficiently well-separated from the true positives. A significant classification performance gain was observed with more clusters, up to 100.
Further increase of dimension (the number of principal components retained) can even damage the performance because the number of features does not coincide with the dimension of the dataset, and whitening with retained minor components can lead to ill-posed problems and loss of stability. For more detail, we refer to Tyukin et al. (2019).
4 What Does ‘High Dimensionality’ Mean?
The dimensionality of data should not be naively confused with the number of features. Let us have $N$ objects with $p$ features. The usual data matrix in statistics is a 2D array with $N$ rows and $p$ columns. The rows give values of features for an individual sample, and the columns give values of a feature for different objects. In classical statistics, we assume that $N \gg p$ and even study asymptotic estimates for $N \to \infty$ and $p$ fixed. But the modern ‘post-classical’ world is different Donoho (2000): the situation with $N < p$ (and even $N \ll p$) is not anomalous anymore. Moreover, it can be considered in some sense as the generic case: we can measure a very large number of attributes for a relatively small number of individual cases.
In such a situation the default preprocessing method could be recommended Moczko et al. (2016): transform the $N \times p$ data matrix with $N < p$ into the square $N \times N$ matrix of inner products (or correlation coefficients) between the individual data vectors. After that, apply PCA and all the standard machinery of machine learning. New data will be presented by projections on the old samples. (Detailed description of this preprocessing and the following steps is presented in Moczko et al. (2016) with an applied case study.) Such a preprocessing reduces the apparent dimension of the data space to at most $N$.
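The preprocessing described above can be sketched as follows. This is an illustrative implementation under simple assumptions (features are centred by their means over the objects, and plain inner products are used rather than correlation coefficients); it is not the code of Moczko et al. (2016), and the names `fit_gram_pca` and `project` are hypothetical.

```python
import numpy as np


def fit_gram_pca(data, n_components):
    """PCA for N objects with p >> N features, computed from the N x N Gram matrix."""
    mean = data.mean(axis=0)
    centred = data - mean
    gram = centred @ centred.T                       # inner products between objects
    eigvals, eigvecs = np.linalg.eigh(gram)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, centred, eigvals[order], eigvecs[:, order]


def project(new_data, model):
    """Represent new vectors through their inner products with the old samples."""
    mean, centred, eigvals, eigvecs = model
    inner = (np.atleast_2d(new_data) - mean) @ centred.T
    return inner @ eigvecs / np.sqrt(eigvals)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.standard_normal((20, 5000))           # 20 objects, 5000 features
    model = fit_gram_pca(data, n_components=5)
    print(project(data, model).shape)
    print(project(rng.standard_normal(5000), model).shape)
```

Only an $N \times N$ eigenproblem is solved, and the principal component scores of the training objects coincide, up to sign, with those obtained from the full data matrix. After centring, at most $N - 1$ components carry non-zero variance, so `n_components` should not exceed that number.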
PCA gives us a tool for estimating the linear dimension of the dataset. Dimensionality reduction is achieved by using only the first few principal components. Several heuristics are used for evaluation of how many principal components should be retained:
The classical Kaiser rule recommends retaining the principal components corresponding to the eigenvalues of the correlation matrix $\lambda \ge 1$ (or $\lambda \ge \alpha$, where $\alpha$ is a selected threshold). This is, perhaps, the most popular choice.
Control of the fraction of variance unexplained. This approach is also popular, but it can retain too many minor components that can be considered ‘noise’.
Condition number control Gorban et al. (2018) recommends retaining the principal components corresponding to $\lambda > \lambda_{\max}/\kappa$, where $\lambda_{\max}$ is the maximal eigenvalue of the correlation matrix and $\kappa$ is the upper border of the condition number (for the recommended values see Dormann et al. (2013)). This recommendation is very useful because it provides direct control of multicollinearity.
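The three heuristics listed above can be compared on the same correlation spectrum. The sketch below is a minimal illustration; the function name `retained_components` and the default values `kaiser_threshold=1.0`, `unexplained=0.05` and `kappa=10.0` are assumptions chosen for the example and should be tuned for a particular application.

```python
import numpy as np


def retained_components(data, kaiser_threshold=1.0, unexplained=0.05, kappa=10.0):
    """Number of principal components kept by three common heuristics."""
    corr = np.corrcoef(data, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]
    eigvals = np.clip(eigvals, 0.0, None)

    # Kaiser rule: keep eigenvalues of the correlation matrix above the threshold.
    kaiser = int(np.sum(eigvals >= kaiser_threshold))

    # Smallest k whose unexplained variance fraction does not exceed the target.
    explained = np.cumsum(eigvals) / eigvals.sum()
    variance = int(np.searchsorted(explained, 1.0 - unexplained) + 1)
    variance = min(variance, len(eigvals))

    # Condition number control: keep eigenvalues above lambda_max / kappa.
    condition = int(np.sum(eigvals > eigvals[0] / kappa))

    return {"kaiser": kaiser, "variance": variance, "condition_number": condition}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = rng.standard_normal((1000, 3))
    loadings = rng.standard_normal((3, 10))
    data = latent @ loadings + 0.1 * rng.standard_normal((1000, 10))
    print(retained_components(data))
```

The three rules agree when the spectrum has a clear gap and diverge when the eigenvalues decay gradually. In the latter case the condition number rule bounds the ill-conditioning of the subsequent whitening step directly.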
After dimensionality reduction, we can perform whitening of data and apply the stochastic separation theorems. This requires a hypothesis about the distribution of data: sets of a relatively small volume should not have a high probability, and there should be no ‘heavy tails’. Unfortunately, this assumption is not always true in the practice of big data analysis. (We are grateful to G. Hinton and V. Kůrková for this comment.)
The separability properties can be affected by various violations of i.i.d. structure of data, inhomogeneity of data, small clusters and fine-grained lumping, and other peculiarities Albergante et al. (2019). Therefore, the notion of dimension should be revisited. We proposed to use the Fisher separability of data to estimate the dimension Gorban et al. (2018). For regular probability distributions, this estimate will give a standard geometric dimension, whereas, for complex (and often more realistic) cases, it will provide a more useful dimension characteristic. This approach was tested Albergante et al. (2019) for many bioinformatic datasets.
For analysis of Fisher’s separability and related estimation of dimensionality for general distributions and empirical datasets, an auxiliary random variable is used Gorban et al. (2018); Albergante et al. (2019). This is the probability that a randomly chosen point $x$ is not Fisher-separable with threshold $\alpha$ from a given data point $y$ by the discriminant (1):
$$p = p_y(\alpha) = \int_{(x, y) > \alpha (x, x)} \mu(dx),$$
where $\mu(dx)$ is the probability measure for $x$.
If $y$ is selected at random (not necessarily with the same distribution as $x$) then $p_y(\alpha)$ is a random variable. For a finite dataset $Y$ the probability $p_Y(\alpha)$ that the data point is not Fisher-separable with threshold $\alpha$ from $Y$ can be evaluated by the sum of $p_y(\alpha)$ for $y \in Y$:
$$p_Y(\alpha) \le \sum_{y \in Y} p_y(\alpha).$$
Comparison of the empirical distribution of $p_y(\alpha)$ to the distribution evaluated for the high-dimensional sphere $\mathbb{S}^{n-1}$ can be used as information about the ‘effective’ dimension of data. The probability $p_y(\alpha)$ is the same for all $y \in \mathbb{S}^{n-1}$ and exponentially decreases for large $n$. We assume that $x$ is sampled randomly from the rotationally invariant distribution on the unit sphere $\mathbb{S}^{n-1} \subset \mathbb{R}^n$. For large $n$ the asymptotic formula holds Gorban et al. (2018); Albergante et al. (2019):
$$p_y(\alpha) \sim \frac{(1 - \alpha^2)^{(n-1)/2}}{\alpha \sqrt{2\pi n}}. \tag{7}$$
Here $f(n) \sim g(n)$ means that $f(n)/g(n) \to 1$ when $n \to \infty$ (the functions here are strictly positive). It was noticed that the asymptotically equivalent formula with the denominator $\alpha\sqrt{2\pi(n-1)}$ performs better in small dimensions Albergante et al. (2019).
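The quality of the asymptotic formula can be checked against the exact normalised area of the spherical cap. The sketch below takes the asymptotic expression to be $(1 - \alpha^2)^{(n-1)/2} / (\alpha\sqrt{2\pi m})$ with $m = n$ or $m = n - 1$, and computes the exact value for $n \ge 3$ by simple midpoint quadrature of the cap integral. The function names and the number of quadrature steps are illustrative assumptions.

```python
import math


def inseparability_asymptotic(n_dim, alpha, shifted=False):
    """(1 - a^2)^((n-1)/2) / (a * sqrt(2*pi*m)) with m = n, or m = n - 1 if `shifted`."""
    m = n_dim - 1 if shifted else n_dim
    numerator = (1.0 - alpha ** 2) ** ((n_dim - 1) / 2.0)
    return numerator / (alpha * math.sqrt(2.0 * math.pi * m))


def inseparability_exact(n_dim, alpha, steps=20000):
    """Normalised area of the cap {x on S^(n-1): (x, y) > alpha}, for n_dim >= 3."""
    if n_dim < 3:
        raise ValueError("this quadrature is intended for n_dim >= 3")
    # Normalising constant Gamma(n/2) / (sqrt(pi) * Gamma((n-1)/2)), via log-gamma.
    log_norm = (math.lgamma(n_dim / 2.0) - math.lgamma((n_dim - 1) / 2.0)
                - 0.5 * math.log(math.pi))
    h = (1.0 - alpha) / steps
    power = (n_dim - 3) / 2.0
    # Midpoint rule for the integral of (1 - t^2)^((n-3)/2) over [alpha, 1].
    total = sum((1.0 - (alpha + (k + 0.5) * h) ** 2) ** power for k in range(steps))
    return math.exp(log_norm) * total * h


if __name__ == "__main__":
    for n_dim in (5, 10, 50, 200):
        exact = inseparability_exact(n_dim, 0.8)
        print(n_dim, exact,
              inseparability_asymptotic(n_dim, 0.8),
              inseparability_asymptotic(n_dim, 0.8, shifted=True))
```

Inverting such a formula with respect to $n$ for an empirically measured inseparability probability is the basic idea behind the separability-based estimate of dimension discussed in this section.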
The introduced measure of dimension performs competitively with other state-of-the-art measures for simple i.i.d. data situated on manifolds Gorban et al. (2018); Albergante et al. (2019). It was shown to perform better in the case of noisy samples and allows estimation of the intrinsic dimension in situations where the intrinsic manifold, regular distribution and i.i.d. assumptions are not valid Albergante et al. (2019).
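After this revision of the definition of data dimension, we can answer the question from the title of this section: What does ‘high dimensionality’ mean? The answer is given by the stochastic separation estimates for the uniform distribution on the unit sphere $\mathbb{S}^{n-1} \subset \mathbb{R}^n$. Let $y \in \mathbb{S}^{n-1}$. We use notation $A_{n-1}$ for the volume (surface) of $\mathbb{S}^{n-1}$. The points of $\mathbb{S}^{n-1}$, which are not Fisher-separable from $y$ with a given threshold $\alpha$, form a spherical cap with the base radius $r = \sqrt{1 - \alpha^2}$ (Figure 4). The area of this cap is estimated from above by the lateral surface of the cone with the same base, which is tangent to the sphere at the base points (see Figure 4). Therefore, the probability $\psi$ that a point selected randomly from the rotationally invariant distribution on $\mathbb{S}^{n-1}$ is not Fisher-separable from $y$ is estimated from above as
$$\psi \le \frac{A_{n-2}}{A_{n-1}}\,\frac{(1 - \alpha^2)^{(n-1)/2}}{\alpha\,(n-1)}. \tag{8}$$
The surface area of $\mathbb{S}^{n-1}$ is
$$A_{n-1} = \frac{2\pi^{n/2}}{\Gamma(n/2)},$$
where $\Gamma$ is Euler’s gamma-function.
Rewrite the estimate (8) as
$$\psi \le \frac{\Gamma(n/2)}{\sqrt{\pi}\,\Gamma((n-1)/2)}\,\frac{(1 - \alpha^2)^{(n-1)/2}}{\alpha\,(n-1)}.$$
Recall that $\Gamma(x)$ is a monotonically increasing logarithmically convex function for $x \ge 3/2$ Artin (2015). Therefore, for $n \ge 4$
$$\frac{\Gamma(n/2)}{\Gamma((n-1)/2)} \le \frac{\Gamma((n+1)/2)}{\Gamma(n/2)}.$$
Take into account that $\Gamma((n+1)/2) = \frac{n-1}{2}\,\Gamma((n-1)/2)$ (because $\Gamma(x+1) = x\,\Gamma(x)$). After elementary transforms it gives us
$$\frac{\Gamma(n/2)}{\Gamma((n-1)/2)} \le \sqrt{\frac{n-1}{2}}.$$
Finally, we get an elementary estimate for $\psi$ from above
$$\psi \le \frac{(1 - \alpha^2)^{(n-1)/2}}{\alpha\sqrt{2\pi(n-1)}}. \tag{11}$$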
Compared to (7), this estimate from above is asymptotically exact.
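Estimate from above the probability of separability violations using (11) and an elementary rule: for any family of events $U_1, U_2, \ldots, U_M$,
$$P(U_1 \cup U_2 \cup \ldots \cup U_M) \le P(U_1) + P(U_2) + \ldots + P(U_M). \tag{12}$$
Let $Y$ be an i.i.d. sample from the rotationally invariant distribution on $\mathbb{S}^{n-1}$ and let $0 < \delta < 1$. If
$$|Y|\,\frac{(1 - \alpha^2)^{(n-1)/2}}{\alpha\sqrt{2\pi(n-1)}} < \delta, \tag{13}$$
then all sample points with a probability greater than $1 - \delta$ are Fisher-separable from a given point $y$ with a threshold $\alpha$. Similarly, if
$$|Y|^2\,\frac{(1 - \alpha^2)^{(n-1)/2}}{\alpha\sqrt{2\pi(n-1)}} < \delta, \tag{14}$$
then with probability greater than $1 - \delta$ each sample point is Fisher-separable from the rest of the sample with a threshold $\alpha$.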
Estimates (13) and (14) provide sufficient conditions for separability. Table 1 illustrates these estimates (the upper borders of $|Y|$ in these estimates are presented in the table with three significant figures). For an illustration of the separability properties, we estimated from above the sample size for which the Fisher-separability is guaranteed with a probability 0.99 and a given threshold value $\alpha$ (Table 1). These sample sizes grow fast with dimension. From the Fisher-separability point of view, dimensions 30 or 50 are already large. The effects of high-dimensional stochastic separability emerge with increasing dimensionality much earlier than, for example, the appearance of exponentially large quasi-orthogonal bases Gorban et al. (2016).
| n = 10 | n = 20 | n = 30 | n = 40 | n = 50 | n = 60 | n = 70 | n = 80 |
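Sample-size borders of the kind tabulated above can be recomputed from the sufficient conditions with a few lines of code. The sketch below assumes that the single-pair estimate has the form $(1 - \alpha^2)^{(n-1)/2} / (\alpha\sqrt{2\pi(n-1)})$ and that the two conditions are $|Y|\,\psi < \delta$ and $|Y|^2\,\psi < \delta$. The threshold $\alpha = 0.8$ used as a default is a hypothetical choice for illustration, so the printed numbers should not be read as the entries of Table 1.

```python
import math


def single_pair_bound(n_dim, alpha):
    """Upper estimate (1 - a^2)^((n-1)/2) / (a * sqrt(2*pi*(n-1))) for one pair of points."""
    numerator = (1.0 - alpha ** 2) ** ((n_dim - 1) / 2.0)
    return numerator / (alpha * math.sqrt(2.0 * math.pi * (n_dim - 1)))


def max_sample_sizes(n_dim, alpha=0.8, confidence=0.99):
    """Largest sample sizes allowed by the two union-bound conditions."""
    delta = 1.0 - confidence
    psi = single_pair_bound(n_dim, alpha)
    from_given_point = delta / psi          # from |Y| * psi < delta
    whole_sample = math.sqrt(delta / psi)   # from |Y|^2 * psi < delta
    return from_given_point, whole_sample


if __name__ == "__main__":
    for n_dim in range(10, 90, 10):
        given, whole = max_sample_sizes(n_dim)
        print(f"n = {n_dim}: {given:.3g} (from a given point), {whole:.3g} (whole sample)")
```

Both borders grow geometrically with the dimension, the second one with the square root of the common ratio of the first, which is the quantitative content of the remark that dimensions 30 or 50 are already large from the Fisher-separability point of view.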
5 Discussion: The Heresy of Unheard-of Simplicity and Single Cell Revolution in Neuroscience
V. Kreinovich Kreinovich (2019) summarised the impression from the effective AI correctors based on Fisher’s discriminant in high dimensions as “The heresy of unheard-of simplicity”, quoting the famous poetry of Pasternak. Such simplicity also appears in brain functioning. Despite our expectation that complex intellectual phenomena are a result of a perfectly orchestrated collaboration between many different cells, there is a phenomenon of sparse coding, concept cells, or so-called ‘grandmother cells’, which selectively react to specific concepts like a grandmother or a well-known actress (‘Jennifer Aniston cells’) Quian Quiroga et al. (2005). These experimental results continue the single neuron revolution in sensory psychology Barlow (1972).
The idea of grandmother or concept cells was proposed in the late 1960s. In 1972, Barlow published a manifesto about the single neuron revolution in sensory psychology Barlow (1972). He suggested: “our perceptions are caused by the activity of a rather small number of neurons selected from a very large population of predominantly silent cells.” Barlow presented much experimental evidence of single-cell perception. In all these examples, neurons reacted selectively to the key patterns (called ‘trigger features’). This reaction was invariant to various changes in conditions.
The modern point of view on the single-cell revolution was briefly summarised recently by R. Quian Quiroga Quian Quiroga (2019). He mentioned that the ‘grandmother cells’ were invented by Lettvin “to ridicule the idea that single neurons can encode specific concepts”. Later discoveries changed the situation and added more meaning and detail to these ideas. The idea of concept cells evolved over decades. According to Quian Quiroga, these cells are not involved in identifying a particular stimulus or concept. They are rather involved in creating and retrieving associations and can be seen as the “building blocks of episodic memory”. Many recent discoveries used data received from intracranial electrodes implanted in the medial temporal lobe (MTL; the hippocampus and surrounding cortex) for the medical treatment of patients. The activity of dozens of neurons can be recorded while patients perform different tasks. Neurons with high selectivity and invariance were found. In particular, one neuron fired to the presentation of seven different pictures of Jennifer Aniston and her spoken and written name, but not to 80 pictures of other persons. Emergence of associations between images was also discovered.
Some important memory functions are performed by stratified brain structures, such as the hippocampus. The CA1 region of the hippocampus includes a monolayer of morphologically similar pyramidal cells oriented parallel to the main axis (Figure 5). In humans, CA1 region of the hippocampus contains of pyramidal neurons. Excitatory inputs to these neurons come from the CA3 regions (ipsi- and contra-lateral). Each CA3 pyramidal neuron sends an axon that bifurcates and leaves multiple collaterals in the CA1 (Figure 5b). This structural organization allows multidimensional information to be transmitted from the CA3 region to neurons in the CA1 region. Thus, we have simultaneous convergence and divergence of the information content (Figure 5b, right). A single pyramidal cell can receive around 30,000 excitatory and 1700 inhibitory inputs (data for rats Megías et al. (2001)). Moreover, the numbers of synaptic contacts vary greatly between neurons Druckmann et al. (2014). There are nonuniform and clustered connectivity patterns. Such variability is considered part of the mechanism enhancing neuronal feature selectivity Druckmann et al. (2014). However, anatomical connectivity does not automatically translate into functional connectivity, and a realistic model should reduce the number of functional connections significantly (by several orders of magnitude) (see, for example, Brivanlou et al. (2004)). Nevertheless, even several dozen effective functional connections are sufficient for the application of stochastic separation theorems (see Table 1).
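The claim that a few dozen dimensions already suffice can be probed numerically. The following is a minimal sketch, not the computation behind Table 1. It assumes the Fisher-separability condition (x, y) < α(x, x) from the stochastic separation literature cited here, and it assumes points drawn uniformly from the unit ball. The function names, the value α = 0.8 and the sample size are illustrative choices.

```python
import numpy as np

def sample_unit_ball(m, n, rng):
    """Draw m points uniformly from the n-dimensional unit ball."""
    g = rng.standard_normal((m, n))
    g /= np.linalg.norm(g, axis=1, keepdims=True)  # uniform directions on the sphere
    r = rng.random(m) ** (1.0 / n)                 # radii giving a uniform ball
    return g * r[:, None]

def fisher_separable_fraction(x, alpha=0.8):
    """Fraction of points x_i with (x_i, x_j) < alpha * (x_i, x_i) for all j != i."""
    gram = x @ x.T                                 # all pairwise inner products
    thresh = alpha * np.diag(gram)[:, None]        # alpha * (x_i, x_i), one per row
    violations = gram >= thresh
    np.fill_diagonal(violations, False)            # a point is not compared with itself
    return float(np.mean(~violations.any(axis=1)))

# Assumed illustrative setting: 2000 points, several dimensions.
rng = np.random.default_rng(0)
for n in (2, 10, 30, 60):
    pts = sample_unit_ball(2000, n, rng)
    print(n, fisher_separable_fraction(pts))
```

Under these assumptions, the separable fraction should be close to zero in low dimension and should approach one once the dimension reaches several dozen. Real neuronal input distributions are of course not uniform in a ball, so this only illustrates the qualitative effect.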
For sufficiently high-dimensional sets of input signals, a simple enough functional neuronal model with Hebbian learning (the generalized Oja rule Tyukin et al. (2019); Gorban et al. (2019)) is capable of explaining the following phenomena:
the extreme selectivity of single neurons to the information content of high-dimensional data (Figure 5(c1)),
simultaneous separation of several uncorrelated informational items from a large set of stimuli (Figure 5(c2)),
dynamic learning of new items by associating them with already known ones (Figure 5(c3)).
These results constitute a basis for the organization of complex memories in ensembles of single neurons.
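To make the Hebbian mechanism concrete, here is a minimal sketch of the classical Oja rule for a single linear neuron. It is not the generalized Oja rule of the cited works, whose exact form is not reproduced in this section. The stimulus model (one recurring pattern `v` plus weak isotropic noise), the learning rate and all names are assumptions made for illustration.

```python
import numpy as np

def oja_train(x, eta=0.01, epochs=20, seed=0):
    """Train one linear neuron y = w.x with the classical Oja rule."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(x.shape[1])
    w /= np.linalg.norm(w)                 # start from a random unit vector
    for _ in range(epochs):
        for xi in x[rng.permutation(len(x))]:
            y = w @ xi                     # neuron response to the stimulus
            w += eta * y * (xi - y * w)    # Hebbian growth with normalizing decay
    return w

# Hypothetical stimuli: one recurring pattern v in 100 dimensions plus weak noise.
rng = np.random.default_rng(1)
n = 100
v = rng.standard_normal(n)
v /= np.linalg.norm(v)
stimuli = rng.standard_normal((500, 1)) * v + 0.1 * rng.standard_normal((500, n))

w = oja_train(stimuli)

# Compare the response to the learned pattern with responses to random unit stimuli.
probes = rng.standard_normal((1000, n))
probes /= np.linalg.norm(probes, axis=1, keepdims=True)
print("response to learned pattern:", abs(w @ v))
print("mean response to random unit stimuli:", np.abs(probes @ w).mean())
```

In this toy setting the weight vector aligns, up to sign, with the dominant pattern, while the response to an unrelated random unit stimulus is of order 1/√n. This gives a rough picture of how selectivity of a single unit can grow with dimension.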
Re-training large ensembles of neurons is extremely time- and resource-consuming, both in the brain and in machine learning. In fact, such re-training is impossible in many real-life situations and applications. “The existence of high discriminative units and a hierarchical organization for error correction are fundamental for effective information encoding, processing and execution, also relevant for fast learning and to optimize memory capacity” Varona (2019).
The multidimensional brain is the most puzzling example of the ‘heresy of unheard-of simplicity’, but the same phenomenon has been observed in social sciences and in many other disciplines Kreinovich (2019).
There is a fundamental difference and complementarity between the analysis of essentially high-dimensional datasets, where simple linear methods are applicable, and that of reducible datasets, for which non-linear methods are needed both for reduction and analysis Gorban and Tyukin (2018). This alternative in neuroscience was described as high-dimensional ‘brainland’ versus low-dimensional ‘flatland’ Barrio (2019). The specific multidimensional effects of the ‘blessing of dimensionality’ can be considered the deepest reason for the discovery of small groups of neurons that control important physiological phenomena. On the other hand, even low-dimensional data often live in a higher-dimensional space, and the dynamics of low-dimensional models should be naturally embedded into the high-dimensional ‘brainland’. Thus, a “crucial problem nowadays is the ‘game’ of moving from ‘brainland’ to ‘flatland’ and backward” Barrio (2019).
C. van Leeuwen formulated a radically opposite point of view van Leeuwen (2019): neither high-dimensional linear models nor low-dimensional non-linear models have serious relations to the brain.
The devil is in the detail. First of all, preprocessing is always needed to extract the relevant features. The linear method of choice is PCA. Various versions of non-linear PCA can also be useful Gorban et al. (2008). After that, there is no guarantee that the dataset is either essentially high-dimensional or reducible. It can be a mixture of both alternatives; therefore, both the extraction of a reducible lower-dimensional subset for non-linear analysis and the linear analysis of the high-dimensional residuals may be needed together.
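The split into a reducible part and high-dimensional residuals can be sketched with plain PCA. The example below is only an illustration under stated assumptions: the function name `pca_split`, the cumulative-variance threshold used to pick the number of components, and the synthetic data are all hypothetical choices, not a procedure taken from the cited works.

```python
import numpy as np

def pca_split(x, var_threshold=0.9):
    """Split centred data into leading PCA scores and the remaining residuals."""
    xc = x - x.mean(axis=0)                          # centre each feature
    _, s, vt = np.linalg.svd(xc, full_matrices=False)
    var_ratio = s**2 / np.sum(s**2)                  # explained variance per component
    # Smallest k whose cumulative explained variance reaches the threshold.
    k = int(np.searchsorted(np.cumsum(var_ratio), var_threshold)) + 1
    k = min(k, len(s))
    components = vt[:k]                              # rows are principal directions
    scores = xc @ components.T                       # low-dimensional part
    residual = xc - scores @ components              # high-dimensional remainder
    return k, components, scores, residual

# Assumed synthetic data: 3 strong latent factors embedded in 40 dimensions plus noise.
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.standard_normal((40, 3)))   # orthonormal columns
latent = 5.0 * rng.standard_normal((500, 3))
data = latent @ basis.T + 0.2 * rng.standard_normal((500, 40))

k, components, scores, residual = pca_split(data)
print("retained components:", k)
print("residual variance per coordinate:", residual.var())
```

The scores could then be passed to a non-linear method, while the residuals could be examined with linear tools such as Fisher discriminants. The threshold rule is the weakest part of this sketch, and other rules for choosing the number of components may be preferable in practice.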
Author Contributions: Conceptualization, ANG, VAM and IYT; Methodology, ANG, VAM and IYT; Writing–Original Draft Preparation, ANG; Writing–Editing, ANG, VAM and IYT; Visualization, ANG, VAM and IYT.
Funding: The work was supported by the Ministry of Science and Higher Education of the Russian Federation (Project No. 14.Y26.31.0022). Work of ANG and IYT was also supported by Innovate UK (Knowledge Transfer Partnership grants KTP009890; KTP010522) and University of Leicester. VAM acknowledges support from the Spanish Ministry of Economy, Industry, and Competitiveness (grant FIS2017-82900-P).
The authors declare no conflict of interest.
References
- Bellman (1957) Bellman, R. Dynamic Programming; Princeton University Press: Princeton, NJ, USA, 1957.
- Bellman (1954) Bellman, R. The theory of dynamic programming. Bull. Amer. Math. Soc. 1954, 60, 503–515.
- Bellman and Kalaba (1961) Bellman, R.; Kalaba, R. Reduction of dimensionality, dynamic programming, and control processes. J. Basic Eng. 1961, 83, 82–84, https://doi.org/10.1115/1.3658896.
- Gorban et al. (2006) Gorban, A.N.; Kazantzis, N.; Kevrekidis, I.G.; Öttinger, H.C.; Theodoropoulos, C. (Eds.) Model Reduction and Coarse–Graining Approaches for Multiscale Phenomena; Springer: Berlin/Heidelberg, Germany, 2006.
- Jolliffe (1993) Jolliffe, I. Principal Component Analysis; Springer: Berlin/Heidelberg, Germany, 1993.
- Gorban et al. (2008) Gorban, A.N.; Kégl, B.; Wunsch, D.; Zinovyev, A. (Eds.) Principal Manifolds for Data Visualisation and Dimension Reduction; Springer: Berlin/Heidelberg, Germany, 2008; https://doi.org/10.1007/978-3-540-73750-6.
- Gorban and Zinovyev (2010) Gorban, A.N.; Zinovyev, A. Principal manifolds and graphs in practice: from molecular biology to dynamical systems. Int. J. Neural Syst. 2010, 20, 219–232, https://doi.org/10.1142/S0129065710002383.
- Cichocki et al. (2016) Cichocki, A.; Lee, N.; Oseledets, I.; Phan, A.H.; Zhao, Q.; Mandic, D.P. Tensor networks for dimensionality reduction and large-scale optimization: Part 1 low-rank tensor decompositions. Found. Trends Mach. Learn. 2016, 9, 249–429, https://doi.org/10.1561/2200000059.
- Cichocki et al. (2017) Cichocki, A.; Phan, A.H.; Zhao, Q.; Lee, N.; Oseledets, I.; Sugiyama, M.; Mandic, D.P. Tensor networks for dimensionality reduction and large-scale optimization: Part 2 applications and future perspectives. Found. Trends Mach. Learn. 2017, 9, 431–673, https://doi.org/10.1561/2200000067.
- Beyer et al. (1999) Beyer, K.; Goldstein, J.; Ramakrishnan, R.; Shaft, U. When is “nearest neighbor” meaningful? In Proceedings of the 7th International Conference on Database Theory (ICDT), Jerusalem, Israel, 10–12 January 1999; pp. 217–235, https://doi.org/10.1007/3-540-49257-7_15.
- Pestov (2013) Pestov, V. Is the k-NN classifier in high dimensions affected by the curse of dimensionality? Comput. Math. Appl. 2013, 65, 1427–1437, https://doi.org/10.1016/j.camwa.2012.09.011.
- Kainen (1997) Kainen, P.C. Utilizing geometric anomalies of high dimension: when complexity makes computation easier. In Computer-Intensive Methods in Control and Signal Processing: The Curse of Dimensionality; Warwick, K., Kárný, M., Eds.; Springer: New York, NY, USA, 1997; pp. 283–294, https://doi.org/10.1007/978-1-4612-1996-5_18.
- Brown et al. (1997) Brown, B.M.; Hall, P.; Young, G.A. On the effect of inliers on the spatial median. J. Multivar. Anal. 1997, 63, 88–104, https://doi.org/10.1006/jmva.1997.1691.
- Donoho (2000) Donoho, D.L. High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality; Invited lecture at Mathematical Challenges of the 21st Century, AMS National Meeting, Los Angeles, CA, USA, August 6-12, 2000; CiteSeerX 10.1.1.329.3392.
- Chen et al. (2013) Chen, D.; Cao, X.; Wen, F.; Sun, J. Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification. https://doi.org/10.1109/CVPR.2013.389.
- Liu et al. (2016) Liu, G.; Liu, Q.; Li, P. Blessing of dimensionality: Recovering mixture data via dictionary pursuit. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 47–60, https://doi.org/10.1109/TPAMI.2016.2539946.
- Murtagh (2009) Murtagh, F. The remarkable simplicity of very high dimensional data: application of model-based clustering. J. Classif. 2009, 26, 249–277, https://doi.org/10.1007/s00357-009-9037-9.
- Anderson et al. (2014) Anderson, J.; Belkin, M.; Goyal, N.; Rademacher, L.; Voss, J. The More, the Merrier: the Blessing of Dimensionality for Learning Large Gaussian Mixtures. In Proceedings of The 27th Conference on Learning Theory, Barcelona, Spain, 13–15 June 2014; Balcan, M.F.; Feldman, V.; Szepesvári, C., Eds.; PMLR: Barcelona, Spain, 2014; Volume 35, pp. 1135–1164.
- Gorban et al. (2016) Gorban, A.N.; Tyukin, I.Y.; Romanenko, I. The blessing of dimensionality: Separation theorems in the thermodynamic limit. IFAC-PapersOnLine 2016, 49, 64–69, https://doi.org/10.1016/j.ifacol.2016.10.755.
- Li et al. (2018) Li, Q.; Cheng, G.; Fan, J.; Wang, Y. Embracing the blessing of dimensionality in factor models. J. Am. Stat. Assoc. 2018, 113, 380–389, https://doi.org/10.1080/01621459.2016.1256815.
- Landgraf and Lee (2019) Landgraf, A.J.; Lee, Y. Generalized principal component analysis: Projection of saturated model parameters. Technometrics 2019, https://doi.org/10.1080/00401706.2019.1668854.
- Donoho (2006) Donoho, D.L. Compressed sensing. IEEE Trans. Inf. Theory 2006, 52, 1289–1306, https://doi.org/10.1109/TIT.2006.871582.
- Donoho and Tanner (2009) Donoho, D.; Tanner, J. Observed universality of phase transitions in high-dimensional geometry, with implications for modern data analysis and signal processing. Phil. Trans. R. Soc. A 2009, 367, 4273–4293, https://doi.org/10.1098/rsta.2009.0152.
- Candes et al. (2005) Candes, E.; Rudelson, M.; Tao, T.; Vershynin, R. Error correction via linear programming. In Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS’05), Pittsburgh, PA, USA, 23–25 October 2005; pp. 668–681, https://doi.org/10.1109/SFCS.2005.5464411.
- Pereda et al. (2018) Pereda, E.; García-Torres, M.; Melián-Batista, B.; Mañas, S.; Méndez, L.; González, J.J. The blessing of Dimensionality: Feature Selection outperforms functional connectivity-based feature transformation to classify ADHD subjects from EEG patterns of phase synchronisation. PLoS ONE 2018, 13, e0201660, https://doi.org/10.1371/journal.pone.0201660.
- Kainen and Kůrková (1993) Kainen, P.; Kůrková, V. Quasiorthogonal dimension of Euclidean spaces. Appl. Math. Lett. 1993, 6, 7–10, https://doi.org/10.1016/0893-9659(93)90023-G.
- Hall et al. (2005) Hall, P.; Marron, J.; Neeman, A. Geometric representation of high dimension, low sample size data. J. Royal Stat. Soc. B 2005, 67, 427–444, https://doi.org/10.1111/j.1467-9868.2005.00510.x.
- Gorban et al. (2016) Gorban, A.N.; Tyukin, I.; Prokhorov, D.; Sofeikov, K. Approximation with random bases: Pro et contra. Inf. Sci. 2016, 364–365, 129–145, https://doi.org/10.1016/j.ins.2015.09.021.
- Dasgupta and Gupta (2003) Dasgupta, S.; Gupta, A. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Algor. 2003, 22, 60–65, https://doi.org/10.1002/rsa.10073.
- Gorban and Tyukin (2018) Gorban, A.N.; Tyukin, I.Y. Blessing of dimensionality: mathematical foundations of the statistical physics of data. Phil. Trans. R. Soc. A 2018, 376, 20170237, https://doi.org/10.1098/rsta.2017.0237.
- Vershynin (2018) Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science; Cambridge Series in Statistical and Probabilistic Mathematics; Cambridge University Press: Cambridge, UK, 2018.
- Giannopoulos and Milman (2000) Giannopoulos, A.A.; Milman, V.D. Concentration property on probability spaces. Adv. Math. 2000, 156, 77–106, https://doi.org/10.1006/aima.2000.1949.
- Ledoux (2005) Ledoux, M. The Concentration of Measure Phenomenon; Number 89 in Mathematical Surveys & Monographs; AMS: Providence, RI, USA, 2005.
- Gibbs (1960) Gibbs, J.W. Elementary Principles in Statistical Mechanics, Developed with Especial Reference to the Rational Foundation of Thermodynamics; Dover Publications: New York, NY, USA, 1960.
- Gromov (2003) Gromov, M. Isoperimetry of waists and concentration of maps. Geom. Funct. Anal. 2003, 13, 178–215, https://doi.org/10.1007/s00039-009-0703-1.
- Lévy (1951) Lévy, P. Problèmes Concrets D’analyse Fonctionnelle; Gauthier-Villars: Paris, France, 1951.
- Dubhashi and Panconesi (2009) Dubhashi, D.P.; Panconesi, A. Concentration of Measure for the Analysis of Randomized Algorithms; Cambridge University Press: Cambridge, UK, 2009.
- Ball (1997) Ball, K. An Elementary Introduction to Modern Convex Geometry. In Flavors of Geometry; Cambridge University Press: Cambridge, UK, 1997; Volume 31.
- Gorban et al. (2018) Gorban, A.N.; Golubkov, A.; Grechuk, B.; Mirkes, E.M.; Tyukin, I.Y. Correction of AI systems by linear discriminants: Probabilistic foundations. Inf. Sci. 2018, 466, 303–322, https://doi.org/10.1016/j.ins.2018.07.040.
- Gorban et al. (2019) Gorban, A.N.; Makarov, V.A.; Tyukin, I.Y. The unreasonable effectiveness of small neural ensembles in high-dimensional brain. Phys. Life Rev. 2019, 29, 55–88, https://doi.org/10.1016/j.plrev.2018.09.005.
- Bárány and Füredi (1988) Bárány, I.; Füredi, Z. On the shape of the convex hull of random points. Probab. Theory Relat. Fields 1988, 77, 231–240, https://doi.org/10.1007/BF00334039.
- Gorban and Tyukin (2017) Gorban, A.N.; Tyukin, I.Y. Stochastic separation theorems. Neural Netw. 2017, 94, 255–259, https://doi.org/10.1016/j.neunet.2017.07.014.
- Tyukin et al. (2019) Tyukin, I.Y.; Gorban, A.N.; McEwan, A.A.; Meshkinfamfard, S. Blessing of dimensionality at the edge. arXiv 2019, arXiv:1910.00445.
- Gorban et al. (2019) Gorban, A.N.; Burton, R.; Romanenko, I.; Tyukin, I.Y. One-trial correction of legacy AI systems and stochastic separation theorems. Inf. Sci. 2019, 484, 237–254, https://doi.org/10.1016/j.ins.2019.02.001.
- Fisher (1936) Fisher, R.A. The Use of Multiple Measurements in Taxonomic Problems. Ann. Eugenics 1936, 7, 179–188, https://doi.org/10.1111/j.1469-1809.1936.tb02137.x.
- Kůrková (2019) Kůrková, V. Some insights from high-dimensional spheres: Comment on “The unreasonable effectiveness of small neural ensembles in high-dimensional brain” by Alexander N. Gorban et al. Phys. Life Rev. 2019, 29, 98–100, https://doi.org/10.1016/j.plrev.2019.03.014.
- Gorban et al. (2019) Gorban, A.N.; Makarov, V.A.; Tyukin, I.Y. Symphony of high-dimensional brain. Reply to comments on “The unreasonable effectiveness of small neural ensembles in high-dimensional brain”. Phys. Life Rev. 2019, 29, 115–119, https://doi.org/10.1016/j.plrev.2019.06.003.
- Grechuk (2019) Grechuk, B. Practical stochastic separation theorems for product distributions. In Proceedings of the IEEE IJCNN 2019—International Joint Conference on Neural Networks, Budapest, Hungary, 14–19 July 2019; https://doi.org/10.1109/IJCNN.2019.8851817.
- Kůrková and Sanguineti (2019) Kůrková, V.; Sanguineti, M. Probabilistic Bounds for Binary Classification of Large Data Sets. In Proceedings of the International Neural Networks Society, Genova, Italy, 16–18 April 2019; Oneto, L., Navarin, N., Sperduti, A., Anguita, D., Eds.; Springer: Berlin/Heidelberg, Germany, 2019; Volume 1, pp. 309–319, https://doi.org/10.1007/978-3-030-16841-4_32.
- Tyukin et al. (2019) Tyukin, I.Y.; Gorban, A.N.; Grechuk, B. Kernel Stochastic Separation Theorems and Separability Characterizations of Kernel Classifiers. In Proceedings of the IEEE IJCNN 2019—International Joint Conference on Neural Networks, Budapest, Hungary, 14–19 July 2019; https://doi.org/10.1109/IJCNN.2019.8852278.
- Meshkinfamfard et al. (2018) Meshkinfamfard, S.; Gorban, A.N.; Tyukin, I.V. Tackling Rare False-Positives in Face Recognition: A Case Study. In Proceedings of the 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS); IEEE: Exeter, United Kingdom, 2018; pp. 1592–1598, https://doi.org/10.1109/HPCC/SmartCity/DSS.2018.00260.
- Tyukin et al. (2019) Tyukin, I.Y.; Gorban, A.N.; Green, S.; Prokhorov, D. Fast construction of correcting ensembles for legacy artificial intelligence systems: Algorithms and a case study. Inf. Sci. 2019, 485, 230–247, https://doi.org/10.1016/j.ins.2018.11.057.
- Tyukin et al. (2018) Tyukin, I.Y.; Gorban, A.N.; Sofeikov, K.; Romanenko, I. Knowledge transfer between artificial intelligence systems. Front. Neurorobot. 2018, 12, https://doi.org/10.3389/fnbot.2018.00049.
- Allison et al. (2018) Allison, P.M.; Sofeikov, K.; Levesley, J.; Gorban, A.N.; Tyukin, I.; Cooper, N.J. Exploring automated pottery identification [Arch-I-Scan]. Internet Archaeol. 2018, 50, https://doi.org/10.11141/ia.50.11.
- Romanenko et al. (2019) Romanenko, I.; Gorban, A.; Tyukin, I. Image Processing. US Patent 10,489,634 B2, Nov. 26, 2019. Available online: https://patents.google.com/patent/US10489634B2/en (accessed on 5 January 2020).
- Xu and Wunsch (2008) Xu, R.; Wunsch, D. Clustering; Wiley: Hoboken, NJ, USA, 2008.
- Moczko et al. (2016) Moczko, E.; Mirkes, E.M.; Cáceres, C.; Gorban, A.N.; Piletsky, S. Fluorescence-based assay as a new screening tool for toxic chemicals. Sci. Rep. 2016, 6, 33922, https://doi.org/10.1038/srep33922.
- Dormann et al. (2013) Dormann, C.F.; Elith, J.; Bacher, S.; Buchmann, C.; Carl, G.; Carré, G.; Marquéz, J.R.; Gruber, B.; Lafourcade, B.; Leitão, P.J.; et al. Collinearity: A review of methods to deal with it and a simulation study evaluating their performance. Ecography 2013, 36, 27–46, https://doi.org/10.1111/j.1600-0587.2012.07348.x.
- Albergante et al. (2019) Albergante, L.; Bac, J.; Zinovyev, A. Estimating the effective dimension of large biological datasets using Fisher separability analysis. In Proceedings of the IEEE IJCNN 2019—International Joint Conference on Neural Networks, Budapest, Hungary, 14–19 July 2019; https://doi.org/10.1109/IJCNN.2019.8852450.
- Artin (2015) Artin, E. The Gamma Function; Courier Dover Publications: Mineola, NY, USA, 2015.
- Kreinovich (2019) Kreinovich, V. The heresy of unheard-of simplicity: Comment on “The unreasonable effectiveness of small neural ensembles in high-dimensional brain” by A.N. Gorban, V.A. Makarov, and I.Y. Tyukin. Phys. Life Rev. 2019, 29, 93–95, https://doi.org/10.1016/j.plrev.2019.04.006.
- Quian Quiroga et al. (2005) Quian Quiroga, R.; Reddy, L.; Kreiman, G.; Koch, C.; Fried, I. Invariant visual representation by single neurons in the human brain. Nature 2005, 435, 1102–1107, https://doi.org/10.1038/nature03687.
- Barlow (1972) Barlow, H.B. Single units and sensation: a neuron doctrine for perceptual psychology? Perception 1972, 1, 371–394, https://doi.org/10.1068/p010371.
- Quian Quiroga (2019) Quian Quiroga, R. Akakhievitch revisited: Comment on “The unreasonable effectiveness of small neural ensembles in high-dimensional brain” by Alexander N. Gorban et al. Phys. Life Rev. 2019, 29, 111–114, https://doi.org/10.1016/j.plrev.2019.02.014.
- Megías et al. (2001) Megías, M.; Emri, Z.S.; Freund, T.F.; Gulyás, A.I. Total number and distribution of inhibitory and excitatory synapses on hippocampal CA1 pyramidal cells. Neuroscience 2001, 102, 527–540, https://doi.org/10.1016/S0306-4522(00)00496-6.
- Druckmann et al. (2014) Druckmann, S.; Feng, L.; Lee, B.; Yook, C.; Zhao, T.; Magee, J.C.; Kim, J. Structured synaptic connectivity between hippocampal regions. Neuron 2014, 81, 629–640, https://doi.org/10.1016/j.neuron.2013.11.026.
- Brivanlou et al. (2004) Brivanlou, I.H.; Dantzker, J.L.; Stevens, C.F.; Callaway, E.M. Topographic specificity of functional connections from hippocampal CA3 to CA1. Proc. Natl. Acad. Sci. USA 2004, 101, 2560–2565, https://doi.org/10.1073/pnas.0308577100.
- Tyukin et al. (2019) Tyukin, I.; Gorban, A.N.; Calvo, C.; Makarova, J.; Makarov, V.A. High-dimensional brain: A tool for encoding and rapid learning of memories by single neurons. Bull. Math. Biol. 2019, 81, 4856–4888, https://doi.org/10.1007/s11538-018-0415-5.
- Varona (2019) Varona, P. High and low dimensionality in neuroscience and artificial intelligence: Comment on “The unreasonable effectiveness of small neural ensembles in high-dimensional brain” by A.N. Gorban et al. Phys. Life Rev. 2019, 29, 106–107, https://doi.org/10.1016/j.plrev.2019.02.008.
- Barrio (2019) Barrio, R. “Brainland” vs. “flatland”: how many dimensions do we need in brain dynamics? Comment on the paper “The unreasonable effectiveness of small neural ensembles in high-dimensional brain” by Alexander N. Gorban et al. Phys. Life Rev. 2019, 29, 108–110, https://doi.org/10.1016/j.plrev.2019.02.010.
- van Leeuwen (2019) van Leeuwen, C. The reasonable ineffectiveness of biological brains in applying the principles of high-dimensional cybernetics: Comment on “The unreasonable effectiveness of small neural ensembles in high-dimensional brain” by Alexander N. Gorban et al. Phys. Life Rev. 2019, 29, 104–105, https://doi.org/10.1016/j.plrev.2019.03.005.