1 Introduction: Five “Foundations”, from geometry to probability, quantum mechanics, statistical physics and machine learning
It’s not given us to foretell
How our words will echo through the ages…
F.I. Tyutchev, English translation by F. Jude
The Sixth Hilbert Problem was inspired by the “investigations on the foundations of geometry” [1], i.e. by Hilbert’s work “The Foundations of Geometry” [2], which firmly implanted the axiomatic method not only in geometry but also in other branches of mathematics. The Sixth Problem proclaimed the expansion of the axiomatic method beyond the existing mathematical disciplines, into physics and further on.
The Sixth Problem sounds very unusual and not purely mathematical. This may be a reason why some great works inspired by this problem contain no reference to it. The most famous example is von Neumann’s book “Mathematical Foundations of Quantum Mechanics” [3]. John von Neumann was Hilbert’s assistant, and they worked together on the mathematical foundations of quantum mechanics. This work was obviously within the framework of the Sixth Problem, but this framework was not mentioned in the book.
In 1933, Kolmogorov answered Hilbert’s challenge to axiomatize the theory of probability [4]. He did not cite the Sixth Problem but explicitly referred to Hilbert’s “Foundations of Geometry” as the prototype for “the purely mathematical development” of the theory. But Hilbert in his Sixth Problem asked for more: for “a rigorous and satisfactory development of the method of the mean values in mathematical physics”. He had in mind statistical physics and “in particular the kinetic theory of gases”. The sixth chapter of Kolmogorov’s book contains a survey of some results of the author and Khinchin about independence and the law of large numbers, and the Appendix includes a description of the zero-one laws in probability. These are the first steps towards a rigorous basis of “the method of mean values”. Ten years later, in 1943, Khinchin published the book “Mathematical Foundations of Statistical Mechanics” [5]. This brought an answer to the Sixth Problem one step closer, but again without explicit reference to Hilbert’s talk. The analogy between the titles of von Neumann’s and Khinchin’s books is obvious.

The main idea of statistical mechanics, in its essence, can be called the blessing of dimensionality: if a system can be represented as a union of many weakly interacting subsystems then, in the thermodynamic limit (when the number of such subsystems tends to infinity), the whole system can be described by relatively simple deterministic relations in the low-dimensional space of macroscopic variables. More means less – in very high-dimensional spaces many differences between sets and functions become negligible (vanish) and the laws become simpler. This point of view on statistical mechanics was developed mainly by Gibbs (1902) (ensemble equivalence) [6], but Khinchin made the following remark about this work: “although the arguments are clear from the logical standpoint, they do not pretend to any analytical rigor”, exactly in the spirit of Hilbert’s request for “a rigorous and satisfactory development”. The devil is in the detail: how should we define the thermodynamic limit, and in which sense are the ensembles equivalent? Under some rigorously formulated conditions, the physical statements become exact theorems.
Khinchin considered two types of background theorems: ergodic theorems and limit theorems for high-dimensional distributions. He claimed that the foundations of statistical mechanics should be a complete abstraction from the nature of the forces. Limit theorems do utilize very general properties of distributions in high dimension, but the expectation that ergodicity is a typical and universal property of smooth high-dimensional multi-particle Hamiltonian systems was not met [7]. To stress that the ergodicity problem is nontrivial, we have to refer to the Oxtoby–Ulam theorem about metric transitivity of a generic volume-preserving continuous transformation [8]. (We see that typical properties of continuous transformations differ significantly from typical properties of smooth transformations.)
Various programmes proposed for the mathematical foundation of statistical mechanics were discussed, for example, by Dobrushin [9] and Batterman [10]. Despite the impressive proofs of ergodicity of some systems (hyperbolic flows or some billiard systems, for example), the Jaynes point of view [11] on the role of ergodicity in the foundations of statistical mechanics has now become dominant: the Ergodic Hypothesis is neither a necessary nor a sufficient condition for the foundation of statistical mechanics (Dobrushin [9] attributed this opinion to Lebowitz, while Jaynes [11] referred to Gibbs [6], who, perhaps, “did not consider ergodicity as relevant to the foundation of the subject”).
Through the efforts of many mathematicians, the limit theorems of probability theory and the results about ensemble equivalence from the foundations of statistical physics were developed far enough to become a general theory of measure concentration phenomena. Three works were especially important for our work [12, 13, 14]. The book [15] gives an introduction to the mathematical theory of measure concentration. A simple geometric introduction to this phenomenon was given by Ball [16].

Perhaps the simplest manifestation of measure concentration is the concentration of the volume of a high-dimensional ball near its boundary sphere. It is useful to stress that the ‘ball’ here is not necessarily Euclidean: it may be the ball of any norm. Lévy [17] recognised this phenomenon as a very important property of the geometry of high-dimensional spaces. He also proved that equidistributions in balls are asymptotically equivalent in high dimensions to the Gaussian distributions with the same mean value of the squared radius. Gibbs de facto used these properties of the sublevel sets of energy to demonstrate the equivalence of ensembles (the microcanonical distribution on the surface of constant energy and the canonical distribution in the phase space with the same mean energy).
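This concentration of ball volume near the bounding sphere is easy to check numerically. The sketch below is our illustration, not taken from the cited works; it assumes the Euclidean norm and the standard Gaussian-direction sampler, and compares the empirical fraction of uniform points in the unit n-ball that land in the outer 10% shell with the exact value 1 − 0.9^n:

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_ball(n, size):
    """Sample points uniformly from the unit n-ball:
    direction from a normalized Gaussian, radius as U^(1/n)."""
    g = rng.standard_normal((size, n))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)
    radii = rng.random(size) ** (1.0 / n)
    return directions * radii[:, None]

for n in (2, 20, 200):
    r = np.linalg.norm(uniform_ball(n, 20_000), axis=1)
    shell = np.mean(r > 0.9)  # empirical volume fraction of the outer shell
    print(n, round(shell, 3), round(1 - 0.9 ** n, 3))
```

For n = 2 the outer shell holds only 19% of the volume; for n = 200 it holds essentially all of it.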
Maxwell used the concentration of measure phenomenon in the following setting. Consider a rotationally symmetric probability distribution on the unit sphere in n-dimensional space. Then its orthogonal projection on a line will be close to a Gaussian distribution with small variance, of the order of 1/n for large n. This is exactly the Maxwellian distribution for one degree of freedom in a gas (and the distribution on the unit sphere is the microcanonical distribution of the kinetic energy of a gas, when the potential energy is negligibly small). Geometrically, this means that if we look at the one-dimensional projections of the unit sphere then the “observable diameter” will be small, of the order of 1/√n.

Lévy noticed that instead of orthogonal projections on a straight line we can use any Lipschitz function f (with Lipschitz constant 1). Let points be distributed on the unit sphere in n-dimensional space with a rotationally symmetric probability distribution. Then the values of f will be distributed ‘not more widely’ than a normal distribution around the mean value: the probability that f deviates from its mean by a given amount is bounded by a Gaussian tail whose width is of the order of 1/√n. Interestingly, if in this inequality we use the median value of f instead of the mean, then the estimate of the constant can be decreased. From the statistical mechanics point of view, this Lévy Lemma describes the upper limit of fluctuations in a gas for an arbitrary observable quantity f. The only condition is the sufficient regularity of f (the Lipschitz property).

Hilbert’s 6th Problem influenced this stream of research either directly (Kolmogorov and, perhaps, Khinchin, among others) or indirectly, through the directly affected works. And it continues to extend this influence to other areas, including high-dimensional data analysis, data mining, and machine learning.
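Both observations, Maxwell’s Gaussian projection and Lévy’s concentration of Lipschitz functions, can be illustrated numerically. In the hedged sketch below (our own toy example) we sample uniformly from a high-dimensional unit sphere and check that (i) a one-dimensional projection has variance close to 1/n and (ii) a particular 1-Lipschitz function, the distance to a fixed point, chosen here purely for illustration, has small spread:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 500, 20_000

# Uniform points on the unit sphere: normalized standard Gaussian vectors.
x = rng.standard_normal((m, n))
x /= np.linalg.norm(x, axis=1, keepdims=True)

# Maxwell: a one-dimensional projection is nearly Gaussian, variance ~ 1/n.
proj = x[:, 0]
print(proj.var(), 1.0 / n)

# Levy: a 1-Lipschitz function, here f(x) = ||x - e1||, concentrates
# tightly around its mean (roughly sqrt(2) here).
e1 = np.zeros(n)
e1[0] = 1.0
f = np.linalg.norm(x - e1, axis=1)
print(f.mean(), f.std())
```

The standard deviation of f is of the order 1/√n, far smaller than the unit diameter of the sphere.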
At the turn of the millennium, Donoho gave a lecture about the main problems of high-dimensional data analysis [18] with the impressive subtitle “The curses and blessings of dimensionality”. He used the term curse of dimensionality “to refer to the apparent intractability of systematically searching through a high-dimensional space, the apparent intractability of accurately approximating a general high-dimensional function, the apparent intractability of integrating a high-dimensional function.” To describe the blessing of dimensionality he referred to the concentration of measure phenomenon, “which suggest that statements about very high-dimensional settings may be made where moderate dimensions would be too complicated.” Anderson et al characterised some manifestations of this phenomenon as “The More, the Merrier” [19].
In 1997, Kainen described the phenomenon of the blessing of dimensionality, illustrated it with a number of different examples in which high dimension actually facilitates computation, and suggested connections with geometric phenomena in high-dimensional spaces [20].
The claim of Donoho’s lecture was similar to that of Hilbert’s talk, and he cited Hilbert’s talk explicitly (“My personal research experiences, cited above, convince me of Hilbert’s position, as a long run proposition, operating on the scale of centuries rather than decades.”). The role of Hilbert’s 6th Problem in the analysis of the curse and blessing of dimensionality was, again, not mentioned.
The blessing of dimensionality and the curse of dimensionality are two sides of the same coin. For example, a typical property of a random finite set in a high-dimensional space is that the squared distances from these points to a selected point are, with high probability, close to the average (or median) squared distance. This property drastically simplifies the expected geometry of data (blessing) [21, 22] but, at the same time, makes similarity search in high dimensions difficult and even useless (curse) [23].
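A quick numerical sketch of this effect (our illustration; uniform coordinates are an arbitrary choice) shows how, in high dimension, the squared distances from random points to a query point crowd around their mean:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 500

for n in (2, 1000):
    data = rng.random((m, n))   # random finite set in the unit cube
    query = rng.random(n)       # a selected point
    d2 = np.sum((data - query) ** 2, axis=1)
    rel_spread = d2.std() / d2.mean()  # relative spread of squared distances
    print(n, round(rel_spread, 3))
```

The relative spread decays roughly as 1/√n: distinguishing “near” from “far” neighbours becomes hard (curse), while separating a point by a simple hyperplane becomes easy (blessing).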
Extension of the 6th Hilbert Problem to data mining and machine learning is a challenging task. There exists no unified general definition of machine learning. Most classical texts consider machine learning through the formalisation and analysis of a set of standardised tasks [24, 25, 26]. Traditionally, these tasks are:

Classification – learning to predict a categorical attribute using values of given attributes on the basis of given examples (supervised learning);

Regression – learning to predict numerical attributes using values of given attributes on the basis of given examples (supervised learning);

Clustering – joining similar objects into several clusters (unsupervised learning) [27];

Probability distribution estimation.
For example, Cucker and Smale [24] considered the least squares regression problem. This is the problem of the best approximation of an unknown function from a random sample of input–output pairs. Selection of “the best” regression function means minimisation of the mean squared deviation between the observed and predicted values. They used concentration inequalities to evaluate the probability that the approximation has a given accuracy.
It is important to mention that the Cucker–Smale approach was inspired, in particular, by J. von Neumann: “We try to write in the spirit of H. Weyl and J. von Neumann’s contributions to the foundations of quantum mechanics” [24]. Von Neumann’s book [3] was a step in the realisation of the programme of Hilbert’s 6th Problem, as we have seen. Therefore, the Cucker–Smale “Mathematical foundation of learning” is a grandchild of the 6th Problem. This is the fourth “Foundation” (after Kolmogorov, von Neumann, and Khinchin). Indeed, it was an attempt to give a rigorous development of what they “have found to be the central ideas of learning theory”. This problem statement follows Hilbert’s request for a “rigorous and satisfactory development of the method of mean values”, but this time the development was done for machine learning instead of mathematical physics.
Cucker and Smale followed Gauss and proved that the least squares solution enjoys remarkable statistical properties, i.e., it provides the minimum variance estimate [24]. Nevertheless, non-quadratic functionals are employed for the solution of many problems: to enhance robustness, to avoid oversensitivity to outliers, to find sparse regression with exclusion of unnecessary input variables, etc. [25, 26]. Even non-convex quasi-norms and their tropical approximations are used efficiently to provide sparse and robust learning results [32]. Vapnik [26] defined a formalised fragment of machine learning using minimisation of a risk functional that is the mathematical expectation of a general loss function.
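As a minimal illustration of the least squares setting (a toy linear model of our own choosing, not the general function classes treated in [24]), the empirical risk minimiser admits a closed form:

```python
import numpy as np

rng = np.random.default_rng(3)

# A random sample of input-output pairs from a noisy linear model.
m = 200
X = np.c_[np.ones(m), rng.random(m)]   # design matrix with an intercept
w_true = np.array([1.0, 2.0])
y = X @ w_true + 0.05 * rng.standard_normal(m)

# Minimising the empirical mean squared error gives the classical
# (Gauss) least squares solution w = argmin ||Xw - y||^2.
w = np.linalg.lstsq(X, y, rcond=None)[0]
mse = np.mean((X @ w - y) ** 2)
print(w.round(2), round(mse, 4))
```

Replacing the squared loss by a robust or sparsity-inducing loss, as in [25, 26, 32], removes the closed form but keeps the same risk-minimisation structure.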
M. Gromov [33] proposed a radically different concept of ergosystems, which function by building their “internal structure” out of the “raw structures” in the incoming flows of signals. The essential mechanism of ergosystem learning is goal-free and independent of any reinforcement. In this concept, loosely speaking, “structure” means “interesting structure”, and the learning of structure is goal-free and should itself be considered a structurally interesting process.
There are many other approaches and algorithms in machine learning which use specific ideas from statistical mechanics, such as annealing and spin glasses (see, for example, [34]), as well as randomization. It was demonstrated recently that the assignment of random parameters should be data-dependent to provide the efficient and universal approximation property of the randomized learner model [35]. Various methods for evaluation of the output weights of the hidden nodes after random generation of new nodes were also tested [35]. Swarm optimization methods for learning, with random regeneration of the swarm (“virtual particles”) after several epochs of learning, were developed in 1990 [36]. Sequential Monte Carlo methods for learning neural networks were elaborated and tested [37]. A comprehensive overview of the classical algorithms and modern achievements in stochastic approaches to neural networks was given by Scardapane and Wang [38].

In our paper, we do not discuss these ideas; instead, we focus on a deep and general similarity between high-dimensional problems in learning and statistical physics. We summarise some measure concentration phenomena which drastically affect machine learning problems in high dimension.
2 Waist concentration and random bases in machine learning
After the classical works of Fisher [39] and Rosenblatt [40], linear classifiers have been considered the starting point of Data Analytics and Machine Learning (see e.g. [26, 41, 42], and references therein). The mathematical machinery powering these developments is based on the concept of linear separability.
Definition 2.1.
Let X and Y be subsets of R^n. Recall that a linear functional l on R^n separates X and Y if there exists a c ∈ R such that l(x) > c for all x ∈ X and l(y) < c for all y ∈ Y.
A set S is linearly separable if for each x ∈ S there exists a linear functional l such that l(x) > l(y) for all y ∈ S, y ≠ x.
If X is a set of measurements or data samples labelled as “Class 1”, and Y is a set of data labelled as “Class 2”, then a functional separating X and Y is the corresponding linear classifier. The fundamental question, however, is whether such functionals exist for the given X and Y, and, if the answer is “yes”, how to find them.
It is well known that if (i) X and Y are disjoint, (ii) the cardinality of X ∪ Y does not exceed n + 1, and (iii) the elements of X ∪ Y are in general position, then they are vertices of a simplex. Hence, in this setting, there always is a linear functional separating X and Y.
Rosenblatt [40] used a population of linear threshold elements with random synaptic weights (A-elements) as a layer before an R-element, that is, a linear threshold element which learns iteratively (the authors of some papers and books call the R-elements “perceptrons” and thereby lose the complex structure of the perceptron, with its layer of random elements). The randomly initiated elements of the first layer can undergo selection of the most relevant elements. According to Rosenblatt [40], any set of data vectors becomes linearly separable after transformation by the layer of A-elements, if the number of these randomly chosen elements is sufficiently large. Therefore, the perceptron can solve any classification problem where the classes are defined by pointing out examples (ostensive definition). But this “sufficiently large” number of random elements depends on the problem and may be large indeed: it can grow proportionally to the number of examples in the classification task. The perceptron with a sufficiently large number of elements can approximate binary-valued functions on finite domains with arbitrary accuracy. Recently, bounds on the errors of these approximations were derived [43]. It is proven that unless the number of network units grows faster than any polynomial of the logarithm of the size of the domain, a good approximation cannot be achieved for almost any uniformly randomly chosen function. These results are obtained by application of concentration inequalities.

The method of random projections became popular in machine learning after the Johnson–Lindenstrauss Lemma [44], which states that relatively large sets of vectors in a high-dimensional Euclidean space can be linearly mapped into a space of much lower dimension with approximate preservation of distances. Such a mapping can be constructed (with high probability) as a projection onto random basis vectors, with rescaling of the projection by an appropriate factor [45]. Repeating the projection several times and selecting the best of them, one can achieve the appropriate accuracy of distance preservation. The number of points can be exponentially large in the reduced dimension.
Two unit random vectors in high dimension are almost orthogonal with high probability. This is a simple manifestation of the so-called waist concentration [13]. A high-dimensional sphere is concentrated near its equator. This is obvious: just project the sphere onto a hyperplane and use the concentration argument for a ball on the hyperplane (with a simple trigonometric factor). It seems highly nontrivial, however, if we ask: near which equator? The answer is: near each equator. This answer is obvious because of rotational symmetry, but it seems counterintuitive.
We call vectors x and y from Euclidean space almost orthogonal if their inner product (x, y) is small. Let x and y be i.i.d. random vectors distributed uniformly (rotationally invariant) on the unit sphere in Euclidean space R^n. Then the distribution of their inner product satisfies an exponential concentration inequality (see, for example, [16] or [46], and compare to the Maxwellian distribution and Lévy’s lemma):
Proposition 2.1.
Let x_1, …, x_N be i.i.d. random vectors distributed uniformly (rotationally invariant) on the unit sphere in Euclidean space R^n. For N satisfying
(1) 
all vectors are pairwise almost orthogonal with high probability. [46]
There are two consequences of this statement: (i) in high dimension there exist exponentially many pairwise almost orthogonal vectors in R^n, and (ii) random vectors are almost orthogonal with high probability even for exponentially large N satisfying (1). The existence of exponentially large almost orthogonal systems in high-dimensional spaces was discovered in 1993 by Kainen and Kůrková [47]. They introduced the notion of quasiorthogonal dimension, which was immediately utilised in the problem of random indexing of high-dimensional data [21]. The fact that an exponentially large random set consists of pairwise almost orthogonal vectors with high probability was demonstrated in the work [46] and used for the analysis of the data approximation problem in random bases. We show that not only do such almost orthogonal sets exist, but they are also typical in some sense.
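The near-orthogonality of large random systems is easy to observe directly; the following sketch (dimensions chosen arbitrarily by us) checks the largest pairwise inner product among hundreds of random unit vectors:

```python
import numpy as np

rng = np.random.default_rng(5)
n, N = 2000, 300

V = rng.standard_normal((N, n))
V /= np.linalg.norm(V, axis=1, keepdims=True)

G = V @ V.T                         # Gram matrix of pairwise inner products
off = G[~np.eye(N, dtype=bool)]     # off-diagonal entries
print(round(np.abs(off).max(), 3))  # every pair is nearly orthogonal
```

Each inner product is approximately Gaussian with standard deviation 1/√n ≈ 0.022 here, so even the maximum over all ~45,000 pairs stays far below 1.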
Randomly generated vectors will be almost orthogonal to a given data vector x (the angle between them will be close to π/2 with probability close to one). Therefore, the coefficients in the approximation of x by a linear combination of such random vectors could be arbitrarily large, and the approximation problem will be ill-conditioned, with high probability. The following alternative is proven for approximation by random bases:

Approximation of a high-dimensional data vector by linear combinations of randomly and independently chosen vectors requires (with high probability) generation of exponentially large “bases”, if we would like to use bounded coefficients in the linear combinations.

If arbitrarily large coefficients are allowed, then the number of randomly generated elements that are sufficient for approximation is even less than the dimension. We have to pay for such a reduction of the number of elements by ill-conditioning of the approximation problem.
We have to choose between a well-conditioned approximation problem in exponentially large random bases and an ill-conditioned problem in relatively small (moderate) random bases. This dichotomy is fundamental, and it is a direct consequence of the waist concentration phenomenon. In what follows, we formally present another concentration phenomenon, stochastic separation theorems [48, 49], and outline their immediate applications in AI and neuroscience.
3 Stochastic separation theorems and their applications in Artificial Intelligence systems
3.1 Stochastic separation theorems
The existence of a linear functional that separates two finite sets is no longer obvious when the number of points exceeds the dimension of the space. A possible way to answer both questions could be to cast the problem as a constrained optimization problem within the framework of, e.g., support vector machines [26]. The issue with this approach is that theoretical worst-case estimates of the computational complexity of determining such functionals are super-linear in the sample size (for quadratic loss functions); a posteriori analysis of experiments on practical use cases, however, suggests that the complexity could be much smaller and reduce to linear or even sublinear in the sample size [50].

This apparent discrepancy between the worst-case estimates and a posteriori evaluation of computational complexities can be resolved if concentration effects are taken into account. If the dimension of the underlying topological vector space is large then random finite but exponentially large samples are linearly separable, with high probability, for a range of practically relevant classes of distributions. Moreover, we show that the corresponding separating functionals can be derived using Fisher linear discriminants [39]. The computational complexity of the latter is linear in the sample size. It can be made sublinear too if proper sampling is used to estimate the corresponding covariance matrices. As we have shown in [49], the results hold for i.i.d. random points from equidistributions in a ball, a cube, and from distributions that are products of measures with bounded support. The conclusions are based on stochastic separation theorems, the statements of which for the relevant classes of distributions are provided below.
Theorem 3.1 (Equidistribution in the unit ball [48, 49]).
Let {x_1, …, x_M} be a set of i.i.d. random points from the equidistribution in the unit ball, and let 0 < r < 1. Then
(2) 
(3) 
(4) 
The proof of the theorem can be illustrated with Fig. 1. The probability that a single element belongs to the difference of two balls centred at the origin is bounded from below. Consider the hyperplane
This hyperplane partitions the unit ball centred at the origin into two disjoint subsets: the spherical cap (shown as the grey shaded area in Fig. 1) and the rest of the ball. The element is in the shaded area and lies on the line containing the corresponding vector. The volume of this spherical cap does not exceed the volume of the half-ball (shown as a blue dashed circle in the figure). Recall that
(5) 
This assures that (2) holds. Applying the same argument to all elements of the set results in (3). Finally, to show that (4) holds, observe that the length of the corresponding segment of the tangent line to the sphere is always bounded. Hence the cosine of the angle between an element of the set and the chosen vector is bounded from above. The estimate now follows from (5).
According to Theorem 3.1, the probability that a single element from the sample is linearly separated from the rest of the set by the corresponding hyperplane admits an explicit lower bound.
This probability estimate depends on both the sample size and the dimensionality. An interesting consequence of the theorem is that if one picks a probability value then the maximal sample size for which the set remains linearly separable with probability no less than that value grows at least exponentially with the dimension. In particular, the following holds.
Corollary 3.1.
Let {x_1, …, x_M} be a set of i.i.d. random points from the equidistribution in the unit ball. If
(6) 
then the corresponding estimate holds. If
(7) 
then
In particular, if inequality (7) holds then the set is linearly separable with high probability.
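The separability claim of Theorem 3.1 can be probed empirically. In this sketch (our parameters; the separating hyperplane is taken, as in the proof, orthogonal to the point itself) we count how many of M random points in the unit ball are cut off from all the others by a single hyperplane:

```python
import numpy as np

rng = np.random.default_rng(6)
n, M = 100, 2000

# Uniform sample from the unit n-ball.
g = rng.standard_normal((M, n))
X = g / np.linalg.norm(g, axis=1, keepdims=True)
X *= rng.random(M)[:, None] ** (1.0 / n)

G = X @ X.T
norms2 = np.diag(G)
# Point i is separated from the rest by the hyperplane (x_i, y) = ||x_i||^2.
separable = [(np.delete(G[i], i) < norms2[i]).all() for i in range(M)]
frac = float(np.mean(separable))
print(frac)   # close to 1: almost every point is linearly separable
```

With n = 100, essentially every one of the 2000 points is separable from the remaining 1999 by its own hyperplane, even though M is far larger than n.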
The linear separability property of finite but exponentially large samples of random i.i.d. elements is not restricted to equidistributions in the unit ball. As has been noted in [22], it holds for equidistributions in ellipsoids as well as for Gaussian distributions. Moreover, it can be generalized to product distributions in a unit cube. Consider, e.g., the case when the coordinates of the vectors in the set are independent random variables with given expectations and variances. The following analogue of Theorem 3.1 can now be stated.

Theorem 3.2 (Product distribution in a cube [49]).
Let {x_1, …, x_M} be i.i.d. random points from the product distribution in a unit cube, with the variances of the coordinates bounded away from zero. Then
(8) 
(9) 
The proof is based on concentration inequalities in product spaces [14, 51]. Numerous generalisations of Theorems 3.1 and 3.2 are possible for different classes of distributions, for example, for weakly dependent variables, etc.
Linear separability, as an inherent property of data sets in high dimension, is not necessarily confined to cases in which a linear functional separates a single element of a set from the rest. Theorems 3.1 and 3.2 can be generalized to account for tuples, too. An example of such a generalization is provided in the next theorem.
Theorem 3.3 (Separation of tuples [52]).
Let {x_1, …, x_M} be i.i.d. samples from the equidistribution in the unit ball, and let Y be a subset of elements from this sample such that
(10) 
Then
(11) 
where
subject to:
The separating linear functional is again the inner product, and the separating hyperplane can be taken in the form [52]:
(12) 
where the maximizer of the nonlinear program in the right-hand side of (11) is used. To see this, observe that the corresponding bounds hold for all elements of the tuple, with the probability indicated. With this probability the following estimate holds:
Hence
If the elements of the tuple are uncorrelated, i.e. their pairwise inner products are small, then the distance from the spherical cap induced by the linear functional (12) to the centre of the ball decreases with the size of the tuple. This means that the lower-bound probability estimate in (11) is expected to decrease too. On the other hand, if the elements of the tuple are all positively correlated, then one can derive a lower-bound probability estimate which does not depend on the size of the tuple.
Peculiar properties of data in high dimension, expressed in terms of linear separability, have several consequences and applications in the realm of Artificial Intelligence and Machine Learning, examples of which are provided in the next sections.
3.2 Correction of legacy AI systems
Legacy AI systems, i.e. AI systems that have been deployed and are currently in operation, are becoming more and more widespread. Well-known commercial examples are provided by global multinationals, including Google, IBM, Amazon, Microsoft, and Apple. Numerous open-source legacy AIs have been created to date, together with dedicated software for their creation (e.g. the Caffe [53], MXNet [54], Deeplearning4j [55], and Tensorflow [56] packages). These AI systems require significant computational and human resources to build. Regardless of the resources spent, virtually any AI and/or machine-learning-based system is likely to make mistakes. Real-time correction of these mistakes by retraining is not always viable due to the resources involved. AI retraining is not necessarily desirable either, since the AI’s performance after retraining may not be guaranteed to exceed that of the old one. We can, therefore, formulate the technical requirements for the correction procedures. A corrector should: (i) be simple; (ii) not change the skills of the legacy system; (iii) allow fast non-iterative learning; and (iv) allow correction of new mistakes without destroying previous corrections.

A possible remedy to this issue is the AI correction method [22] based on stochastic separation theorems. Suppose that at a given time instance the values of signals from the inputs, outputs, and internal state of a legacy AI system can be combined together to form a single measurement object. All entries in this object are numerical values, and each measurement corresponds to a relevant decision of the AI system at that time. Over the course of the system’s existence, a set of such measurements is collected. For each element in the set, a label “correct” or “incorrect” is assigned, depending on external evaluation of the system’s performance. Elements corresponding to “incorrect” labels are then filtered out and dealt with separately by an additional subsystem, a corrector. A diagram illustrating the process is shown in Fig. 2.
In this diagram, the original legacy AI system (shown as Legacy AI System 1) is supplied with a corrector altering its responses. The combined new AI system can in turn be augmented by another corrector, leading to a cascade of AI correctors (see Fig. 2).
If the distributions modelling elements of the set are, e.g., an equidistribution in a ball or an ellipsoid, a product of measures, a Gaussian, etc., then:

The separating linear functionals admit closed-form formulae (Fisher linear discriminant) and can be determined in a non-iterative way.

Availability of explicit closed-form formulae in the form of the Fisher discriminant offers major computational benefits, as it eliminates the need to employ iterative and more computationally expensive alternatives such as, e.g., SVMs.

If a cascade of correctors is employed, performance of the corrected system drastically improves [22].
The results, perhaps, can be generalized to other classes of distributions that are regular enough to enjoy the stochastic separability property.
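A minimal corrector sketch under these assumptions (synthetic Gaussian “decision vectors” of a hypothetical legacy system; the functional is a regularized Fisher-type discriminant in closed form):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200

# Feature vectors associated with the legacy system's decisions:
# a large "correct" cloud and a handful of "incorrect" ones, drawn
# here from the same distribution to make the test as hard as possible.
correct = rng.standard_normal((1000, n))
errors = rng.standard_normal((3, n))

# Closed-form, non-iterative Fisher-type discriminant.
mu = correct.mean(axis=0)
cov = np.cov(correct.T) + 1e-3 * np.eye(n)   # regularized covariance
w = np.linalg.solve(cov, errors.mean(axis=0) - mu)

threshold = (errors @ w).min()
false_alarm = float(np.mean(correct @ w >= threshold))
print(false_alarm)   # fraction of correct decisions the corrector would touch
```

In dimension 200 the three error points are cut off by a single linear functional while almost none of the 1000 correct decisions are affected, with no iterative training involved.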
The corrector principle has been demonstrated in [22] for a legacy AI system in the form of a convolutional neural network trained to detect pedestrians in images. AI errors were set to be false positives, and the corrector system had to remove labelled false positives by a single linear functional. A detailed description of the experiment is provided in [22], and a performance snapshot is shown in Fig. 3. The feature vectors involved were high-dimensional. As we can see from Fig. 3, single linear functionals are capable of removing several errors of a legacy AI without compromising the system’s performance. Note that the AI errors, i.e. false positives, were chosen at random and had not been grouped or clustered to take advantage of positive correlation. (The definition of clusters can vary [27].) As the number of errors to be removed grows, performance starts to deteriorate. This is in agreement with our theoretical predictions (Theorem 3.3).
3.3 Knowledge transfer between AI systems
Legacy AI correctors can be generalized to a computational framework for automated AI knowledge transfer, whereby labelling of the set is provided by an external AI system. AI knowledge transfer has been a focus of growing attention during the last decade [57]. Application of stochastic separation theorems to AI knowledge transfer was proposed in [52], and the corresponding functional diagram of this automated setup is shown in Fig. 4.
In this setup a student AI is monitored by a teacher AI. Over a period of activity the student system generates a set of objects. The exact composition of the set depends on the task at hand. If the student’s outputs differ from those of the teacher for the same input, then an error is registered in the system. Objects associated with errors are combined into the error set. The process gives rise to two disjoint sets:
Having created these two sets, knowledge transfer from the teacher to the student can now be organized in accordance with Algorithm 1. Note that data regularization and whitening are included in the preprocessing step of Algorithm 1.

i. Preprocessing

i.a Centering. For the given set S, determine the set average, x̄, and generate the centered sets
S_c = { x − x̄ : x ∈ S \ Y } and Y_c = { x − x̄ : x ∈ Y }.

i.b Regularization. Determine the covariance matrices Cov(S_c) and Cov(Y_c) of the sets S_c and Y_c. Let λ_1 ≥ λ_2 ≥ … ≥ λ_n be the eigenvalues of Cov(S_c), and h_1, …, h_n be the corresponding eigenvectors. If some of the λ_i are zero, or if the ratio λ_1/λ_n is too large, project S_c and Y_c onto an appropriately chosen set of eigenvectors: x ↦ Hᵀx, where H is the matrix comprising the significant principal components of Cov(S_c).

i.c Whitening. For the centred and regularized dataset S_c, derive its covariance matrix, Cov(S_c), and generate the whitened sets
S_w = { Cov(S_c)^(−1/2) x : x ∈ S_c } and Y_w = { Cov(S_c)^(−1/2) x : x ∈ Y_c }.

ii. Knowledge transfer

ii.a Clustering. Pick the number of clusters p, 1 ≤ p ≤ |Y_w|, and partition the set Y_w into p clusters Y_1, …, Y_p so that the elements of these clusters are, on average, pairwise positively correlated; that is, for each cluster Y_j the average inner product ⟨x, y⟩ over pairs of distinct elements x, y ∈ Y_j is positive.

ii.b Construction of Auxiliary Knowledge Units. For each cluster Y_j, j = 1, …, p, construct a separating linear functional ℓ_j and a threshold c_j by Fisher's linear discriminant:
ℓ_j(x) = ⟨x̄_j − x̄_w, x⟩,
where x̄_j and x̄_w are the averages of Y_j and S_w, respectively. The separating hyperplane is ℓ_j(x) = c_j.

ii.c Integration. Integrate the Auxiliary Knowledge Units into the decision-making pathways of AI_s. If, for an object x generated by an input to AI_s, ℓ_j(x) > c_j for any j, then report accordingly (swap labels, report an error, etc.).

The algorithm can be used for AI correctors too. Similar to AI correction, AI knowledge transfer can be cascaded as well. Specific examples and illustrations of AI knowledge transfer based on stochastic separation theorems are discussed in [52].
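A minimal sketch of the centering–whitening–clustering–discriminant pipeline of Algorithm 1 can be given on synthetic data. Everything numeric here is an illustrative assumption of ours (dimension, sample sizes, the choice of two clusters, a tiny k-means in place of a full clustering method), not the setup of [52]:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 400
S = rng.standard_normal((3000, d))               # objects generated by the student AI
err_idx = rng.choice(3000, size=12, replace=False)
Y = S[err_idx]                                   # objects flagged as errors by the teacher
rest = np.delete(S, err_idx, axis=0)

# i.a Centering
mean = S.mean(axis=0)
Yc, restc = Y - mean, rest - mean

# i.b/i.c Regularized whitening: multiply by Cov^(-1/2)
cov = np.cov(S - mean, rowvar=False) + 1e-6 * np.eye(d)
vals, vecs = np.linalg.eigh(cov)
W = vecs @ np.diag(vals ** -0.5) @ vecs.T
Yw, restw = Yc @ W, restc @ W

# ii.a Clustering: a tiny 2-means on the normalized errors, grouping
# directions that are pairwise positively correlated
Z = Yw / np.linalg.norm(Yw, axis=1, keepdims=True)
centers = Z[:2].copy()
for _ in range(10):
    labels = np.argmax(Z @ centers.T, axis=1)
    centers = np.stack([Z[labels == k].mean(axis=0) if np.any(labels == k)
                        else centers[k] for k in range(2)])

# ii.b Auxiliary Knowledge Units: Fisher functional and threshold per cluster
units = []
for k in range(2):
    cluster = Yw[labels == k]
    if len(cluster) == 0:
        continue
    w = cluster.mean(axis=0) - restw.mean(axis=0)
    c = ((cluster @ w).min() + (restw @ w).max()) / 2
    units.append((w, c))

# ii.c Integration: flag an object if any unit fires
def corrected(x):
    return any(x @ w > c for (w, c) in units)

intercepted = sum(corrected(x) for x in Yw)
print(intercepted, "of", len(Yw), "errors intercepted")
```

With the whitened high-dimensional data, each cluster of errors sits far from the bulk along its own Fisher direction, so a couple of linear units intercept all flagged objects while leaving the correct responses untouched.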
3.4 Grandmother cells, memory, and high-dimensional brain
Stochastic separation theorems are a generic phenomenon, and their applications are not limited to AI and machine learning systems. An interesting consequence of these theorems for neuroscience has been discovered and presented in [58]. Recently, it has been shown that in humans new memories can be learnt very rapidly by supposedly individual neurons from a limited number of experiences [59]. Moreover, neurons can exhibit remarkable selectivity to complex stimuli, evidence that has led to debates around the existence of the so-called “grandmother” and “concept” cells [60, 61, 62] and their role as elements of a declarative memory. These findings suggest that not only can the brain learn rapidly, but it can also respond selectively to “rare” individual stimuli. Moreover, experimental evidence indicates that such cognitive functionality can be delivered by single neurons [59, 60, 61]. The fundamental questions, hence, are: How is this possible? And what could be the underlying functional mechanisms?

It has been shown in [58] that stochastic separation theorems offer a simple answer to these fundamental questions. In particular, extreme neuronal selectivity and rapid learning can already be explained by these theorems. Model-wise, the explanation of extreme selectivity is based on a conventional and widely accepted phenomenological generic description of neural response to stimulation. Rapid acquisition of selective responses to multiple stimuli by single neurons is ensured by classical Hebbian synaptic plasticity [63].
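The geometric core of this argument admits a toy illustration. The sketch below uses synthetic random stimuli of our own choosing, and a crude one-shot Hebbian imprinting stands in for the plasticity models analysed in [58, 63]: in high dimension, a weight vector copied from a single presented stimulus responds strongly to that stimulus and only weakly to hundreds of unrelated ones, because random high-dimensional vectors are quasi-orthogonal:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 1000                                              # dimension of the neuron's input
stream = rng.standard_normal((500, d)) / np.sqrt(d)   # unrelated background stimuli
memory = rng.standard_normal(d) / np.sqrt(d)          # the stimulus to be memorized

# One-shot Hebbian imprinting: after a single presentation the synaptic
# weight vector is proportional to the stimulus itself.
w = memory.copy()

selective = w @ memory                    # response to the memorized stimulus (~1)
background = np.abs(stream @ w).max()     # strongest response to any other stimulus
print(f"memory response {selective:.2f} vs max background {background:.2f}")
```

A single presentation thus suffices for a selective response, in the spirit of the rapid single-neuron learning reported in [59].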
4 Conclusion
The twenty-three Hilbert problems created important “focus points” for the concentration of the efforts of mathematicians for a century. The Sixth Problem differs significantly from the other twenty-two. It is very far from being a purely mathematical problem, and it seems impossible to imagine its “final solution”. The Sixth Problem is a “programmatic call” [64], and it works:

The modern theory of measure concentration phenomena has direct relations to the mathematical foundations of probability and statistical mechanics, uses results of Kolmogorov and Khinchin (among others), and definitely helps to create “a rigorous and satisfactory development of the method of the mean values…”.
The classical measure concentration theorems state that random points from a high-dimensional data distribution are concentrated in a thin layer near an average or median level set of a Lipschitz function. The stochastic separation theorems describe the fine structure of these thin layers: the random points are all linearly separable from the rest of the set, even for exponentially large random sets. Of course, for all these concentration and separation theorems the probability distribution should be “genuinely” high-dimensional. Equidistributions in balls or ellipsoids, or products of distributions with compact support and non-vanishing variance, are simple examples of such distributions. Various generalizations are possible.
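The separability claim can be checked numerically. The following hedged sketch (dimension and sample count are our choices) draws points uniformly from a unit ball and verifies that each point can be cut off from all the others by a hyperplane orthogonal to it, i.e. that ⟨x, y⟩ < ⟨x, x⟩ for every other point y:

```python
import numpy as np

rng = np.random.default_rng(3)
n, M = 100, 2000      # dimension and number of random points

# Uniform sample from the unit ball in R^n: random direction times radius u^(1/n)
X = rng.standard_normal((M, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)
X *= rng.random((M, 1)) ** (1.0 / n)

# x is separable from y by a hyperplane orthogonal to x whenever <x, y> < <x, x>
G = X @ X.T
separable = (G < np.diag(G)[:, None]) | np.eye(M, dtype=bool)
frac = separable.all(axis=1).mean()
print(f"{frac:.1%} of {M} points are linearly separable from all the others")
```

In dimension 100, thousands of points pass this test with overwhelming probability, while in low dimension the same experiment fails badly: this is the thin-shell fine structure described above.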
For which dimensions does the blessing of dimensionality work? This is a crucial question. The naïve point of view that the dimension of data is just the number of coordinates is wrong. This is the dimension of the data space, where the data are originally situated. The notion of the intrinsic dimension of data is needed [66, 67]. The situation when the number N of data points is less (or even much less) than the dimension of the data space is not exotic. Moreover, Donoho [18] considered this property as a generic case in the “post-classical world” of data analysis. In such a situation we really explore data on a plane of dimension at most N − 1 and should modestly reduce our high-dimensional claims. Projection of the data onto that plane can be performed by various methods. We can use as new coordinates the projections of points onto the known data points, or Pearson's correlation coefficients, when suitable, for example, when the data points are fragments of time series or large spectral images, etc. In these new coordinates the data table becomes a square matrix, and further dimensionality reduction can be performed using good old PCA (principal component analysis) or its nonlinear versions, such as principal manifolds [28] or neural autoencoders [68]. A standard example can be found in [69]: the initial data space consisted of fluorescence diagrams and was very high-dimensional. There were 62 data points, and a combination of correlation coordinates with PCA showed an intrinsic dimension of 4 or 5. For the selection of relevant principal components, the Kaiser rule, the broken-stick model, or other heuristic or statistical methods can be used [70].

A similar preprocessing ritual is helpful even in more “classical” cases, when the number of data points exceeds the dimension of the data space. The correlation (or projection) transformation is not essential here, but the formation of relevant features with dimension reduction is important. If after model reduction and whitening (transformation of coordinates to obtain the unit covariance matrix, step i.c in Algorithm 1) the new dimension n is sufficiently large, then for a number of data points growing exponentially with n we can expect the stochastic separation theorems to work with probability close to one. Thus separation of errors with Fisher's linear discriminant is possible, and many other “blessing of dimensionality” benefits are achievable. Of course, some additional hypotheses about the distribution functions are needed for a rigorous proof, but there is practically no chance to check them a priori, and a posteriori validation of the whole system is necessary. In smaller dimensions (for example, less than 10), nonlinear data approximation methods can work well, capturing the intrinsic complexity of data, as principal graphs do [29, 71].
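The correlation-coordinates-plus-PCA ritual can be sketched on synthetic data. The manifold dimension, noise level, and sample sizes below are our assumptions (they only mimic the few-points/huge-dimension regime, not the assay of [69]); the Kaiser rule is used for component selection:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d, k = 62, 5000, 4   # few samples, huge ambient dimension, true intrinsic dimension

# Synthetic data: N points near a k-dimensional linear manifold plus small noise
latent = rng.standard_normal((N, k))
basis = rng.standard_normal((k, d))
X = latent @ basis + 0.01 * rng.standard_normal((N, d))

# Correlation coordinates: represent each point by its correlations with the
# others, turning the N x d table into a square N x N matrix
Xc = X - X.mean(axis=1, keepdims=True)
C = np.corrcoef(Xc)                      # N x N

# PCA on the correlation coordinates; eigenvalues in descending order
eigvals = np.linalg.eigvalsh(np.cov(C, rowvar=False))[::-1]

# Kaiser rule: retain components whose eigenvalue exceeds the average eigenvalue
kaiser_dim = int(np.sum(eigvals > eigvals.mean()))
print("estimated intrinsic dimension:", kaiser_dim)
```

The square correlation matrix makes the subsequent PCA cheap even though the ambient dimension is in the thousands, and the retained components recover the low intrinsic dimension planted in the data.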
We have an alternative: either essentially high-dimensional data with thin-shell concentrations, stochastic separation theorems, and efficient linear methods, or essentially low-dimensional data with efficient complex nonlinear methods. There is the problem of the ‘no man's land’ in between. To explore this land, we can extract the most interesting low-dimensional structure and then consider the residual as an essentially high-dimensional random set, which obeys the stochastic separation theorems. We do not yet know a theoretically justified, efficient approach to this area, but here we should say, following Hilbert: “Wir müssen wissen, wir werden wissen” (“We must know, we shall know”).
The authors declare that they have no competing interests.
Both authors made substantial contributions to conception, proof of the theorems, analysis of applications, drafting the article, revising it critically, and final approval of the version to be published.
This work was supported by Innovate UK grants KTP009890 and KTP010522. IT was supported by the Russian Ministry of Education and Science, projects 8.2080.2017/4.6 (assessment and computational support for knowledge transfer algorithms between AI systems) and 2.6553.2017/BCH Basic Part.
References
 [1] Hilbert D. 1902 Mathematical problems. Bull. Amer. Math. Soc. 8(10), 437–479. (doi:10.1090/S000299041902009233)
 [2] Hilbert D. 1902 The Foundations of Geometry. La Salle IL: Open court publishing Company. See http://www.gutenberg.org/ebooks/17384.
 [3] Von Neumann J. 1955 Mathematical Foundations of Quantum Mechanics. Princeton: Princeton University Press. (English translation from German Edition, Springer, Berlin, 1932.)
 [4] Kolmogorov AN. 1956 Foundations of the Theory of Probability. New York: Chelsea Publ. (English translation from German edition, Springer, Berlin, 1933.)
 [5] Khinchin AY. 1949 Mathematical Foundations of Statistical Mechanics. New York: Courier Corporation. (English translation from the Russian edition, Moscow – Leningrad, 1943.)
 [6] Gibbs JW. 1960 [1902] Elementary Principles in Statistical Mechanics, Developed With Especial Reference to the Rational Foundation of Thermodynamics. Dover Publications, New York.
 [7] Markus L, Meyer KR. 1974 Generic Hamiltonian Dynamical Systems are Neither Integrable Nor Ergodic. Memoirs of Amer. Math. Soc. 144. (http://dx.doi.org/10.1090/memo/0144)
 [8] Oxtoby JC, Ulam SM. 1941 Measurepreserving homeomorphisms and metrical transitivity. Ann. Math. 42. 874–920. (doi:10.2307/1968772)
 [9] Dobrushin RL. 1997 A mathematical approach to foundations of statistical mechanics. Atti dei Convegni Lincei – Accademia Nazionale dei Lincei 131, 227–244. See http://www.mat.univie.ac.at/esiprpr/esi179.pdf
 [10] Batterman RW. 1998 Why equilibrium statistical mechanics works: universality and the renormalization group. Philos. Sci. 65, 183–208. (doi:10.1086/392634)
 [11] Jaynes ET. 1967 Foundations of probability theory and statistical mechanics. In Delaware Seminar in the Foundations of Physics. Springer, Berlin – Heidelberg, 77–101. (doi:10.1007/9783642861024_6)
 [12] Giannopoulos AA, Milman VD. 2000 Concentration property on probability spaces. Adv. Math. 156. 77–106. (doi:10.1006/aima.2000.1949)
 [13] Gromov M. 2003 Isoperimetry of waists and concentration of maps. Geom. Funct. Anal. 13, 178–215. (doi:10.1007/s0003900907031)
 [14] Talagrand M. 1995 Concentration of measure and isoperimetric inequalities in product spaces. Publications Mathematiques de l’IHES 81, 73–205. (doi:10.1007/BF02699376)
 [15] Ledoux M. 2001 The Concentration of Measure Phenomenon. (Mathematical Surveys & Monographs No. 89). Providence: AMS. (doi:10.1090/surv/089)
 [16] Ball K. 1997 An Elementary Introduction to Modern Convex Geometry. Flavors of Geometry, Vol. 31. Cambridge, UK: MSRI Publications. See http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.43.4601.
 [17] Lévy P. 1951 Problèmes concrets d’analyse fonctionnelle. Paris: GauthierVillars.
 [18] Donoho DL. 2000 Highdimensional data analysis: The curses and blessings of dimensionality. AMS Math Challenges Lecture, 1, 32 pp. See http://statweb.stanford.edu/donoho/Lectures/AMS2000/Curses.pdf.
 [19] Anderson J, Belkin M, Goyal N, Rademacher L, Voss J. 2014 The More, the Merrier: the Blessing of dimensionality for learning large Gaussian mixtures, Journal of Machine Learning Research: Workshop and Conference Proceedings 35, 1–30. See http://proceedings.mlr.press/v35/anderson14.pdf.
 [20] Kainen PC. 1997 Utilizing geometric anomalies of high dimension: when complexity makes computation easier. In Computer-Intensive Methods in Control and Signal Processing: The Curse of Dimensionality. New York, Springer, 283–294. (doi:10.1007/9781461219965_18)
 [21] Hecht-Nielsen R. 1994 Context vectors: General-purpose approximate meaning representations self-organized from raw data. In Zurada J, Marks R, Robinson C, eds. Computational Intelligence: Imitating Life. New York: IEEE Press, 43–56.
 [22] Gorban AN, Romanenko I, Burton R, Tyukin I. 2016 One-trial correction of legacy AI systems and stochastic separation theorems. See arXiv:1610.00494 [stat.ML].
 [23] Pestov V. 2013 Is the NN classifier in high dimensions affected by the curse of dimensionality? Comput. Math. Appl. 65, 1427–1437. (doi:10.1016/j.camwa.2012.09.011)
 [24] Cucker F, Smale S. 2002 On the mathematical foundations of learning. Bull. Amer. Math. Soc., 39, 1–49. (doi:10.1090/S0273097901009235)
 [25] Friedman J, Hastie T, Tibshirani R. 2009 The Elements of Statistical Learning. New York: Springer. (doi:10.1007/9780387848587)

 [26] Vapnik V. 2000 The Nature of Statistical Learning Theory. New York: Springer. (doi:10.1007/9781475732641)
 [27] Xu R, Wunsch D. 2008 Clustering. Hoboken: John Wiley & Sons. (doi:10.1002/9780470382776)
 [28] Gorban AN, Kégl B, Wunsch D, Zinovyev A. (Eds.) 2008 Principal Manifolds for Data Visualisation and Dimension Reduction. Lect. Notes Comput. Sci. Eng., Vol. 58. Berlin – Heidelberg: Springer. (doi:10.1007/9783540737506)
 [29] Gorban AN, Zinovyev A. 2010 Principal manifolds and graphs in practice: from molecular biology to dynamical systems. Int. J. Neural Syst. 20, 219–232. (doi:10.1142/S0129065710002383)

 [30] Hyvärinen A, Oja E. 2000 Independent component analysis: algorithms and applications. Neural Netw. 13, 411–430. (doi:10.1016/S08936080(00)000265)
 [31] Mirkin B. 2012 Clustering: A Data Recovery Approach. Boca Raton: CRC Press. (doi:10.1201/b13101)
 [32] Gorban AN, Mirkes EM, Zinovyev A. 2016. Piecewise quadratic approximations of arbitrary error functions for fast and robust machine learning. Neural Netw. 84 (2016), 28–38. (doi:10.1016/j.neunet.2016.08.007)
 [33] Gromov M. 2011 Structures, Learning and Ergosystems: Chapters 14, 6. IHES, BuressurIvette, ÎledeFrance. See http://www.ihes.fr/gromov/PDF/ergobrain.pdf.
 [34] Engel A, Van den Broeck C. 2001 Statistical Mechanics of Learning. Cambridge, UK: Cambridge University Press.
 [35] Wang D, Li M. 2017 Stochastic configuration networks: Fundamentals and algorithms. IEEE Trans. On Cybernetics 47, 3466–3479. (doi:10.1109/TCYB.2017.2734043)
 [36] Gorban AN. 1990 Training Neural Networks, Moscow: USSRUSA JV “ParaGraph”. (doi:10.13140/RG.2.1.1784.4724)
 [37] De Freitas N, Andrieu C, HøjenSørensen P, Niranjan M, Gee A. 2001 Sequential Monte Carlo methods for neural networks. In Sequential Monte Carlo Methods in Practice. New York: Springer, 359–379. (doi:10.1007/9781475734379_17)
 [38] Scardapane S, Wang D. 2017 Randomness in neural networks: an overview. WIREs Data Mining Knowl. Discov. 7, e1200. (doi:10.1002/widm.1200)
 [39] Fisher RA. 1936 The use of multiple measurements in taxonomic problems. Ann. Hum. Genet. 7, 179–188. (doi:10.1111/j.14691809.1936.tb02137.x)
 [40] Rosenblatt F. 1962 Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Washington DC: Spartan Books. See http://www.dtic.mil/docs/citations/AD0256582.
 [41] Duda RD, Hart PE, and Stork DG. 2012 Pattern classification. New York: John Wiley and Sons.
 [42] Aggarwal CC. 2015 Data Mining: The Textbook. Cham – Heidelberg – New York – Dordrecht – London: Springer. (doi:10.1007/9783319141428)
 [43] Kůrková V, Sanguineti M. 2017 Probabilistic lower bounds for approximation by shallow perceptron networks. Neural Netw. 91, 34–41. (doi:10.1016/j.neunet.2017.04.003)
 [44] Johnson WB, Lindenstrauss J. 1984 Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math. 26, 189–206. (doi:10.1090/conm/026/737400)
 [45] Dasgupta S, Gupta A. 2003 An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms 22, 60–65. (doi:10.1002/rsa.10073)
 [46] Gorban AN, Tyukin I, Prokhorov D, Sofeikov K. 2016 Approximation with random bases: Pro et contra. Inf. Sci. 364–365, 129–145. (doi:10.1016/j.ins.2015.09.021)
 [47] Kainen P, Kůrková V. 1993 Quasiorthogonal dimension of Euclidian spaces. Appl. Math. Lett. 6, 7–10. (doi:10.1016/08939659(93)90023G)
 [48] Gorban AN, Tyukin IY, Romanenko I. 2016 The blessing of dimensionality: Separation theorems in the thermodynamic limit. IFAC-PapersOnLine 49, 64–69. (doi:10.1016/j.ifacol.2016.10.755)
 [49] Gorban AN, Tyukin IY. 2017 Stochastic separation theorems. Neural Netw. 94, 255–259. (doi:10.1016/j.neunet.2017.07.014)
 [50] Chapelle O. 2007 Training a Support Vector Machine in the Primal. Neural Comput. 19, 1155–1178. (doi:10.1162/neco.2007.19.5.1155)
 [51] Hoeffding W. 1963. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 301, 13–30. (doi:10.1080/01621459.1963.10500830)
 [52] Tyukin IY, Gorban AN, Sofeikov K, Romanenko I. 2017 Knowledge transfer between artificial intelligence systems. See arXiv:1709.01547 [cs.AI].
 [53] Jia Y. 2013 Caffe: An open source convolutional architecture for fast feature embedding. See http://caffe.berkeleyvision.org/.
 [54] Chen T, Li M, Li Y, Lin M, Wang N, Xiao T, Xu B, Zhang C, Zhang Z. 2015 MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. See https://github.com/dmlc/mxnet.

 [55] Deeplearning4j Development Team. 2016 Deeplearning4j: Open-source distributed deep learning for the JVM, Apache Software Foundation License 2.0. See http://deeplearning4j.org.
 [56] Abadi M, Agarwal A, Barham P et al. 2015 TensorFlow: Large-scale machine learning on heterogeneous systems. See https://www.tensorflow.org/.
 [57] Buchtala O, Sick B. 2007 Basic technologies for knowledge transfer in intelligent systems. In Artificial Life, ALIFE’07. New York: IEEE Press. 251–258. (doi:10.1109/ALIFE.2007.367804)
 [58] Tyukin IY, Gorban AN, Calvo C, Makarova J, Makarov VA. 2017 Highdimensional brain. A tool for encoding and rapid learning of memories by single neurons. See arXiv:1710.11227 [qbio.NC].
 [59] Ison MJ, Quian Quiroga R, Fried I. 2015 Rapid encoding of new memories by individual neurons in the human brain. Neuron 87:220–230. (doi:10.1016/j.neuron.2015.06.016)
 [60] Quian Quiroga R, Reddy, L, Kreiman, G, Koch, C, Fried, I. 2005 Invariant visual representation by single neurons in the human brain. Nature 435, 1102–1107. (doi:10.1038/nature03687)
 [61] Viskontas IV, Quian Quiroga R, Fried, I. 2009. Human medial temporal lobe neurons respond preferentially to personally relevant images. Proc. Nat. Acad. Sci. 106, 21329–21334. (doi:10.1073/pnas.0902319106)
 [62] Quian Quiroga, R. 2012 Concept cells: the building blocks of declarative memory functions. Nat. Rev. Neurosci. 13, 587–597. (doi:10.1038/nrn3251)
 [63] Oja E. 1982 A simplified neuron model as a principal component analyzer. J. Math. Biol. 15, 267–273. (doi:10.1007/BF00275687)
 [64] Corry L. 1997 David Hilbert and the axiomatization of physics (1894–1905). Arch. Hist. Exact Sci. 51, 83–198. (doi:10.1007/BF00375141)
 [65] Wightman AS. 1976 Hilbert’s sixth problem: Mathematical treatment of the axioms of physics. In Browder FE (ed.). Mathematical Developments Arising from Hilbert Problems. Proceedings of Symposia in Pure Mathematics. XXVIII. AMS, 147–240. (doi:10.1090/pspum/028.1/0436800)
 [66] Kégl B. 2003 Intrinsic dimension estimation using packing numbers. In Advances in neural information processing systems 15 (NIPS 2002), Cambridge, US: MIT Press, 697–704.
 [67] Levina E, Bickel PJ. 2005 Maximum likelihood estimation of intrinsic dimension. In Advances in neural information processing systems 17 (NIPS 2004). Cambridge, US: MIT Press, 777–784.
 [68] Bengio Y. 2009 Learning deep architectures for AI. Found. Trends Mach. Learn. 2, 1–127. (doi:10.1561/2200000006)
 [69] Moczko E, Mirkes EM, Ceceres C, Gorban AN, Piletsky S. 2016 Fluorescencebased assay as a new screening tool for toxic chemicals. Sci. Rep. 6, 33922. (doi:10.1038/srep33922)
 [70] Cangelosi R, Goriely A. 2007 Component retention in principal component analysis with application to cDNA microarray data. Biol. Direct 2, 2. (doi:10.1186/1745615022)
 [71] Zinovyev A, Mirkes E. 2013. Data complexity measured by principal graphs. Comput. Math. Appl. 65, 1471–1482. (doi:10.1016/j.camwa.2012.12.009)