Correction of AI systems by linear discriminants: Probabilistic foundations

11/11/2018
by   Alexander N. Gorban, et al.
University of Leicester

Artificial Intelligence (AI) systems sometimes make errors and will make errors in the future, from time to time. These errors are usually unexpected and can lead to dramatic consequences. Intensive development of AI and its practical applications makes the problem of errors more important. Total re-engineering of the systems can create new errors and is not always possible due to the resources involved. The important challenge is to develop fast methods to correct errors without damaging existing skills. We formulated the technical requirements for the 'ideal' correctors. Such correctors include binary classifiers, which separate the situations with high risk of errors from the situations where the AI systems work properly. Surprisingly, for essentially high-dimensional data such methods are possible: a simple linear Fisher discriminant can separate the situations with errors from correctly solved tasks even for exponentially large samples. The paper presents the probabilistic basis for fast non-destructive correction of AI systems. A series of new stochastic separation theorems is proven. These theorems provide new instruments for fast non-iterative correction of errors of legacy AI systems. The new approaches become efficient in high dimensions, for correction of high-dimensional systems in a high-dimensional world (i.e. for processing of essentially high-dimensional data by large systems).


1 Introduction

1.1 Errors and correctors of AI systems

State-of-the-art Artificial Intelligence (AI) systems for data mining consume huge and fast-growing collections of heterogeneous data. Multiple versions of these huge-size systems have been deployed to date on millions of computers and gadgets across various platforms. Inherent uncertainties in data result in unavoidable mistakes (e.g. mislabelling, false alarms, misdetections, wrong predictions, etc.) of the AI data mining systems, which therefore require judicious use. These mistakes become increasingly important because of the numerous and growing number of real-life AI applications in such sensitive areas as security, health care, autonomous vehicles and robots. Widely advertised success in testing of AIs in the laboratories often cannot be reproduced in realistic operational conditions. For example, Metropolitan Police's facial recognition matches ('positives') are reported to be 98% inaccurate (false positives), and South Wales Police's matches are reported to be 91% inaccurate Foxx (2018).

Later on, experts in statistics mentioned that 'figures showing inaccuracies of 98% and 91% are likely to be a misunderstanding of the statistics and are not verifiable from the data presented' face2018 . Nevertheless, the large number of false positive recognitions of 'criminals' leads to serious concerns about the security of AI use, because people have to prove their innocence when police wrongly identify thousands of innocent citizens as criminals.

The successful functioning of any AI system in realistic operational conditions dictates that mistakes must be detected and corrected immediately and locally in the networks of collaborating systems. Real-time correction of the mistakes by re-training is not always viable due to the resources involved. Moreover, the re-training could introduce new mistakes and damage existing skills. All AI systems make errors. Correction of these errors is becoming an increasingly important problem.

The technical requirements for the 'ideal' correctors can be formulated as follows Gorban and Tyukin (2018). A corrector should: (i) be simple; (ii) not damage the skills of the legacy system in the situations where they are working successfully; (iii) allow fast non-iterative learning; and (iv) allow correction of new mistakes without destroying the previous fixes.

Figure 1: Corrector of AI errors. Inputs for this corrector may include input signals, and any internal or output signal of the AI system (marked by circles).

The recently proposed architecture of such a corrector is simple Gorban, Tyukin et al. (2016). It consists of two ideal devices:

  • A binary classifier for separation of the situations with possible mistakes from the situations with correct functioning (or, more advanced, separation of the situations with high risk of mistake from the situations with low risk of mistake);

  • A new decision rule for the situations with possible mistakes (or with high risk of mistakes).

A binary classifier is the main and universal part of the corrector for any AI system, independently of the tasks it performs. The corrected decision rule is more specific.
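To make this architecture concrete, the following minimal Python sketch (not the authors' implementation; the class, its components and the toy wiring are illustrative assumptions) shows how such a corrector can wrap a legacy system: a binary risk classifier routes suspicious inputs to a corrected decision rule and leaves everything else to the unchanged legacy AI.

import numpy as np


class Corrector:
    """External corrector: the legacy AI system itself remains unchanged."""

    def __init__(self, legacy_predict, risk_classifier, corrected_rule):
        self.legacy_predict = legacy_predict    # unchanged legacy AI system
        self.risk_classifier = risk_classifier  # binary classifier: True = high risk of error
        self.corrected_rule = corrected_rule    # new decision rule for flagged situations

    def predict(self, x):
        # Route the input: corrected answer only when the risk classifier fires.
        if self.risk_classifier(x):
            return self.corrected_rule(x)
        return self.legacy_predict(x)


# Toy wiring, purely for illustration:
legacy = lambda x: int(np.sum(x) > 0)       # stand-in for the legacy AI decision
risk = lambda x: np.linalg.norm(x) > 3.0    # stand-in for the error-risk classifier
fix = lambda x: 1 - legacy(x)               # stand-in for the corrected decision rule
corrector = Corrector(legacy, risk, fix)
print(corrector.predict(np.ones(5)))        # low-risk input: the legacy answer is returned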

Such a corrector is an external system, and the main legacy AI system remains unchanged (Fig. 1). One corrector can correct several errors (it is useful to cluster them before corrections). Cascades of correctors are employed for further correction of more errors Gorban and Tyukin (2018): the AI system with the first corrector is a new legacy AI system and can be corrected further (Fig. 2).

Figure 2: Cascade of AI correctors. In this diagram, the original legacy AI system (shown as Legacy AI System 1) is supplied with a corrector altering its responses. The combined new AI system can in turn be augmented by another corrector, leading to a cascade of AI correctors.

1.2 Correctors and blessing of dimensionality

Surprisingly, if the dimension of the data is high enough, then the classification problem in corrector construction can be solved by simple linear Fisher's discriminants even if the data sets are exponentially large with respect to the dimension. This phenomenon is a particular case of the blessing of dimensionality. This term was coined Kainen (1997); Donoho (2000) as an antonym of the 'curse of dimensionality' for the group of high-dimensional geometric phenomena which simplify data mining in high dimensions.

Both the curse and the blessing of dimensionality are manifestations of the measure concentration phenomena, which were discovered in the foundations of statistical physics and analysed further in the context of geometry, functional analysis, and probability theory (reviewed by

GiannopoulosMilman2000 ; Gorban and Tyukin (2018); Ledoux (2005)). The ‘sparsity’ of high-dimensional spaces and concentration of measure phenomena make some low-dimensional approaches impossible in high dimensions. This problem is widely known as the ‘curse of dimensionality’ Trunk1979 ; Donoho (2000); Pestov2013 . The same phenomena can be efficiently employed for creation of new, high-dimensional methods, which seem to be much simpler in high dimensions than the low-dimensional approaches. This is the ‘blessing of dimensionality’ Kainen (1997); Donoho (2000); Anderson et al. (2014); Gorban, Tyukin et al. (2016); Gorban and Tyukin (2018).

Classical theorems about concentration of measure state that random points in a high-dimensional data distribution are concentrated in a thin layer near an average or median level set of a Lipschitz function (for an introduction to this area we refer to Ledoux (2005)). The stochastic separation theorems Gorban, Romanenko et al. (2016); Gorban and Tyukin (2017)

revealed the fine structure of these thin layers: the random points are all linearly separable by simple Fisher’s discriminants from the rest of the set even for exponentially large random sets. Of course, the probability distribution should be ‘genuinely’ high-dimensional for all these concentration and separation theorems.

The correctors with higher abilities can be constructed on the basis of small neural networks with uncorrelated neurons Gorban, Romanenko et al. (2016), but even single-neuron correctors (Fisher's discriminants) can help to explain a wealth of empirical evidence related to in-vivo recordings of 'Grandmother' cells and 'concept' cells Gorban and Tyukin (2018); Tyukin, Gorban et al. (2018).

The theory and the concept have been tested in several case-studies of which the examples are provided in Gorban, Romanenko et al. (2016); Tyukin, Gorban et al. (2017). In these works, the underlying use-case was the problem of improving performance of legacy AI systems built for pedestrian detection in video streams. In Gorban, Romanenko et al. (2016)

we showed that spurious errors of a legacy Convolutional Neural Network with VGG-11 architecture could be learned away in a one-shot manner at a near-zero cost to existing skills of the original AI

PatentRomanenkoGorTyu . In Tyukin, Gorban et al. (2017) we demonstrated that A) the process can be fully automated to allow an AI system (teacher AI) to teach another (student AI), and B) that the errors can be learned away in packages. The approach, i.e. Fisher discriminants, stays largely the same, albeit in the latter case cascaded pairs introduced in Gorban, Romanenko et al. (2016)

were used instead of mere single hyperplanes.

1.3 The contents of this paper

In this paper we aim to present probabilistic foundations for correction of errors in AI systems in high dimensions. We develop a new series of stochastic separation theorems, prove some hypotheses formulated in our paper Gorban and Tyukin (2017), answer a general question about stochastic separation posed by Donoho and Tanner Donoho and Tanner (2009), and specify general families of probability distributions with the stochastic separation properties (the measures with the 'SMeared Absolute Continuity' property).

The first highly non-trivial question in the analysis of the curse and blessing of dimensionality is: what is the dimensionality of data? What does it mean that 'the dimension of the data is high enough'? Of course, this dimension does not coincide with the dimension of the data space and can be significantly lower. The appropriate definition of data dimension depends on the problem we need to solve. Here we would like to create a linear or even Fisher's classifier for separation of mistakes from the areas where the legacy AI system works properly. Therefore, the desirable evaluations of data dimension should characterise and account for the possibility of solving this problem with high accuracy and a low probability of mistakes. We will return to the evaluation of this probability in all sections of the paper.

In Sec. 2

we define the notions of linear separability and Fisher’s separability and present a basic geometric construction whose further development allows us to prove Fisher’s separability of high-dimensional data and to estimate the corresponding dimension of the data in the databases.

The standard assumption in machine learning is independence and identical distribution (i.i.d.) of data points

Vapnik (2000); Cucker and Smale (2002). On the other hand, in real operational conditions data are practically never i.i.d. Concept drift, non-identical distributions and various correlations with violation of independence are inevitable. In this paper we try to meet this non-i.i.d. challenge, at least partially. In particular, the geometric constructions introduced in Sec. 2 and Theorem 1 do not use the i.i.d. hypothesis. This is an important difference from our previous results, where we assumed i.i.d. data.

In Sec. 3 we find the general conditions for existence of linear correctors (not necessarily Fisher's discriminants). The essential idea is: the probability that a random data point will belong to a set with small volume (Lebesgue measure) should also be small (with a precise specification of what 'small' means). Such a condition is a modified or 'smeared' property of absolute continuity (which means that the probability of a set of zero volume is zero). This general stochastic separation theorem (Theorem 2) gives the answer to a closely related question asked in Donoho and Tanner (2009).

Existence of a linear classifier for corrector construction is a desirable property, and this classifier can be prepared using Support Vector Machine algorithms or Rosenblatt’s Perceptron learning rules. Nevertheless, Fisher’s linear discriminants seem to be more efficient because they are non-iterative and robust. In Sections 

4-6 we analyse Fisher's separability in high dimension. In particular, we prove stochastic separability theorems for log-concave distributions (Sec. 5) and for non-i.i.d. distributions of data points (Sec. 6).

Stochastic separability of real databases is tested in Sec. 7. A popular open access database is used. We calculate, in particular, the probability distribution of $p_y$, the probability that a randomly chosen data point $x$ cannot be separated by Fisher's discriminant from a data point $y$. The value $p_y$ depends on the random data point $y$ and is itself a random variable. The probability distribution of this variable characterises the separability of the data set. We evaluate the distribution and moments of $p_y$ and use them for the subsequent analysis of the data. The comparison of the mean and variance of $p_y$ with these parameters, calculated for equidistributions in an $n$-dimensional ball or sphere, gives us an idea of the real dimension of the data. There are many different approaches to the evaluation of data dimension. Each definition is needed for specific problems. Here we introduce and use a new approach to data dimension from the point of view of Fisher separability analysis.

1.4 Historical context

The curse of dimensionality is a well-known idea introduced by Bellman in 1957 Bellman1957 . He considered the problem of multidimensional optimisation and noticed that 'the effective analytic solution of a large number of even simple equations, for example, linear equations, is a difficult affair' and the determination of the maximum 'is quite definitely not routine when the number of variables is large'. He used the term 'curse' because it 'has hung over the head' for many years and 'there is no need to feel discouraged about the possibility of obtaining significant results despite it.' Many other effects were added to that idea over the decades, especially in data analysis Pestov2013 . The idea of 'blessing of dimensionality' was expressed much later Kainen (1997); Donoho (2000); Anderson et al. (2014); ChenEtAl2013 .

In 2009, Donoho and Tanner described the blessing of dimensionality effects as three surprises. The main one is the linear separability of a random point from a large finite random set with high probability Donoho and Tanner (2009). They proved this effect for high-dimensional Gaussian i.i.d. samples. In more general settings, this effect was supported by many numerical experiments. This separability was discussed as a very surprising property: 'For humans stuck all their lives in three-dimensional space, such a situation is hard to visualize.' These effects have deep connections with the backgrounds of statistical physics and modern geometric functional analysis Gorban and Tyukin (2018).

In 2016, we added surprise number four: this linear separation may be performed by a linear Fisher's discriminant Gorban, Tyukin et al. (2016). This result allows us to decrease significantly the complexity of the separation problem: the non-iterative (one-shot) approach avoids the solution of 'a large number of even simple' problems and provides one more step from Bellman's curse to the modern blessing of dimensionality. The first theorems were proved in simple settings: i.i.d. samples from the uniform distribution in a multidimensional ball Gorban, Tyukin et al. (2016). The next step was the extension of these theorems to i.i.d. samples from bounded product distributions Gorban and Tyukin (2017). The statements of these two theorems are cited below in Sec. 4 (for proofs we refer to Gorban and Tyukin (2017)). These results have been supported by various numerical experiments and applications. Nevertheless, the distributions of real-life data are far from being i.i.d. samples. They are not distributed uniformly in a ball. The hypothesis of a product distribution is also unrealistic despite its popularity (in data mining it is known as the 'naive Bayes' assumption).

We formulated several questions and hypotheses in Gorban and Tyukin (2017). First of all, we guessed that all essentially high-dimensional distributions have the same linear (Fisher's) separability properties as the equidistribution in a high-dimensional ball. The question was: how to characterise the class of these essentially high-dimensional distributions? We also proposed several specific hypotheses about the classes of distributions with the linear (Fisher's) separability property. The most important was the hypothesis about log-concave distributions. In this paper, we answer the question of characterisation of essentially high-dimensional distributions both for Fisher's separability (distributions with bounded support which satisfy estimate (4), Theorem 1) and for general linear separability (SmAC measures, Definition 4, Theorem 2; this result also answers the Donoho-Tanner question). The hypothesis about log-concave distributions is proved (Theorem 5). We try to avoid the i.i.d. hypothesis as far as possible (for example, in Theorem 1 and the special Sec. 6). Some of these results were announced in the preprint GorbanGrechukTyukin2018 .

Several important aspects of the problem of stochastic separation in machine learning remain outside the scope of this paper. First of all, we did not discuss the modern development of the stochastic separability theorems for separation by simple non-linear classifiers like small cascades of neurons with independent synaptic weights. Such separation could be orders of magnitude more effective than the linear separation and still uses non-iterative one-shot algorithms. It was introduced in Gorban, Tyukin et al. (2016). The problem of separation of sets of data points (not only a point from a set) is important for applications in machine learning and knowledge transfer between AI systems. The generalised stochastic separation theorems for data sets make it possible to organise knowledge transfer without iterations Tyukin, Gorban et al. (2017).

An alternative between essentially high-dimensional data with thin shell concentrations, stochastic separation theorems and efficient linear methods on the one hand, and essentially low-dimensional data with possibly efficient complex nonlinear methods on the other hand, was discussed in Gorban and Tyukin (2018). These two cases could be joined: first, we can extract the most interesting low-dimensional structure and then analyse the residuals as an essentially high-dimensional random set, which obeys stochastic separation theorems.

The trade-off between simple models (and their 'interpretability') and more complex non-linear black-box models (and their 'fidelity') was discussed by many authors. Ribeiro et al. proposed to study the trade-off between local linear and global nonlinear classifiers as a basic example Ribeiro2016 . Results on stochastic separation demonstrate that simple linear discriminants are good global classifiers for high-dimensional data, and complex relations between linear and nonlinear models can reflect the relationships between low-dimensional and high-dimensional components in the variability of the data.

Very recently (after this paper was submitted), a new approach for the analysis of classification reliability by the 'trust score' was proposed Jiang2018 . The 'trust score' is closely connected to a heuristic estimate of the dimension of the data cloud by identification of the 'high-density sets', the sets that have a high density of points nearby. The samples with low density nearby are filtered out. In the light of the stochastic separation theorems, this approach can be considered as the extraction of the essentially low-dimensional fraction: the points which are concentrated near a low-dimensional object (a low-dimensional manifold, for example). If this fraction is large and the effective dimension is small, then the low-dimensional methods can work efficiently. If the low-dimensional fraction is small (or the effective dimension is high), then the low-dimensional methods can fail, but in these situations we enjoy the blessing of dimensionality with Fisher's discriminants.

2 Linear discriminants in high dimension: preliminaries

Throughout the text, $\mathbb{R}^n$ is the $n$-dimensional linear real vector space. Unless stated otherwise, symbols $x, y, \ldots$ denote elements of $\mathbb{R}^n$, $(x, y)$ is the inner product of $x$ and $y$, and $\|x\| = \sqrt{(x,x)}$ is the standard Euclidean norm in $\mathbb{R}^n$. Symbol $\mathbb{B}_n$ stands for the unit ball in $\mathbb{R}^n$ centered at the origin: $\mathbb{B}_n = \{x \in \mathbb{R}^n : (x,x) \le 1\}$. $V_n$ is the $n$-dimensional Lebesgue measure, and $V_n(\mathbb{B}_n)$ is the volume of the unit ball. $\mathbb{S}^{n-1}$ is the unit sphere in $\mathbb{R}^n$. For a finite set $Y$, the number of points in $Y$ is $|Y|$.

Definition 1.

A point $x \in \mathbb{R}^n$ is linearly separable from a set $Y \subset \mathbb{R}^n$ if there exists a linear functional $l$ such that $l(x) > l(y)$ for all $y \in Y$.

Definition 2.

A set $S \subset \mathbb{R}^n$ is linearly separable or 1-convex Bárány and Füredi (1988) if for each $x \in S$ there exists a linear functional $l$ such that $l(x) > l(y)$ for all $y \in S$, $y \ne x$.

Recall that a point $x$ is an extreme point of a convex compact $K$ if there exist no points $y, z \in K$, $y \ne z$, such that $x = (y+z)/2$. The basic examples of linearly separable sets are the sets of extreme points of convex compacts: vertices of convex polyhedra or points on the $(n-1)$-dimensional unit sphere. Nevertheless, the set of extreme points of a compact may not be linearly separable, as is demonstrated by simple 2D examples Simon (2011).

If it is known that a point $x$ is linearly separable from a finite set $Y$, then the separating functional may be found by linear Support Vector Machine (SVM) algorithms, the Rosenblatt perceptron algorithm, or other methods for solving linear inequalities. These computations may be rather costly, and robustness of the result may not be guaranteed.

With regards to computational complexity, the worst-case estimate for SVM is $O(M^3)$ Bordes et al. (2005); Chapelle (2007), where $M$ is the number of elements in the dataset. In practice, however, the complexity of the soft-margin quadratic support vector machine problem is between $O(M^2)$ and $O(M^3)$, depending on parameters and the problem at hand Bordes et al. (2005). On the other hand, classical Fisher's discriminant requires $O(Mn^2)$ elementary operations to construct covariance matrices, followed by the $O(n^3)$ operations needed for the matrix inversion, where $n$ is the data dimension.

Fisher’s linear discriminant is computationally cheap (after the standard pre-processing), simple, and robust.

We use a convenient general scheme for creation of Fisher’s linear discriminants Tyukin, Gorban et al. (2017); Gorban and Tyukin (2018). For separation of single points from a data cloud it is necessary:

  1. Centralise the cloud (subtract the mean point from all data vectors).

  2. Escape strong multicollinearity, for example, by principal component analysis and deleting minor components, which correspond to the small eigenvalues of empirical covariance matrix.

  3. Perform whitening (or spherical transformation), that is, a linear transformation after which the covariance matrix becomes the identity matrix. In principal components, whitening is simply the normalisation of coordinates to unit variance.

  4. The linear inequality for separation of a point $x$ from the cloud $Y$ in the new coordinates is

    $(x, y) \le \alpha (x, x)$    (1)

    for all $y \in Y$, where $\alpha \in (0, 1)$ is a threshold.

In real-life problems, it could be difficult to perform the precise whitening but a rough approximation to this transformation could also create useful discriminants (1). We will call ‘Fisher’s discriminants’ all the discriminants created non-iteratively by inner products (1), with some extension of meaning.
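A minimal Python sketch of this four-step construction is given below (the variance cut-off and the threshold $\alpha = 0.8$ are illustrative assumptions, not values prescribed by the text): the data cloud is centred, minor principal components are dropped, coordinates are whitened, and inequality (1) then reduces to a single inner-product test.

import numpy as np


def whiten(data, var_cutoff=1e-6):
    """Steps 1-3: centre the cloud, drop minor components, normalise to unit variance."""
    centred = data - data.mean(axis=0)
    cov = np.cov(centred, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    keep = eigval > var_cutoff * eigval.max()          # escape strong multicollinearity
    proj = eigvec[:, keep] / np.sqrt(eigval[keep])     # principal components, unit variance
    return centred @ proj


def fisher_separates(x, cloud, alpha=0.8):
    """Step 4: check inequality (1), (x, y) <= alpha * (x, x), for all y in the cloud."""
    return bool(np.all(cloud @ x <= alpha * np.dot(x, x)))


rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 50))                     # toy data cloud
white = whiten(data)
print(fisher_separates(white[0], white[1:]))           # is point 0 separable from the rest?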

Definition 3.

A finite set $F \subset \mathbb{R}^n$ is Fisher-separable with threshold $\alpha$ ($0 < \alpha < 1$) if inequality (1) holds for all $x, y \in F$ such that $x \ne y$. The set $F$ is called Fisher-separable if there exists some $\alpha$ ($0 < \alpha < 1$) such that $F$ is Fisher-separable with threshold $\alpha$.

Fisher’s separability implies linear separability but not vice versa.

Inequality (1) holds for vectors $x, y$ if and only if $x$ does not belong to a ball (Fig. 3) given by the inequality:

$\left\| x - \frac{y}{2\alpha} \right\| < \frac{\|y\|}{2\alpha}.$    (2)

The volume of such balls can be relatively small.

For example, if $Y$ is a subset of $\mathbb{B}_n$, then the volume of each ball (2) does not exceed $\left(\frac{1}{2\alpha}\right)^n V_n(\mathbb{B}_n)$. Point $x$ is separable from a set $Y$ by Fisher's linear discriminant with threshold $\alpha$ if it does not belong to the union of these excluded balls. The volume of this union does not exceed $|Y| \left(\frac{1}{2\alpha}\right)^n V_n(\mathbb{B}_n)$.

Assume that $\alpha > 1/2$. If $|Y| < b^n$ with $1 < b < 2\alpha$, then the fraction of excluded volume in the unit ball decreases exponentially with dimension $n$ as $\left(\frac{b}{2\alpha}\right)^n$.
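As a rough numerical illustration of this bound (the particular numbers are chosen here for concreteness and are not taken from the text): with $\alpha = 0.9$ and $b = 1.5$ in dimension $n = 100$,

$\left(\frac{b}{2\alpha}\right)^n = \left(\frac{1.5}{1.8}\right)^{100} \approx 1.2 \times 10^{-8},$

so even a set of up to $|Y| < 1.5^{100} \approx 4 \times 10^{17}$ points excludes at most about a $10^{-8}$ fraction of the unit ball.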

Figure 3: Diameter of the filled ball (excluded volume) is the segment $[c, y/\alpha]$. Point $x$ should not belong to the excluded volume to be separable from $y$ by the linear discriminant (1) with threshold $\alpha$. Here, $c$ is the origin (the centre), and $L_x$ is the hyperplane such that $(x, z) = \alpha(x, x)$ for $z \in L_x$. A point $x$ should not belong to the union of such balls over all $y \in Y$ for separability from a set $Y$.
Proposition 1.

Let $1/2 < \alpha < 1$, $Y \subset \mathbb{B}_n$ be a finite set, $|Y| = M$, and $x$ be a randomly chosen point from the equidistribution in the unit ball. Then with probability $p > 1 - M\left(\frac{1}{2\alpha}\right)^n$ the point $x$ is Fisher-separable from $Y$ with threshold $\alpha$ (1).

This proposition is valid for any finite set $Y$ and without any hypotheses about its statistical nature. No underlying probability distribution is assumed. This is the main difference from (Gorban and Tyukin, 2017, Theorem 1).
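The following small Monte Carlo sketch illustrates Proposition 1 numerically (the dimension, set size, threshold and number of trials are arbitrary choices made here for the experiment): for large $n$, a random point from the unit ball is Fisher-separable from a fixed finite set with empirical frequency close to one, in agreement with the lower bound.

import numpy as np

rng = np.random.default_rng(1)


def sample_ball(num, n):
    """Uniform samples from the unit ball in R^n: random direction times radius^(1/n)."""
    directions = rng.normal(size=(num, n))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    return directions * (rng.random(num) ** (1.0 / n))[:, None]


n, M, alpha, trials = 100, 10000, 0.8, 500
Y = sample_ball(M, n)                        # an arbitrary finite set in the unit ball
X = sample_ball(trials, n)                   # random test points
separable = [bool(np.all(Y @ x <= alpha * x @ x)) for x in X]
print("empirical frequency:", np.mean(separable))
print("lower bound 1 - M*(2*alpha)**(-n):", 1 - M * (2 * alpha) ** (-n))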

Let us make several remarks for general distributions. The probability $p_y$ that a randomly chosen point $x$ is not separated from a given data point $y$ by the discriminant (1) (i.e. that the inequality (1) is false) is (Fig. 3)

$p_y = \mathbf{P}\left\{ \left\| x - \frac{y}{2\alpha} \right\| < \frac{\|y\|}{2\alpha} \right\},$    (3)

where $\mathbf{P}$ is the probability measure. We need to evaluate the probability of finding a random point outside the union of such excluded volumes. For example, for the equidistribution in a ball $\mathbb{B}_n$, $p_y \le \left(\frac{1}{2\alpha}\right)^n$, $\alpha \ge 1/2$, and arbitrary $y$ from the ball. The probability to select a point inside the union of $M$ 'excluded balls' is less than $M\left(\frac{1}{2\alpha}\right)^n$ for any allocation of $M$ points $y$ in $\mathbb{B}_n$.

Instead of the equidistribution in a ball $\mathbb{B}_n$, we can consider probability distributions with bounded density in a ball:

$\rho(x) < \frac{C}{r^n V_n(\mathbb{B}_n)},$    (4)

where $C > 0$ is an arbitrary constant, $V_n(\mathbb{B}_n)$ is the volume of the ball, and $1/(2\alpha) < r \le 1$. This inequality guarantees that the probability (3) for each ball with radius less than or equal to $1/(2\alpha)$ decays exponentially for $n \to \infty$. It should be stressed that the constant $C$ is arbitrary but must not depend on $n$ in the asymptotic analysis for large $n$.

According to condition (4), the density $\rho$ is allowed to deviate significantly from the uniform distribution, whose density is constant, $1/V_n(\mathbb{B}_n)$. These deviations may increase with $n$, but not faster than the geometric progression with the common ratio $1/r > 1$.

The separability properties of distributions that satisfy inequality (4) are similar to the separability for the equidistributions. The proof, through estimation of the probability of picking a point in the excluded volume, is straightforward.

Theorem 1.

Let $1/2 < \alpha < 1$, $1/(2\alpha) < r \le 1$, $Y \subset \mathbb{B}_n$ be a finite set, $|Y| = M$, and $x$ be a randomly chosen point from a distribution in the unit ball with the bounded probability density $\rho(x)$. Assume that $\rho$ satisfies inequality (4). Then with probability $p > 1 - MC\left(\frac{1}{2r\alpha}\right)^n$ the point $x$ is Fisher-separable from $Y$ with threshold $\alpha$ (1).

Proof.

The volume of the ball (2) does not exceed $\left(\frac{1}{2\alpha}\right)^n V_n(\mathbb{B}_n)$ for each $y \in Y$. By condition (4), the probability that the point $x$ belongs to such a ball does not exceed $C\left(\frac{1}{2r\alpha}\right)^n$.

The probability that $x$ belongs to the union of $M$ such balls does not exceed $MC\left(\frac{1}{2r\alpha}\right)^n$. For $M < b^n$ with $1 < b < 2r\alpha$, this probability is smaller than $C\left(\frac{b}{2r\alpha}\right)^n$ and decays exponentially with $n$. ∎

For many practically relevant cases, the number of points $M$ is large. We have to evaluate the sum of $p_y$ (3) over the $M$ points $y \in Y$. For most estimates, evaluation of the expectations and variances of $p_y$ (3) is sufficient due to the central limit theorem, if the points $y \in Y$ are independently chosen. They need not, however, be identically distributed. The only condition for Fisher's separability is that all the distributions satisfy inequality (4).

For some simple examples, the moments of distribution of (3) can be calculated explicitly. Select and consider uniformly distributed in the unit ball . Then, for a given , and for :

and for . This is the uniform distribution on the interval . .

For the equidistribution on the unit sphere $\mathbb{S}^{n-1}$ and a given threshold $\alpha$,

$p_y = \frac{A_{n-2}}{A_{n-1}} \int_\alpha^1 (1 - t^2)^{\frac{n-3}{2}}\, dt$    (5)

(it does not depend on $y$), where $A_{n-1} = \frac{2\pi^{n/2}}{\Gamma(n/2)}$ is the area (i.e. the $(n-1)$-dimensional volume) of the unit sphere $\mathbb{S}^{n-1}$ and $\Gamma$ is the gamma-function. Of course, $p_y = 0$ for $\alpha = 1$ for this distribution because the sphere is obviously linearly separable. But the threshold matters, and $p_y > 0$ for $\alpha < 1$.

For large ,

Therefore,

when .

The simple concentration arguments (‘ensemble equivalence’) allows us to approximate the integral (5) with the one in which . The relative error of this restriction is with exponentially small (in ) for a given . In the vicinity of

We apply the same concentration argument again to the simplified integral

and derive a closed form estimate: for a given and large ,

Therefore, for the equidistribution on the unit sphere ,

(6)

Here, $f \simeq g$ means (for strictly positive functions) that $f/g \to 1$ when $n \to \infty$. The probability $p_y$ is the same for all $y \in \mathbb{S}^{n-1}$. It is exponentially small for large $n$.

If the distributions are unknown and exist just in the form of samples of empirical points, then it is possible to evaluate the mean and variance of $p_y$ (3) (and the higher moments) from the sample directly, without knowledge of the theoretical probabilities. After that, it is possible to apply the central limit theorem and to evaluate the probability that a random point does not belong to the union of the balls (2) for independently chosen points $y$.

The 'efficient dimension' of a data set can be evaluated by comparing the mean value of $p_y$ for this data set to the value of this mean for the equidistributions on a ball, on a sphere, or for the Gaussian distribution. Comparison to the sphere is needed if data vectors are normalised to the unit length of each vector (a popular normalisation in some image analysis problems).
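The sketch below illustrates this comparison in Python (a toy dataset and Monte Carlo estimates are used here instead of formula (5); all sizes and the threshold are illustrative assumptions): the mean inseparability frequency of a whitened dataset is matched against the same quantity computed for sphere equidistributions of candidate dimensions.

import numpy as np

rng = np.random.default_rng(2)


def mean_inseparability(points, alpha=0.8):
    """Fraction of ordered pairs (x, y), x != y, that violate inequality (1)."""
    gram = points @ points.T
    viol = gram > alpha * np.diag(gram)[:, None]       # row i: tests for x = points[i]
    np.fill_diagonal(viol, False)
    m = len(points)
    return viol.sum() / (m * (m - 1))


def sphere_inseparability(n, alpha=0.8, samples=2000):
    """Monte Carlo estimate of the same quantity for the equidistribution on S^(n-1)."""
    pts = rng.normal(size=(samples, n))
    pts /= np.linalg.norm(pts, axis=1, keepdims=True)
    return mean_inseparability(pts, alpha)


data = rng.normal(size=(2000, 40))            # toy data; use a whitened real dataset in practice
target = mean_inseparability(data)
candidates = {n: sphere_inseparability(n) for n in (5, 10, 20, 40, 80)}
effective_dim = min(candidates, key=lambda n: abs(candidates[n] - target))
print("estimated effective dimension:", effective_dim)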

Theorem 1, the simple examples, and the previous results Gorban and Tyukin (2017) allow us to hypothesise that for essentially high-dimensional data it is possible to create correctors of AI systems using simple Fisher's discriminants for separation of areas with high risk of errors from the situations where the systems work properly. The most important question is: what are the 'essentially high-dimensional' data distributions? In the following sections we try to answer this question. The leading idea is: the sets of 'very small' volume should not have 'too high' probability. Specification of these 'very small' and 'too high' is the main task.

In particular, we find that the hypothesis about stochastic separability for general log-concave distribution proposed in Gorban and Tyukin (2017) is true. A general class of probability measures with linear stochastic separability is described (the SmAC distributions). This question was asked in 2009 Donoho and Tanner (2009). We demonstrate also how the traditional machine learning assumption about i.i.d. (independent identically distributed) data points can be significantly relaxed.

3 General stochastic separation theorem

Bárány and Füredi Bárány and Füredi (1988) studied properties of high-dimensional polytopes generated by the uniform distribution in the $n$-dimensional unit ball. They found that in the envelope of random points all of the points are on the boundary of their convex hull and none belong to the interior (with probability greater than , provided that , where is an arbitrary constant). They also showed that the bound on is nearly tight, up to a polynomial factor in $n$.

derived a similar result for i.i.d. points from the Gaussian distribution. They also mentioned that in applications it often seems that Gaussianity is not required and stated the problem of characterisation of ensembles leading to the same qualitative effects (‘phase transitions’), which are found for Gaussian polytopes.

Recently, we noticed that these results could indeed be proven for many other distributions, and that one more important (and surprising) property is also typical: every point in such an $M$-point random set can be separated from all other points of this set by the simplest linear Fisher discriminant Gorban, Tyukin et al. (2016); Gorban and Tyukin (2017). This observation allowed us Gorban, Romanenko et al. (2016) to create the corrector technology for legacy AI systems. We used the 'thin shell' measure concentration inequalities to prove these results Gorban and Tyukin (2017, 2018). Separation by linear Fisher's discriminant is the practically most important Surprise 4, in addition to the three surprises mentioned in Donoho and Tanner (2009).

The standard approach assumes that the random set consists of independent identically distributed (i.i.d.) random vectors. The new stochastic separation theorem presented below does not assume that the points are identically distributed. This can be very important: in real practice the new data points are not necessarily taken from the same distribution as the previous points. In that sense, the typical situation with a real data flow is far from an i.i.d. sample (we are grateful to G. Hinton for this important remark). This new theorem also gives an answer to the open problem of Donoho and Tanner (2009): it gives a general characterisation of the wide class of distributions with stochastic separation theorems (the SmAC condition below). Roughly speaking, this class consists of distributions without sharp peaks in sets with exponentially small volume (the precise formulation is below). We call this property "SMeared Absolute Continuity" (or SmAC for short) with respect to the Lebesgue measure: the absolute continuity means that sets of zero measure have zero probability, and the SmAC condition below requires that sets with exponentially small volume should not have high probability.

Consider a family of distributions, one for each pair of positive integers $M$ and $n$. The general SmAC condition is

Definition 4.

The joint distribution of

has SmAC property if there exist constants , , and , such that for every positive integer , any convex set such that

any index , and any points in , we have

(7)

We remark that

  • We do not require the SmAC condition to hold for all , just for some . However, the constants , , and should be independent of and .

  • We do not require that are independent. If they are, (7) simplifies to

  • We do not require that are identically distributed.

  • The unit ball in SmAC condition can be replaced by an arbitrary ball, due to rescaling.

  • We do not require the distribution to have bounded support: points are allowed to be outside the ball, but with exponentially small probability.

The following proposition establishes a sufficient condition for SmAC condition to hold.

Proposition 2.

Assume that are continuously distributed in with conditional density satisfying

(8)

for any , any index , and any points in , where and are some constants. Then SmAC condition holds with the same , any , and .

Proof.

If are independent with having density , (8) simplifies to

(9)

where and are some constants.

With , (9) implies that the SmAC condition holds for probability distributions whose density is bounded by a constant times the density of the uniform distribution in the unit ball. With arbitrary , (9) implies that the SmAC condition holds whenever the ratio grows at most exponentially in $n$. This condition is general enough to hold for many distributions of practical interest.

Example 1.

(Unit ball) If are i.i.d random points from the equidistribution in the unit ball, then (9) holds with .

Example 2.

(Randomly perturbed data) Fix a parameter (the random perturbation parameter). Let be the set of arbitrary (non-random) points inside the ball with radius in . They might be clustered in an arbitrary way, all belong to a subspace of very low dimension, etc. Let be a point selected uniformly at random from a ball with center and radius . We think of as a "perturbed" version of . In this model, (9) holds with , .

Example 3.

(Uniform distribution in a cube) Let be i.i.d random points from the equidistribution in the unit cube. Without loss of generality, we can scale the cube to have side length . Then (9) holds with .

Remark 1.

In this case,

where means Stirling’s approximation for gamma function .

Example 4.

(Product distribution in unit cube) Let be independent random points from the product distribution in the unit cube, with component of point having a continuous distribution with density . Assume that all are bounded from above by some absolute constant . Then (9) holds with (after appropriate scaling of the cube).

Below we prove the separation theorem for distributions satisfying the SmAC condition. The proof is based on the following result from Bárány and Füredi (1988a).

Proposition 3.

Let

where conv denotes the convex hull. Then

Proposition 3 implies that for every , there exists a constant , such that

(10)
Theorem 2.

Let be a set of i.i.d. random points in from distribution satisfying SmAC condition. Then is linearly separable with probability greater than , , provided that

where

Proof.

If , then , a contradiction. Let , and let . Then

where the second inequality follows from (7). Next,

For set

where we have used (10) and inequalities , . Then SmAC condition implies that

Hence,

and

where the last inequality follows from , . ∎

4 Stochastic separation by Fisher’s linear discriminant

According to the general stochastic separation theorems there exist linear functionals, which separate points in a random set (with high probability and under some conditions). Such a linear functional can be found by various iterative methods. This possibility is nice but the non-iterative learning is more beneficial for applications. It would be very desirable to have an explicit expression for separating functionals.

Theorem 1 and the two following theorems from Gorban and Tyukin (2017) demonstrate that Fisher's discriminants are powerful in high dimensions.

Theorem 3 (Equidistribution in $\mathbb{B}_n$; Gorban and Tyukin (2017)).

Let be a set of i.i.d. random points from the equidistribution in the unit ball $\mathbb{B}_n$. Let , and . Then

(11)
(12)
(13)

According to Theorem 3, the probability that a single element from the sample is linearly separated from the set by the hyperplane is at least

This probability estimate depends on both $M$ and the dimensionality $n$. An interesting consequence of the theorem is that if one picks a probability value, say , then the maximal possible values of for which the set remains linearly separable with probability no less than grow at least exponentially with $n$. In particular, the following holds

Inequalities (11), (12), and (13) are also closely related to Proposition 1.

Corollary 1.

Let be a set of i.i.d. random points from the equidistribution in the unit ball $\mathbb{B}_n$. Let , and . If

(14)

then If

(15)

then

In particular, if inequality (15) holds then the set is Fisher-separable with probability .

Note that (13) implies that elements of the set are pairwise almost or $\varepsilon$-orthogonal, i.e. for all , , with probability larger than or equal to . Similarly to Corollary 1, one can conclude that the cardinality of samples with such properties grows at least exponentially with $n$. The existence of the phenomenon has been demonstrated in Kainen and Kůrková (1993). Theorem 3, Eq. (13), shows that the phenomenon is typical in some sense (cf. Gorban, Tyukin et al. (2016a), Kůrková and Sanguineti (2017)).

The linear separability property of finite but exponentially large samples of random i.i.d. elements can be proved for various ‘essentially multidimensional’ distributions. It holds for equidistributions in ellipsoids as well as for the Gaussian distributions Gorban, Romanenko et al. (2016). It was generalized in Gorban and Tyukin (2017) to product distributions with bounded support. Let the coordinates of the vectors in the set be independent random variables , with expectations and variances . Let for all .

Theorem 4 (Product distribution in a cube Gorban and Tyukin (2017)).

Let be i.i.d. random points from the product distribution in a unit cube. Let

Assume that data are centralised and . Then

(16)
(17)

In particular, under the conditions of Theorem 4, set is Fisher-separable with probability , provided that , where and are some constants depending only on and .

Concentration inequalities in product spaces Talagrand (1995) were employed for the proof of Theorem 4.

We can see from Theorem 4 that the discriminant (1) works without precise whitening. Just the absence of strong degeneration is required: the support of the distribution is contained in the unit cube (that is, bounded from above) and, at the same time, the variance of each coordinate is bounded from below by .
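As a quick numerical illustration of this remark (the sample size, dimension and threshold are arbitrary choices, and the check is empirical rather than a proof), i.i.d. points from the uniform distribution in a high-dimensional cube are already pairwise Fisher-separable after centring alone, without whitening:

import numpy as np

rng = np.random.default_rng(3)
n, M, alpha = 200, 2000, 0.9
data = rng.random((M, n))
data -= data.mean(axis=0)                    # centralise the data; no whitening is performed
gram = data @ data.T
viol = gram > alpha * np.diag(gram)[:, None]
np.fill_diagonal(viol, False)
print("whole sample Fisher-separable:", not viol.any())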

Linear separability of single elements of a set from the rest by Fisher's discriminants is a simple inherent property of high-dimensional data sets. The stochastic separation theorems were further generalised to account for $k$-tuples, too Tyukin, Gorban et al. (2017); Gorban and Tyukin (2018).

Bound (4) is the special case of (9) with , and it is more restrictive: in Examples 2, 3, and 4, the distributions satisfy (9) with some , but fail to satisfy (4) if . Such distributions have the SmAC property and the corresponding set of points is linearly separable by Theorem 2, but a different technique is needed to establish their Fisher-separability. One option is to estimate the distribution of in (3). Another technique is based on concentration inequalities. For some distributions, one can prove that, with exponentially high probability, a random point satisfies

(18)

where and are some lower and upper bounds depending on . If is small compared to , it means that the distribution is concentrated in a thin shell between the two spheres. If and satisfy (18), inequality (1) may fail only if belongs to a ball with radius . If is much lower than , this method may provide a much better probability estimate than (3). This is how Theorem 4 was proved in Gorban and Tyukin (2017).
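A short numerical illustration of the 'thin shell' effect behind this argument (the dimensions and sample size are chosen arbitrarily): for standard Gaussian vectors, the ratio of the smallest to the largest observed norm approaches one as the dimension grows, so almost all of the mass lies between two nearby spheres.

import numpy as np

rng = np.random.default_rng(4)
for n in (10, 100, 1000, 5000):
    norms = np.linalg.norm(rng.normal(size=(2000, n)), axis=1)
    print(n, round(norms.min() / norms.max(), 3))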

5 Separation theorem for log-concave distributions

5.1 Log-concave distributions

Several possible generalisations of Theorems 3 and 4 were proposed in Gorban and Tyukin (2017). One of them is the hypothesis that a similar result can be formulated and proved for uniformly log-concave distributions. Below we demonstrate that this hypothesis is true, and formulate and prove stochastic separation theorems for several classes of log-concave distributions. Additionally, we prove the comparison (domination) Theorem 6 that allows us to extend the proven theorems to wider classes of distributions.

In this subsection, we introduce several classes of log-concave distributions and prove some useful properties of these distributions.

Let be a family of probability measures with densities . Below, is a random variable (r.v) with density , and is the expectation of .

We say that density (and the corresponding probability measure ):

  • is whitened, or isotropic, if , and

    (19)

    The last condition is equivalent to the fact that the covariance matrix of the components of is the identity matrix Lovász and Vempala (2007).

  • is log-concave, if set is convex and is a convex function on .

  • is strongly log-concave (SLC), if is strongly convex, that is, there exists a constant such that

    For example, density of

    -dimensional standard normal distribution is strongly log-concave with

    .

  • has sub-Gaussian decay for the norm (SGDN), if there exists a constant such that

    (20)

    In particular, (20) holds for with any . However, unlike SLC, (20) is an asymptotic property, and is not affected by local modifications of the underlying density. For example, density , where and has SGDN with any , but it is not strongly log-concave.

  • has sub-Gaussian decay in every direction (SGDD), if there exists a constant such that inequality

    (21)

    holds for every and .

  • is with constant , , if

    (22)

    holds for every and all .

Proposition 4.

Let be an isotropic log-concave density, and let . The following implications hold.

where the last denotes the class of isotropic log-concave densities which are , and this class actually coincides with the class of all isotropic log-concave densities.

Proof.

Proposition 3.1 from Bobkov and Ledoux (2000) states that if there exists such that satisfies

(23)

for all such that , then inequality

(24)

holds for every smooth function on . As remarked in (Bobkov and Ledoux, 2000, p. 1035), “it is actually enough that (23) holds for some ”. With , this implies that (24) holds for every strongly log-concave distribution, with . According to (Bobkov, 1999, Theorem 3.1), (24) holds for if and only if it has sub-Gaussian decay for the norm, and the implication follows.

According to Stavrakakis and Valettas (2013, Theorem 1(i)), if (24) holds for , then it is with constant , where is a universal constant, hence .

Lemma 2.4.4 from Brazitikos et al. (2014) implies that if log-concave is with constant then inequality

(25)

holds for all and all , with constant . Conversely, if (25) holds for all and all , then is with constant , where is a universal constant. For isotropic , (19) implies that (25) with simplifies to (21), and the equivalence follows.

The implications follow from (25).

Finally, according to (Brazitikos et al., 2014, Theorem 2.4.6), inequalities