# Augmented Artificial Intelligence

All artificial intelligence (AI) systems make errors. These errors are unexpected and often differ from typical human mistakes ("non-human" errors). AI errors should be corrected without damaging existing skills and, ideally, without direct human expertise. This talk presents an initial summary report of a project that takes a new and systematic approach to improving the intellectual effectiveness of an individual AI by means of communities of AIs. We combine some ideas of learning in heterogeneous multiagent systems with new and original mathematical approaches to non-iterative correction of errors in legacy AI systems.


## I Introduction

The history of neural networks research can be represented as a series of explosions, or waves, of inventions and expectations. This history assures us that the popular Gartner hype cycle for emerging technologies, presented by the solid curve in Fig. 1 (see, for example, [1]), should be supplemented by a new peak of expectation explosion (dashed line). Some expectations from the previous peak are realized and move to the “Plateau of Productivity”, but the majority of them jump to the next “Peak of Inflated Expectations”. This observation relates not only to neural technologies but perhaps to the majority of IT innovations. It is surprising to see how expectations from the previous peak reappear in the new wave, often without modification, just changing their human carriers.

Computers and networks have been expected to augment human intelligence [2]. In 1998, one of the authors was inspired by eight years of success in knowledge discovery by deep learning neural networks and by the transformation of their hidden knowledge into explicit knowledge in the form of “logically transparent networks” [3] by means of pruning, binarization and other simplification procedures [4, 5], and wrote: “I am sure that the neural network technology of knowledge discovery is a “point of growth”, which will remodel neuroinformatics, transform many areas of information technologies and create new approaches” [6]. Now it seems that this prediction will not be fulfilled: most customers do not care about gaining knowledge but prefer “one button solutions”, which exclude humans from the process as far as possible. This is not a new situation in history. New intellectual technologies increase the intellectual abilities of mankind, but not the knowledge of individual humans. Here, we can refer to Plato: “There is an old Egyptian tale of Theuth, the inventor of writing, showing his invention to the god Thamus, who told him that he would only spoil men’s memories and take away their understandings” [7]. The adequate model of future Artificial Intelligence (AI) usage should include large communities of AI systems. Knowledge should circulate and grow in these communities. Participation of humans in these processes should be minimized. In the course of this technical revolution, not the “Augmented human intellect” but a continuously augmenting AI will be created.

In this work, we propose a conceptual framework for augmenting AI in communities or “social networks” of AIs. To construct such social networks, we employ several ideas in addition to classical machine learning. The first of them is the separation of problem areas between small local (neural) experts, their competitive and collaborative learning, and conflict resolution. In 1991, the first two papers with this idea were published simultaneously [8, 9]. Techniques for the distribution of tasks between small local experts were developed. In our version of this technology [8] and in all our applied software [3, 10, 11, 12], the neural network answers were always complemented by an evaluation of the network's self-confidence. This self-confidence level is an important instrument for community learning.

The second idea is the blessing of dimensionality [13, 14, 15, 16, 17] and the AI correction method [21] based on stochastic separation theorems [20]. The “sparsity” of high-dimensional spaces and concentration of measure phenomena make some low-dimensional approaches impossible in high dimensions. This problem is widely known as the “curse of dimensionality”. Surprisingly, the same phenomena can be efficiently employed for the creation of new, high-dimensional methods, which seem to be much simpler than in low dimensions. This is the “blessing of dimensionality”.

The classical theorems about concentration of measure state that random points in a high-dimensional data distribution are concentrated in a thin layer near an average or median level set of a Lipschitz function (for an introduction to this area we refer to [18]). The newly discovered stochastic separation theorems [20] revealed the fine structure of these thin layers: the random points are all linearly separable from the rest of the set even for exponentially large random sets. Of course, the probability distribution should be ‘genuinely’ high-dimensional for all these concentration and separation theorems.

Linear separability of exponentially large random subsets in high dimension allows us to solve the problem of nondestructive correction of legacy AI systems: the linear classifiers in their simplest Fisher’s form can separate errors from correct responses with high probability [21]. It is possible to avoid the standard assumption about independence and identical distribution of data points (i.i.d.). The non-iterative and nondestructive correctors can be employed for skills transfer in communities of AI systems [22].

These two ideas are joined in a special organisational environment of community learning which is organized in several phases:

• Initial supervising learning, where a community of newborn experts assimilates the knowledge hidden in labeled tasks from a problem-book (the problem-book is a continuously growing and transforming collection of samples);

• Non-iterative learning of the community with self-labeling of real-life or additional training samples on the basis of separation of expertise between local experts, their continuous adaptation and mutual correction for the assimilation of gradual changes in reality;

• Interiorisation of the results of the self-supervising learning of the community in the internal skills of experts;

• Development and learning of a special network manager that evaluates the level of expertise of the local experts for a problem and distributes the incoming task flow between them;

• Using an “ultimate auditor” to assimilate qualitative changes in the environment and correct collective errors; it may be human inspection, feedback from real life, or another system of interference into the self-labeling process.

We describe the main constructions of this approach using the example of classification problems and simple linear correctors. Correctors with higher abilities can be constructed on the basis of small neural networks with uncorrelated neurons [21], but even single-neuron correctors (Fisher’s discriminants) can help to explain a wealth of empirical evidence related to in-vivo recordings of “Grandmother” cells and “concept” cells [17, 23]. We pay special attention to the mathematical background of the technology and prove a series of new stochastic separation theorems. In particular, we find that the hypothesis about stochastic separability for general log-concave distributions [22] is true, and we describe a general class of probability measures with linear stochastic separability, answering a question asked in 2009 by Donoho and Tanner [19].

## II Supervising stage: Problem Owners, Margins, Self-confidence, and Error functions

Consider binary classification problems. The neural experts, with arbitrary internal structure, have two outputs, out1 and out2, with the following interpretation: the sample belongs to class 1 if out1 > out2 and to class 2 if out1 < out2. For any given we can define the level of (self-)confidence in the classification answer as demonstrated in Fig. 2. The owner of a sample is an expert that gives the best (correct and most confident) answer for this sample. If we assume a single owner for every sample, then, when the community functions for problem solving, this single owner gives the final result (Fig. 3).
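The notion of an owner can be illustrated with a minimal sketch. The exact confidence function of Fig. 2 is not reproduced here, so the sketch assumes, purely for illustration, that confidence is the absolute gap between the two outputs; the names `Answer` and `find_owner` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    """Two raw outputs of one expert for one sample."""
    out1: float
    out2: float

    @property
    def label(self) -> int:
        # Class 1 if out1 > out2, class 2 otherwise.
        return 1 if self.out1 > self.out2 else 2

    @property
    def confidence(self) -> float:
        # ASSUMPTION: the output gap stands in for the confidence level of Fig. 2.
        return abs(self.out1 - self.out2)


def find_owner(answers, true_label):
    """Index of the correct expert with the highest confidence, or None."""
    correct = [i for i, a in enumerate(answers) if a.label == true_label]
    if not correct:
        return None
    return max(correct, key=lambda i: answers[i].confidence)


# Three hypothetical experts answering one sample whose true class is 1.
answers = [Answer(0.9, 0.1), Answer(0.4, 0.6), Answer(0.7, 0.2)]
owner = find_owner(answers, true_label=1)
```

If no expert answers correctly, the sketch returns `None`; in the framework such a sample would be a candidate for further training or correction.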

We aim to train the community of agents in such a way that they will give correct self-confident answers to the samples they own, and do not make large mistakes on all other examples they never met before. The desired histogram of answers is presented in Fig. 3.

Learning is the minimisation of an error functional, which is defined for any selected sample and any local expert. This error function should be different for owners and non-owners of the sample. If we assume that each sample has a single owner, then the error function presented in Fig. 5 can be used.

Voting of the most self-confident experts (Fig. 6) can make the decision more stable. This voting may be organised with weights of votes, which depend on the individual experts’ levels of confidence, or without weights, as simple voting. A modified error function is needed for a system with collective ownership, where each sample has several owners (Fig. 7). This function is constructed to provide proper answers from all owners.
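A minimal sketch of both voting variants follows. It assumes that each expert reports a pair (label, confidence) and that the `k` most confident experts vote; the function name `vote`, the parameter `k`, and the tie rule (ties go to class 2) are illustrative choices, not taken from Fig. 6.

```python
def vote(answers, k=3, weighted=True):
    """Decide between classes 1 and 2 by voting of the k most confident experts.

    answers: list of (label, confidence) pairs with label in {1, 2}.
    """
    # Keep only the k most self-confident experts.
    top = sorted(answers, key=lambda a: a[1], reverse=True)[:k]
    score = {1: 0.0, 2: 0.0}
    for label, confidence in top:
        # A weighted vote counts the confidence, a simple vote counts one.
        score[label] += confidence if weighted else 1.0
    return 1 if score[1] > score[2] else 2


votes = [(1, 0.9), (2, 0.3), (2, 0.2), (1, 0.1)]
weighted_decision = vote(votes, k=3, weighted=True)
simple_decision = vote(votes, k=3, weighted=False)
```

In this toy input the two rules disagree: one very confident expert outweighs two hesitant ones under weighted voting, while simple voting follows the majority.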

## III Self-learning Stage: Communities and Recommender Systems

After the stage of supervising learning, community of local experts can start working with new, previously unlabeled data. For a new example, the owners will be identified and the task will be solved by the owners following decision from Figs. 3, 6 or similar rules with distribution of responsibility between the most self-confident experts. After such labeling steps the learning cycles should follow with improvement of experts’ performance (they should give the correct self-confident answers to the samples they own, and do not make large mistakes for all other examples).

This regular alternation (solving new tasks, learning, solving new tasks, and so on) provides adaptation to gradual changes in reality and assimilation of growing data. It is not compulsory for all local experts to answer the same tasks. A sort of soft biclustering of experts and problems should be implemented to link a problem to potential experts and an expert to the tasks it can own. Selection of experts should be done with some excess to guarantee a sufficient number of selected skilled experts for a correct solution. Originally [8], a version of neural network associative memory was proposed to calculate the relative weight of an expert for the solution of a problem (we can call it the “affinity of an expert to a problem”). The well-developed technology of recommender systems [24] includes many solutions potentially usable for recommending local experts to problems and problems to local experts. Implementing a recommender system for the assignment of local experts to problems transforms the community of agents into a hierarchical “social network” with various nodes and groups.

## IV Correctors, Knowledge Transfer, and Interiorisation

Objectives of the community self-learning are:

• Assimilation of incrementally growing data;

• Non-iterative knowledge transfer from the locally best experts to other agents.

In the community self-learning process, the locally best experts (owners) find the label for each sample. After the labeling, the skills should be improved. Supervised learning of a large multiagent system requires large resources. It should not destroy the previous skills and, therefore, the large labeled database of previous tasks should be used. This can require large memory and many iterations involving all the local experts. It is desirable to correct the errors (or increase the level of confidence, if it is too low) without destroying previously learned skills. It is also very desirable to avoid the use of a large database and a long iterative process.

Communities of AI systems in the real world will work on the basis of heterogeneous networks of computational devices and in heterogeneous infrastructure. Real-time correction of mistakes in such heterogeneous systems by re-training is not always viable due to the resources involved. We can, therefore, formulate the technical requirements for the correction procedures [17]. A corrector should:

• be simple;

• not destroy the existing skills of the AI systems;

• allow fast non-iterative learning;

• allow correction of new mistakes without destroying previous corrections.

Surprisingly, all these requirements can be met in sufficiently high dimensions. For this purpose, we propose to employ the concept of a corrector of legacy AI systems, developed recently [16, 21] on the basis of stochastic separation theorems [20]. For high-dimensional distributions in $n$-dimensional space every point from a large finite set can be separated from all other points by a simple linear discriminant. The size of this finite set can grow exponentially with $n$. For example, for the equidistribution in an -dimensional ball, with probability every point in independently chosen random points is linearly separable from the set of all other points.

The idea of a corrector is simple. It corrects an error of a single local expert. Separate the sample with the error from all other samples by a simple discriminant. This discriminant splits the space of samples into two subsets: the sample with the error belongs to one of them, and all other samples belong to the “other half”. Modify the decision rule for the subset that includes the erroneous sample. This is a corrector of a legacy AI system (Fig. 8). Inputs for this corrector may include the input signals and any internal or output signal of the AI system.

One corrector can correct several errors (it is useful to cluster them before corrections). For correction of many errors, cascades of correctors are employed [17]: the AI system with the first corrector is a new legacy AI system and can be corrected further, as presented in Fig. 9. In this diagram, the original legacy AI system (shown as Legacy AI System 1) is supplied with a corrector altering its responses. The combined new AI system can in turn be augmented by another corrector, leading to a cascade of AI correctors.
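The corrector and the cascade can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are only centred (no whitening), the discriminant direction is the centred erroneous sample, and the class `Corrector`, the function `legacy`, the threshold `alpha = 0.8` and the toy data are all hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class Corrector:
    """Wraps a classifier and overrides its answer near one known error."""

    def __init__(self, base, reference, error_sample, corrected_label, alpha=0.8):
        n = len(error_sample)
        # Centre of the reference cloud (samples that were handled correctly).
        self.mean = [sum(p[i] for p in reference) / len(reference) for i in range(n)]
        # Discriminant direction: the centred erroneous sample.
        self.w = [e - m for e, m in zip(error_sample, self.mean)]
        self.threshold = alpha * dot(self.w, self.w)
        self.base = base
        self.corrected_label = corrected_label

    def fires(self, z):
        centred = [a - m for a, m in zip(z, self.mean)]
        return dot(centred, self.w) > self.threshold

    def __call__(self, z):
        # Outside the separated half-space the wrapped system is left untouched.
        return self.corrected_label if self.fires(z) else self.base(z)


def legacy(z):
    # Hypothetical legacy classifier.
    return 1 if z[0] > 0 else 2


reference = [(1, 0), (-1, 0), (0, 1), (0, -1)]
# First corrector: the legacy system wrongly puts (3, 3) in class 1.
stage1 = Corrector(legacy, reference, (3, 3), corrected_label=2)
# Second corrector wraps the already corrected system: a cascade.
stage2 = Corrector(stage1, reference, (-3, 3), corrected_label=1)
```

Because each corrector is itself callable, the corrected system can be passed as the `base` of the next corrector, which mirrors the cascade described above. On the reference samples the cascade returns the same answers as the legacy classifier.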

Fast knowledge transfer between AI systems can be organised using correctors [22]. The “teacher” AI labels the samples, and a “student” AI also attempts to label them. If their decisions coincide (with the desired level of confidence), then nothing happens. If they do not coincide (or the level of confidence of the student is too low), then a corrector is created for the student. From the technological point of view, it is more efficient to collect samples with the student’s errors, then cluster these samples and create correctors for the clusters rather than for individual mistakes. Moreover, new real-world samples are not necessarily needed in the knowledge transfer process. A large set of randomly generated (simulated) samples labeled by both the teacher AI and the student AI can be used for correction of the student AI, with skill transfer from the teacher AI.
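The first step of such a transfer, collecting the student's errors on simulated samples, might look like the sketch below. It assumes samples drawn uniformly from a cube and ignores confidence levels; `collect_disagreements`, `teacher` and `student` are hypothetical names, and the subsequent clustering and corrector construction are omitted.

```python
import random


def collect_disagreements(teacher, student, n_samples, dim, seed=0):
    """Label simulated samples with both systems and keep the student's errors."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_samples):
        x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        target = teacher(x)
        if student(x) != target:
            # Store the sample with the label the student should have given.
            errors.append((x, target))
    return errors


def teacher(x):
    # Hypothetical teacher AI.
    return 1 if x[0] + x[1] > 0 else 2


def student(x):
    # Hypothetical student AI that ignores the second coordinate.
    return 1 if x[0] > 0 else 2


errors = collect_disagreements(teacher, student, n_samples=1000, dim=5)
```

The collected pairs are the raw material for the correctors: they can be clustered and each cluster separated from the correctly handled samples.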

Correctors assimilate new knowledge in the course of the community self-learning process (Fig. 10). After a sufficiently large cascade of correctors has been collected, a local expert needs to assimilate this knowledge in its internal structure. The main reason for such interiorisation is to restore the regular, essentially high-dimensional structure of the distribution of preprocessed samples while preserving skills. This process can be iterative, but it is much simpler than the initial supervising learning. The local expert with the cascade of correctors becomes the teacher AI, and the same expert without correctors becomes the student AI (see Fig. 10). The available real dataset can be supplemented by randomly simulated samples and, after iterative learning, the skills of the teacher are transferred to the student (if the capacity of the student is sufficient). The student with updated skills returns to the community of local experts.

Two important subsystems are not present in Fig. 10: the manager-recommender and the ultimate auditor. The manager-recommender distributes tasks to local experts and local experts to tasks. It takes decisions on the basis of previous experience of problem solving and assigns experts to problems with an adequate surplus, for reliability, and with some stochastisation, for the training of various experts and for the extension of the experts’ pool.

In practice, the self-learning and self-labeling of samples performed by the selected local experts is supplemented by the labeling of samples and criticism of decisions by an ultimate auditor. First of all, this auditor is real practice itself: the real consequences of the decisions return to the system. Secondly, the ultimate audit may include inspection by a qualified human or by a special AI audit system with additional skills.

## V Mathematical foundations of non-destructive AI correction

### V-A General stochastic separation theorem

Bárány and Füredi [25] studied properties of high-dimensional polytopes deriving from the uniform distribution in the $n$-dimensional unit ball. They found that in the envelope of $M$ random points all of the points are on the boundary of their convex hull and none belong to the interior (with probability greater than $1 - c^2$, provided that $M \le c\,2^{n/2}$, where $c \in (0,1)$ is an arbitrary constant). They also showed that the bound on $M$ is nearly tight, up to a polynomial factor in $n$. Donoho and Tanner [19] derived a similar result for i.i.d. points from the Gaussian distribution. They also mentioned that in applications it often seems that Gaussianity is not required, and stated the problem of characterising the ensembles that lead to the same qualitative effects (‘phase transitions’) as are found for Gaussian polytopes.

Recently, we noticed that these results can indeed be proven for many other distributions, and that one more important (and surprising) property is also typical: every point in this $M$-point random set can be separated from all other points of the set by the simplest linear Fisher discriminant [16, 20]. This observation allowed us to create the corrector technology for legacy AI systems [21]. We used the ‘thin shell’ measure concentration inequalities to prove these results [20, 17]. Separation by a linear Fisher discriminant is, in practice, the most important one: Surprise 4, in addition to the three surprises mentioned in [19].

The standard approach assumes that the random set consists of independent identically distributed (i.i.d.) random vectors. The new stochastic separation theorem presented below does not assume that the points are identically distributed. This can be very important: in real practice, new data points are not necessarily taken from the same distribution as the previous points. In that sense the typical situation with a real data flow is far from an i.i.d. sample (we are grateful to G. Hinton for this important remark). This new theorem also gives an answer to the open problem from [19]: it gives a general characterisation of the wide class of distributions with stochastic separation theorems (the SmAC condition below). Roughly speaking, this class consists of distributions without sharp peaks in sets with exponentially small volume (the precise formulation is below). We call this property “Smeared Absolute Continuity” (or SmAC for short) with respect to the Lebesgue measure: absolute continuity means that sets of zero measure have zero probability, and the SmAC condition below requires that sets with exponentially small volume should not have high probability. Below, $B_n$ is the unit ball in $\mathbb{R}^n$ and $V_n$ denotes the $n$-dimensional Lebesgue measure.

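Consider a family of distributions, one for each pair of positive integers $M$ and $n$. The general SmAC condition is

###### Definition 1.

The joint distribution of $x_1, x_2, \dots, x_M$ has the SmAC property if there exist constants $A > 0$, $B \in (0,1)$, and $C > 0$, such that for every positive integer $n$, any convex set $S \subset \mathbb{R}^n$ such that

$$\frac{V_n(S)}{V_n(B_n)} \le A^n,$$

any index $i \in \{1, 2, \dots, M\}$, and any points $y_1, \dots, y_{i-1}, y_{i+1}, \dots, y_M$ in $\mathbb{R}^n$, we have

$$P(x_i \in B_n \setminus S \mid x_j = y_j, \forall j \ne i) \ge 1 - C B^n. \tag{1}$$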

We remark that

• We do not require the SmAC condition to hold for all $A < 1$, just for some $A > 0$. However, the constants $A$, $B$, and $C$ should be independent of $M$ and $n$.

• We do not require that the $x_i$ are independent. If they are, (1) simplifies to

$$P(x_i \in B_n \setminus S) \ge 1 - C B^n.$$

• We do not require that the $x_i$ are identically distributed.

• The unit ball $B_n$ in the SmAC condition can be replaced by an arbitrary ball, due to rescaling.

• We do not require the distribution to have bounded support: points are allowed to be outside the ball, but with exponentially small probability.

The following proposition establishes a sufficient condition for SmAC condition to hold.

###### Proposition 1.

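Assume that $x_1, \dots, x_M$ are continuously distributed in $B_n$ with conditional density satisfying

$$\rho_n(x_i \mid x_j = y_j, \forall j \ne i) \le \frac{C}{r^n V_n(B_n)} \tag{2}$$

for any $n$, any index $i$, and any points $y_1, \dots, y_{i-1}, y_{i+1}, \dots, y_M$ in $B_n$, where $C > 0$ and $r > 0$ are some constants. Then the SmAC condition holds with the same $C$, any $B \in (0,1)$, and $A = Br$.

###### Proof.

$$P(x_i \in S \mid x_j = y_j, \forall j \ne i) = \int_S \rho_n(x_i \mid x_j = y_j, \forall j \ne i)\, dV \le \int_S \frac{C}{r^n V_n(B_n)}\, dV = V_n(S) \frac{C}{r^n V_n(B_n)} \le A^n V_n(B_n) \frac{C}{r^n V_n(B_n)} = C B^n.$$

If $x_1, \dots, x_M$ are independent with $x_i$ having density $\rho_{i,n}$, (2) simplifies to

$$\rho_{i,n}(x) \le \frac{C}{r^n V_n(B_n)}, \quad \forall n,\ \forall i,\ \forall x \in B_n, \tag{3}$$

where $C > 0$ and $r > 0$ are some constants.

With $r = 1$, (3) implies that the SmAC condition holds for probability distributions whose density is bounded by a constant times the density $1/V_n(B_n)$ of the uniform distribution in the unit ball. With arbitrary $r > 0$, (3) implies that the SmAC condition holds whenever the ratio of $\rho_{i,n}$ to the uniform density grows at most exponentially in $n$. This condition is general enough to hold for many distributions of practical interest.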

###### Example 1.

(Unit ball) If $x_1, \dots, x_M$ are i.i.d. random points from the equidistribution in the unit ball, then (3) holds with $C = r = 1$.

###### Example 2.

(Randomly perturbed data) Fix a parameter $\varepsilon \in (0,1)$ (the random perturbation parameter). Let $y_1, \dots, y_M$ be a set of $M$ arbitrary (non-random) points inside the ball with radius $1 - \varepsilon$ in $\mathbb{R}^n$. They might be clustered in an arbitrary way, all belong to a subspace of very low dimension, etc. Let $x_i$ be a point selected uniformly at random from the ball with centre $y_i$ and radius $\varepsilon$. We think of $x_i$ as a “perturbed” version of $y_i$. In this model, (3) holds with $C = 1$, $r = \varepsilon$.
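The constants in Example 2 can be checked in one line. Writing $\varepsilon$ for the perturbation radius (notation assumed here) and reading the bound (3) as $\rho_{i,n}(x) \le C/(r^n V_n(B_n))$, the density of a point distributed uniformly in a ball of radius $\varepsilon$ is the reciprocal of that ball's volume:

$$\rho_{i,n}(x) = \frac{1}{V_n(\varepsilon B_n)} = \frac{1}{\varepsilon^n V_n(B_n)} \quad \text{inside the perturbation ball, and } 0 \text{ elsewhere},$$

so the bound is attained with $C = 1$ and $r = \varepsilon$. Since the centre lies within distance $1 - \varepsilon$ of the origin, the perturbation ball stays inside the unit ball, as the bound requires.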

###### Example 3.

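(Uniform distribution in a cube) Let $x_1, \dots, x_M$ be i.i.d. random points from the equidistribution in the unit cube. Without loss of generality, we can scale the cube to have side length $s = \sqrt{4/n}$. Then (3) holds with $r < \sqrt{2/(\pi e)}$.

###### Remark 1.

In this case,

$$V_n(B_n)\rho_{i,n}(x) = \frac{V_n(B_n)}{(\sqrt{4/n})^n} = \frac{\pi^{n/2}/\Gamma(n/2+1)}{(4/n)^{n/2}} < \frac{(\pi/4)^{n/2} n^{n/2}}{\Gamma(n/2)} \approx \frac{(\pi/4)^{n/2} n^{n/2}}{\sqrt{4\pi/n}\,(n/2e)^{n/2}} = \frac{\sqrt{n}}{2\sqrt{\pi}} \left(\sqrt{\frac{\pi e}{2}}\right)^n,$$

where $\approx$ means Stirling’s approximation for the gamma function $\Gamma$. The polynomial factor $\sqrt{n}$ is absorbed by taking $r$ strictly smaller than $\sqrt{2/(\pi e)}$.

###### Example 4.

(Product distribution in unit cube) Let $x_1, \dots, x_M$ be independent random points from the product distribution in the unit cube, with component $j$ of point $x_i$ having a continuous distribution with density $\rho_{i,j}$. Assume that all $\rho_{i,j}$ are bounded from above by some absolute constant $U$. Then (3) holds with $r < \frac{1}{U}\sqrt{\frac{2}{\pi e}}$ (after appropriate scaling of the cube).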

A finite set $F \subset \mathbb{R}^n$ is called linearly separable if the following equivalent conditions hold.

• For each $x \in F$ there exists a linear functional $l$ such that $l(x) > l(y)$ for all $y \in F$, $y \ne x$;

• Each $x \in F$ is an extreme point (vertex) of the convex hull of $F$.

Below we prove the separation theorem for distributions satisfying the SmAC condition. The proof is based on the following result, see [26].

###### Proposition 2.

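Let

$$V(n, M) = \frac{1}{V_n(B_n)} \max_{x_1, \dots, x_M \in B_n} V_n(\mathrm{conv}\{x_1, \dots, x_M\}),$$

where conv denotes the convex hull. Then

$$V(n, c^n)^{1/n} < (2e \log c)^{1/2} (1 + o(1)), \quad 1 < c < 1.05.$$

Proposition 2 implies that for every $c \in (1, 1.05)$, there exists a constant $N(c)$, such that

$$V(n, c^n) < (3\sqrt{\log c})^n, \quad n > N(c). \tag{4}$$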
###### Theorem 1.

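Let $\{x_1, \dots, x_M\}$ be a set of random points in $\mathbb{R}^n$ from a distribution satisfying the SmAC condition. Then $\{x_1, \dots, x_M\}$ is linearly separable with probability greater than $1 - \delta$, $\delta > 0$, provided that

$$M \le a b^n,$$

where

$$b = \min\{1.05,\ 1/B,\ \exp((A/3)^2)\}, \quad a = \min\{1,\ \delta/(2C),\ b^{-N(b)}\}.$$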
###### Proof.

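If $n < N(b)$, then $M \le a b^n \le b^{-N(b)} b^n < 1$, a contradiction. Let $n \ge N(b)$, and let $F = \{x_1, \dots, x_M\}$. Then

$$P(F \subset B_n) \ge 1 - \sum_{i=1}^{M} P(x_i \notin B_n) \ge 1 - \sum_{i=1}^{M} C B^n = 1 - M C B^n,$$

where the second inequality follows from (1). Next,

$$P(F \text{ is linearly separable} \mid F \subset B_n) \ge 1 - \sum_{i=1}^{M} P(x_i \in \mathrm{conv}(F \setminus \{x_i\}) \mid F \subset B_n).$$

For the set $S = \mathrm{conv}(F \setminus \{x_i\})$,

$$\frac{V_n(S)}{V_n(B_n)} \le V(n, M-1) \le V(n, b^n) < (3\sqrt{\log b})^n \le A^n,$$

where we have used (4) and the inequalities $a \le 1$, $b \le \exp((A/3)^2)$. Then the SmAC condition implies that

$$P(x_i \in \mathrm{conv}(F \setminus \{x_i\}) \mid F \subset B_n) = P(x_i \in S \mid F \subset B_n) \le C B^n.$$

Hence,

$$P(F \text{ is linearly separable} \mid F \subset B_n) \ge 1 - M C B^n,$$

and

$$P(F \text{ is linearly separable}) \ge (1 - M C B^n)^2 \ge 1 - 2 M C B^n \ge 1 - 2 a b^n C B^n \ge 1 - \delta,$$

where the last inequality follows from $a \le \delta/(2C)$, $b \le 1/B$. ∎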
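To see how the constants of Theorem 1 interact, the following sketch evaluates $b$, $a$ and the admissible number of points $a b^n$. The value of $N(b)$ is not given explicitly by Proposition 2, so it is passed in as a parameter `N_b`; the function names and the numerical inputs below are illustrative assumptions only.

```python
import math


def separation_constants(A, B, C, delta, N_b):
    """Constants a and b of Theorem 1; N_b is a supplied value of N(b)."""
    b = min(1.05, 1.0 / B, math.exp((A / 3.0) ** 2))
    a = min(1.0, delta / (2.0 * C), b ** (-N_b))
    return a, b


def max_points(n, A, B, C, delta, N_b):
    """Largest M allowed by the condition M <= a * b**n."""
    a, b = separation_constants(A, B, C, delta, N_b)
    return a * b ** n


# Hypothetical SmAC constants, a 1% failure probability and an assumed N(b) = 10.
a, b = separation_constants(A=0.5, B=0.5, C=1.0, delta=0.01, N_b=10)
```

With these inputs $b$ is set by the term $\exp((A/3)^2)$ and $a$ by $\delta/(2C)$. The base $b$ is only slightly above 1, so the guaranteed number of separable points becomes large only in very high dimension.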

### V-B Stochastic separation by Fisher’s linear discriminant

According to the general stochastic separation theorems, there exist linear functionals which separate points in a random set (with high probability and under some conditions). Such a linear functional can be found by various iterative methods, from the Rosenblatt perceptron learning rule to support vector machines. This existence is nice, but for applications we need non-iterative learning. It would be very desirable to have an explicit expression for the separating functionals.

There exists a general scheme for the creation of linear discriminants [22, 17]. For separation of single points from a data cloud it is necessary to:

1. Centralise the cloud (subtract the mean point from all data vectors).

2. Escape strong multicollinearity, for example, by principal component analysis and deleting minor components, which correspond to the small eigenvalues of the empirical covariance matrix.

3. Perform whitening (or spheric transformation), that is, a linear transformation after which the covariance matrix becomes the unit matrix. In principal components, whitening is simply the normalisation of coordinates to unit variance.

4. Apply the linear inequality for separation of a point $x$ from the cloud $Y$ in the new coordinates:

$$(x, y) \le \alpha (x, x) \quad \text{for all } y \in Y, \tag{5}$$

where $\alpha$ is a threshold, and $(\cdot, \cdot)$ is the standard Euclidean inner product in the new coordinates.

In real applied problems, it could be difficult to perform the precise whitening but a rough approximation to this transformation could also create useful discriminants (5). We will call ‘Fisher’s discriminants’ all the discriminants created non-iteratively by inner products (5), with some extension of meaning.
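The scheme can be sketched in a few lines. In the spirit of the “rough approximation” mentioned above, this sketch replaces full whitening by per-coordinate standardisation and skips the principal component step entirely; that simplification, the function names and the threshold `alpha = 0.8` are assumptions made for illustration.

```python
import math


def make_standardiser(cloud):
    """Centre the cloud and scale each coordinate to unit variance."""
    m, n = len(cloud), len(cloud[0])
    mean = [sum(p[i] for p in cloud) / m for i in range(n)]
    # Guard against constant coordinates (zero variance).
    std = [math.sqrt(sum((p[i] - mean[i]) ** 2 for p in cloud) / m) or 1.0
           for i in range(n)]
    return lambda p: [(p[i] - mean[i]) / std[i] for i in range(n)]


def is_separated(x, cloud, alpha=0.8):
    """Check inequality (5) for x against every y in the cloud, in new coordinates."""
    to_new = make_standardiser(cloud)
    tx = to_new(x)
    xx = sum(a * a for a in tx)
    return all(sum(a * b for a, b in zip(tx, to_new(y))) <= alpha * xx
               for y in cloud)


cloud = [(0, 0), (1, 0), (0, 1), (1, 1)]
far = is_separated((3, 3), cloud)
near = is_separated((0.6, 0.6), cloud)
```

A point far from the cloud passes the test, while a point close to the centre of the cloud does not. No iterations are involved: the discriminant is written down directly from the point and the cloud statistics.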

Formally, we say that a finite set $F \subset \mathbb{R}^n$ is Fisher-separable if

$$(x, x) > (x, y) \tag{6}$$

holds for all $x, y \in F$ such that $x \ne y$.
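Fisher-separability is easy to test numerically. The sketch below samples points from the equidistribution in the unit ball (a Gaussian direction with radius $U^{1/n}$, a standard sampling recipe) and checks inequality (6) for every ordered pair. The sample size, the two dimensions and the function names are illustrative choices.

```python
import math
import random


def sample_ball(m, n, rng):
    """Draw m points uniformly from the n-dimensional unit ball."""
    points = []
    for _ in range(m):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(a * a for a in g))
        radius = rng.random() ** (1.0 / n)
        points.append([radius * a / norm for a in g])
    return points


def is_fisher_separable(points):
    """True if (x, x) > (x, y) for all distinct x, y in the set."""
    for i, x in enumerate(points):
        xx = sum(a * a for a in x)
        for j, y in enumerate(points):
            if i != j and sum(a * b for a, b in zip(x, y)) >= xx:
                return False
    return True


rng = random.Random(0)
high = is_fisher_separable(sample_ball(200, 40, rng))
low = is_fisher_separable(sample_ball(200, 2, rng))
```

The contrast between a 40-dimensional and a 2-dimensional sample of the same size illustrates the blessing of dimensionality: the same 200 points that cannot be separated in the plane are separable in high dimension.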

The following two theorems demonstrate that Fisher’s discriminants are powerful in high dimensions.

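###### Theorem 2 (Equidistribution in $B_n$ [16, 20]).

Let $\{x_1, \dots, x_M\}$ be a set of $M$ i.i.d. random points from the equidistribution in the unit ball $B_n$. Let $0 < r < 1$, and $\rho = \sqrt{1 - r^2}$. Then

$$P\left(\|x_M\| > r \text{ and } \left(x_i, \frac{x_M}{\|x_M\|}\right) < r \text{ for all } i \ne M\right) \ge 1 - r^n - 0.5(M-1)\rho^n; \tag{7}$$

$$P\left(\|x_j\| > r \text{ and } \left(x_i, \frac{x_j}{\|x_j\|}\right) < r \text{ for all } i, j,\ i \ne j\right) \ge 1 - M r^n - 0.5 M(M-1)\rho^n; \tag{8}$$

$$P\left(\|x_j\| > r \text{ and } \left(\frac{x_i}{\|x_i\|}, \frac{x_j}{\|x_j\|}\right) < r \text{ for all } i, j,\ i \ne j\right) \ge 1 - M r^n - M(M-1)\rho^n. \tag{9}$$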

According to Theorem 2, the probability that a single element $x_M$ from the sample $\{x_1, \dots, x_M\}$ is linearly separated from the set of all other elements by the hyperplane $(x, x_M/\|x_M\|) = r$ is at least

$$1 - r^n - 0.5(M-1)(1 - r^2)^{n/2}.$$

This probability estimate depends on both $M$ and the dimensionality $n$. An interesting consequence of the theorem is that if one picks a probability value, say $1 - \vartheta$, then the maximal possible value of $M$ for which the set remains linearly separable with probability no less than $1 - \vartheta$ grows at least exponentially with $n$. In particular, the following holds.

###### Corollary 1.

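Let $\{x_1, \dots, x_M\}$ be a set of $M$ i.i.d. random points from the equidistribution in the unit ball $B_n$. Let $0 < r, \vartheta < 1$, and $\rho = \sqrt{1 - r^2}$. If

$$M < 2(\vartheta - r^n)/\rho^n, \tag{10}$$

then $P((x_i, x_M) < r\|x_M\| \text{ for all } i = 1, \dots, M-1) > 1 - \vartheta$. If

$$M < (r/\rho)^n \left(-1 + \sqrt{1 + 2\vartheta\rho^n / r^{2n}}\right), \tag{11}$$

then $P((x_i, x_j) < r\|x_j\| \text{ for all } i, j = 1, \dots, M,\ i \ne j) > 1 - \vartheta$.

In particular, if inequality (11) holds then the set $\{x_1, \dots, x_M\}$ is Fisher-separable with probability $p > 1 - \vartheta$.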
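The exponential growth promised by Corollary 1 can be made tangible by evaluating the right-hand sides of (10) and (11) and the single-point estimate quoted after Theorem 2. The function names and the sample values ($n = 100$ or $200$, $r = 0.8$, $\vartheta = 0.01$) are illustrative assumptions.

```python
import math


def separation_probability(n, m, r):
    """Lower bound 1 - r^n - 0.5 (M - 1) (1 - r^2)^(n/2) for one point."""
    return 1.0 - r ** n - 0.5 * (m - 1) * (1.0 - r * r) ** (n / 2.0)


def max_m_single(n, r, theta):
    """Right-hand side of (10)."""
    rho = math.sqrt(1.0 - r * r)
    return 2.0 * (theta - r ** n) / rho ** n


def max_m_all(n, r, theta):
    """Right-hand side of (11)."""
    rho = math.sqrt(1.0 - r * r)
    return (r / rho) ** n * (
        -1.0 + math.sqrt(1.0 + 2.0 * theta * rho ** n / r ** (2 * n))
    )


m_all_100 = max_m_all(100, 0.8, 0.01)
m_all_200 = max_m_all(200, 0.8, 0.01)
```

The bound depends on the free parameter $r$, so in practice one would scan $r \in (0, 1)$ and keep the largest admissible $M$; a single value of $r$ is used here only to keep the sketch short.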

Note that (9) implies that elements of the set are pair-wise almost or -orthogonal, i.e. for all , , with probability larger or equal than . Similar to Corollary 1, one can conclude that the cardinality of samples with such properties grows at least exponentially with . Existence of the phenomenon has been demonstrated in [27]. Theorem 2, Eq. (9), shows that the phenomenon is typical in some sense (cf. [28], [29]).

The linear separability property of finite but exponentially large samples of random i.i.d. elements is not restricted to equidistributions in $B_n$. As has been noted in [21], it holds for equidistributions in ellipsoids as well as for Gaussian distributions. Moreover, it can be generalized to product distributions in a unit cube. Consider, e.g., the case when the coordinates of the vectors $x = (X_1, \dots, X_n)$ in the set are independent random variables $X_i$, with expectations $\overline{X}_i$ and variances $\sigma_i^2 > \sigma_0^2 > 0$. Let $0 \le X_i \le 1$ for all $i = 1, \dots, n$. The following analogue of Theorem 2 can now be stated.

###### Theorem 3 (Product distribution in a cube [20]).

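Let $\{x_1, \dots, x_M\}$ be i.i.d. random points from the product distribution in a unit cube. Let

$$R_0^2 = \sum_i \sigma_i^2 \ge n\sigma_0^2.$$

Assume that data are centralised and $0 < \delta < 2/3$. Then

$$P\left(1 - \delta \le \frac{\|x_j\|^2}{R_0^2} \le 1 + \delta \text{ and } \frac{(x_i, x_M)}{R_0 \|x_M\|} < \sqrt{1 - \delta} \text{ for all } i, j,\ i \ne M\right) \ge 1 - 2M\exp\left(-2\delta^2 R_0^4 / n\right) - (M-1)\exp\left(-2R_0^4 (2 - 3\delta)^2 / n\right); \tag{12}$$

$$P\left(1 - \delta \le \frac{\|x_j\|^2}{R_0^2} \le 1 + \delta \text{ and } \frac{(x_i, x_j)}{R_0 \|x_j\|} < \sqrt{1 - \delta} \text{ for all } i, j,\ i \ne j\right) \ge 1 - 2M\exp\left(-2\delta^2 R_0^4 / n\right) - M(M-1)\exp\left(-2R_0^4 (2 - 3\delta)^2 / n\right). \tag{13}$$

In particular, under the conditions of Theorem 3, the set $\{x_1, \dots, x_M\}$ is Fisher-separable with probability $p > 1 - \vartheta$, provided that $M \le a b^n$, where $a > 0$ and $b > 1$ are some constants depending only on $\vartheta$ and $\sigma_0$.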

The proof of Theorem 3 is based on concentration inequalities in product spaces [30]. Numerous generalisations of Theorems 2, 3 are possible for different classes of distributions, for example, for weakly dependent variables, etc.

We can see from Theorem 3 that the discriminant (5) works without precise whitening. Only the absence of strong degeneration is required: the support of the distribution is contained in the unit cube (that is, it is bounded from above) and, at the same time, the variance of each coordinate is bounded from below by $\sigma_0^2 > 0$.

Linear separability, as an inherent property of data sets in high dimension, is not necessarily confined to cases whereby a linear functional separates a single element of a set from the rest. Theorems 2 and 3 can be generalized to account for $m$-tuples, too [22, 17].

Let us make several remarks for general distributions. For each data point $y$, the probability that a randomly chosen point $x$ is not separated from $y$ by the discriminant (6) (i.e. that the inequality (6) is false) is (Fig. 11)

$$p = p_y = \int_{\left\|z - \frac{y}{2}\right\| \le \frac{\|y\|}{2}} \rho(z)\, dz, \tag{14}$$

where $\rho(z)\,dz$ is the probability measure. We need to evaluate the probability of finding a random point outside the union of $N$ such excluded volumes. For example, for the equidistribution in the unit ball $B_n$, $p_y \le 1/2^n$ for all $y$ from the ball. The probability of selecting a point inside the union of $N$ ‘forbidden balls’ is at most $N/2^n$ for any position of the $N$ points $y$ in $B_n$. Inequalities (7), (8), and (9) are closely related to that fact.
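The integration domain in (14) can be checked directly. Writing $x$ for the randomly chosen point and $y$ for the data point (names assumed here for illustration), failure of (6) is the same as $x$ falling into the ball whose diameter is the segment from the origin to $y$:

$$(x, x) \le (x, y) \iff (x, x) - (x, y) + \tfrac{1}{4}(y, y) \le \tfrac{1}{4}(y, y) \iff \left\|x - \tfrac{y}{2}\right\|^2 \le \tfrac{\|y\|^2}{4}.$$

For the equidistribution in the unit ball, this excluded ball has radius $\|y\|/2 \le 1/2$, so its share of the total volume is at most $(1/2)^n$, which decays exponentially with the dimension.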

Instead of the equidistribution in the unit ball $B_n$, we can take probability distributions with bounded density in the ball,

$$\rho(y) < \frac{C}{r^n V_n(B_n)}, \tag{15}$$

where $C > 0$ is an arbitrary constant, $V_n(B_n)$ is the volume of the ball, and $r > 1/2$. This inequality guarantees that the probability of each ball with radius less than or equal to 1/2 decays exponentially for $n \to \infty$. It should be stressed that in asymptotic analysis for large $n$ the constant $C$ is arbitrary but does not depend on $n$.

For the bounded distributions (15) the separability by linear discriminants (5) is similar to separability for the equidistributions. The proof through the estimation of the probability to avoid the excluded volume is straightforward.

For practical purposes, the number of points is large. We have to evaluate the sum for points . For most estimates, evaluation of the expectations and variances of (14) will be sufficient, if the points are chosen independently.

If the distribution is unknown and exists just in the form of the sample of empirical points, we can evaluate and (14) from the sample directly, without knowledge of theoretical probabilities.
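A minimal sketch of such an empirical estimate, using only the Python standard library. The helper names (`dot`, `in_excluded_ball`, `empirical_p`, `uniform_ball_point`) and the test setting (equidistribution in the unit ball, $n=30$, 200 points) are illustrative assumptions, not part of the original text. The membership test relies on the identity $\|z-y/2\|\le\|y\|/2 \Leftrightarrow (z,z)\le(z,y)$, which follows by squaring both sides.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_excluded_ball(z, y):
    # ||z - y/2|| <= ||y||/2 is equivalent to (z, z) <= (z, y).
    return dot(z, z) <= dot(z, y)

def empirical_p(y, sample):
    # Fraction of the other sample points that fall into the
    # integration domain of (14) built on the point y.
    others = [z for z in sample if z is not y]
    return sum(in_excluded_ball(z, y) for z in others) / len(others)

def uniform_ball_point(n, rng):
    # Hypothetical test distribution: uniform in the unit ball
    # (Gaussian direction, radius U**(1/n)).
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = dot(g, g) ** 0.5
    r = rng.random() ** (1.0 / n)
    return [r * c / norm for c in g]

rng = random.Random(0)
n, M = 30, 200
sample = [uniform_ball_point(n, rng) for _ in range(M)]
p_values = [empirical_p(y, sample) for y in sample]
mean_p = sum(p_values) / M
max_p = max(p_values)
print("mean p_y:", mean_p, "max p_y:", max_p)
```

The same `empirical_p` function can be applied to any list of empirical points; only the sampler would change.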

Bound (15) is the special case of (3) with , and it is more restrictive: in Examples 2, 3, and 4, the distributions satisfy (3) with some , but fail to satisfy (15) if . Such distributions have the SmAC property and the corresponding set of points is linearly separable by Theorem 1, but a different technique is needed to establish its Fisher-separability. One option is to estimate the distribution of in (14). Another technique is based on concentration inequalities. For some distributions, one can prove that, with exponentially high probability, a random point satisfies

$$r_1(n) \le \|x\| \le r_2(n), \tag{16}$$

where denotes the Euclidean norm in , and and are some lower and upper bounds, depending on . If is small compared to , it means that the distribution is concentrated in a thin shell between the two spheres. If and satisfy (16), inequality (6) may fail only if belongs to a ball with radius . If is much lower than , this method may provide a much better probability estimate than (14). This is how Theorem 3 was proved in [20].
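The thin-shell picture behind (16) can be illustrated numerically. The sketch below assumes a standard normal distribution in dimension $n=1000$ (an illustrative choice, not taken from the text) and reports the smallest and largest norms in a sample, relative to $\sqrt{n}$. The names `r1_hat` and `r2_hat` are hypothetical empirical proxies, not the theoretical bounds $r_1(n)$, $r_2(n)$.

```python
import math
import random

def norm(x):
    return math.sqrt(sum(c * c for c in x))

rng = random.Random(1)
n, M = 1000, 200

# Norms of M independent standard normal vectors in R^n.
norms = [norm([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(M)]

r1_hat = min(norms)  # empirical proxy for a lower bound
r2_hat = max(norms)  # empirical proxy for an upper bound
rel_width = (r2_hat - r1_hat) / math.sqrt(n)

print("min norm / sqrt(n):", r1_hat / math.sqrt(n))
print("max norm / sqrt(n):", r2_hat / math.sqrt(n))
print("relative shell width:", rel_width)
```

In this setting the shell width is small compared to the shell radius, which is the regime where the concentration-based argument is expected to help.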

## VI Separation theorem for log-concave distributions

### VI-A Log-concave distributions

In [20] we proposed several possible generalisations of Theorems 2, 3. One of them is the hypothesis that for uniformly log-concave distributions a similar result can be formulated and proved. Below we demonstrate that this hypothesis is true, and formulate and prove stochastic separation theorems for several classes of log-concave distributions. Additionally, we prove the comparison (domination) Theorem 5, which allows us to extend the proven theorems to wider classes of distributions.

In this subsection, we introduce several classes of log-concave distributions and prove some useful properties of these distributions.

Let be a family of probability measures with densities . Below, is a random variable (r.v.) with density , and is the expectation of .

We say that density (and the corresponding probability measure ):

• is whitened, or isotropic, if , and

$$E_n\left[(x,\theta)^2\right] = 1 \quad \forall \theta \in S^{n-1}, \tag{17}$$

where is the unit sphere in , and is the standard Euclidean inner product in . The last condition is equivalent to the fact that the covariance matrix of the components of is the identity matrix, see [32].

• is log-concave, if set is convex and is a convex function on .

• is strongly log-concave (SLC), if is strongly convex, that is, there exists a constant such that

$$\frac{g(u)+g(v)}{2} - g\left(\frac{u+v}{2}\right) \ge c\,\|u-v\|^2, \quad \forall u,v \in D_n.$$

For example, the density of the $n$-dimensional standard normal distribution is strongly log-concave with .

• has sub-Gaussian decay for the norm (SGDN), if there exists a constant such that

$$E_n\left[\exp\left(\epsilon\|x\|^2\right)\right] < +\infty. \tag{18}$$

In particular, (18) holds for with any . However, unlike SLC, (18) is an asymptotic property, and is not affected by local modifications of the underlying density. For example, density , where and has SGDN with any , but it is not strongly log-concave.

• has sub-Gaussian decay in every direction (SGDD), if there exists a constant such that inequality

$$P_n\left[(x,\theta) \ge t\right] \le 2\exp\left(-\left(\frac{t}{B}\right)^2\right)$$

holds for every and .

• is with constant , , if

$$\left(E_n|(x,\theta)|^p\right)^{1/p} \le B_\alpha\, p^{1/\alpha} \left(E_n|(x,\theta)|^2\right)^{1/2} \tag{19}$$

holds for every for every and all .
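As a small numerical illustration of a moment condition of the form (19) with $\alpha=2$, consider the standard normal distribution, for which $(x,\theta)$ is a one-dimensional standard normal variable $g$ for every unit vector $\theta$, and $(E|g|^2)^{1/2}=1$. The sketch below uses the closed form $E|g|^p = 2^{p/2}\,\Gamma((p+1)/2)/\sqrt{\pi}$ and checks the ratio $(E|g|^p)^{1/p}/\sqrt{p}$ on a few values of $p$. The function name and the grid of $p$ values are illustrative assumptions; the check covers only the listed values, not all $p$.

```python
import math

def gaussian_abs_moment_norm(p):
    """(E|g|^p)^(1/p) for g ~ N(0, 1), computed in log scale to avoid overflow."""
    log_moment = (0.5 * p * math.log(2.0)
                  + math.lgamma(0.5 * (p + 1))
                  - 0.5 * math.log(math.pi))
    return math.exp(log_moment / p)

ps = [2, 3, 4, 8, 16, 64, 256]

# Ratio of the p-th moment norm to p**(1/alpha) with alpha = 2.
ratios = [gaussian_abs_moment_norm(p) / math.sqrt(p) for p in ps]

for p, r in zip(ps, ratios):
    print(p, round(r, 4))
```

On this grid every ratio stays below 1, so a constant equal to 1 is enough for the tested values of $p$.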

###### Proposition 3.

Let be an isotropic log-concave density, and let . The following implications hold.

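$$\rho_n \text{ is SLC} \Rightarrow \rho_n \text{ has SGDN} \Rightarrow \rho_n \text{ has SGDD} \Leftrightarrow \rho_n \text{ is } \psi_2 \Rightarrow \rho_n \text{ is } \psi_\alpha \Rightarrow \rho_n \text{ is } \psi_1 \Leftrightarrow \text{ALL},$$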

where the last $\Leftrightarrow$ means that the class of isotropic log-concave densities which are $\psi_1$ actually coincides with the class of all isotropic log-concave densities.

###### Proof.

Proposition 3.1 in [33] states that if there exists such that satisfies

$$t\,g(u) + s\,g(v) - g(tu+sv) \ge \frac{c_1 t s}{2}\,\|u-v\|^2, \quad \forall u,v \in D_n. \tag{20}$$

for all such that , then inequality

$$E_n\left[f^2(x)\log f^2(x)\right] - E_n\left[f^2(x)\right] E_n\left[\log f^2(x)\right] \le \frac{2}{c_1}\, E_n\left[\|\nabla f(x)\|^2\right] \tag{21}$$

holds for every smooth function on . As remarked in [33, p. 1035], “it is actually enough that (20) holds for some ”. With , this implies that (21) holds for every strongly log-concave distribution, with . By [34, Theorem 3.1], (21) holds for if and only if it has sub-Gaussian decay for the norm, and the implication follows. Also, by [37, Theorem 1(i)], if (21) holds for , then it is with constant , where is a universal constant, hence . The equivalence follows from (17) and [35, Lemma 2.2.4]. The implications follow from (19). Finally, [35, Theorem 2.4.6] implies that every log-concave density is with some universal constant. ∎

### VI-B Fisher-separability for log-concave distributions

Below we prove Fisher-separability for i.i.d. samples from isotropic log-concave distributions, using the technique based on concentration inequalities.

###### Theorem 4.

Let , and let be a family of probability measures with densities , which are with constant , independent of . Let be a set of i.i.d. random points from . Then there exist constants and , which depend only on and , such that, for any , inequality

$$(x_i, x_i) > (x_i, x_j)$$

holds with probability at least . Hence, for any , set is Fisher-separable with probability greater than , provided that

$$M \le \sqrt{\frac{2\delta}{a}}\,\exp\left(\frac{b}{2}\,n^{\alpha/2}\right). \tag{22}$$
###### Proof.

Let and be two points, selected independently at random from the distribution with density . [31, Theorem 1.1], (applied with , where is identity matrix) states that, for any , (16) holds with , , and with probability at least , where are constants depending only on . If (16) holds for and , inequality (6) may fail only if belongs to a ball with radius . Theorem 6.2 in [36], applied with , states that, for any , does not belong to a ball with any center and radius , with probability at least for some constants and . By selecting , and , we conclude that (6) holds with probability at least . This is greater than for some constants and . Hence, are Fisher-separable with probability greater than . This is greater than provided that satisfies (22). ∎

###### Corollary 2.

Let be a set of i.i.d. random points from an isotropic log-concave distribution in . Then set is Fisher-separable with probability greater than , , provided that

$$M \le a\,c^{\sqrt{n}},$$

where and are constants, depending only on .

###### Proof.

This follows from Theorem 4 with and the fact that all log-concave densities are with some universal constant, see Proposition 3. ∎
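The behaviour described by Corollary 2 can be explored with a small simulation. The sketch below assumes a product of one-dimensional Laplace distributions scaled to unit variance, which is one example of an isotropic log-concave distribution (an illustrative choice, not taken from the text). It counts the fraction of ordered pairs $(i,j)$, $i\ne j$, for which $(x_i,x_i)>(x_i,x_j)$ fails. All function names and the sample sizes are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def laplace_unit_variance(rng):
    # Difference of two i.i.d. Exp(rate) variables is Laplace with
    # variance 2 / rate**2; rate = sqrt(2) gives unit variance.
    rate = math.sqrt(2.0)
    return rng.expovariate(rate) - rng.expovariate(rate)

def failing_pair_fraction(points):
    # Fraction of ordered pairs (i, j), i != j, violating (x_i, x_i) > (x_i, x_j).
    M = len(points)
    fails = 0
    for i in range(M):
        xi = points[i]
        sq = dot(xi, xi)
        for j in range(M):
            if i != j and dot(xi, points[j]) >= sq:
                fails += 1
    return fails / (M * (M - 1))

rng = random.Random(2)
M = 60
fractions = {}
for n in (2, 20, 200):
    pts = [[laplace_unit_variance(rng) for _ in range(n)] for _ in range(M)]
    fractions[n] = failing_pair_fraction(pts)

print(fractions)
```

A set is Fisher-separable exactly when this fraction is zero, so the printed values give a rough view of how separability improves with the dimension in this particular example.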

We say that a family of probability measures has exponential Fisher separability if there exist constants and such that, for all , inequality (6) holds with probability at least , where and are i.i.d. vectors in selected with respect to . In this case, for any , i.i.d. vectors are Fisher-separable with probability at least provided that

$$M \le \sqrt{\frac{2\delta}{a}}\left(\frac{1}{\sqrt{b}}\right)^{n}.$$
###### Corollary 3.

Let be a family of isotropic log-concave probability measures which are all with the same constant . Then has exponential Fisher separability.

###### Proof.

This follows from Theorem 4 with . ∎

###### Corollary 4.

Let be a family of isotropic probability measures which are all strongly log-concave with the same constant . Then has exponential Fisher separability.

###### Proof.

The proof of Proposition 3 implies that are all with the same constant , where is a universal constant. The statement then follows from Corollary 3. ∎

###### Example 5.

Because standard normal distribution in is strongly log-concave with , Corollary 4 implies that the family of standard normal distributions has exponential Fisher separability.
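A quick empirical check in the spirit of Example 5, under illustrative assumptions: 100 points are drawn from the standard normal distribution in dimension 2 and in dimension 100, and the Fisher-separability condition $(x_i,x_i)>(x_i,x_j)$ for all $i\ne j$ is tested directly. The function names, dimensions, and sample size are hypothetical choices for this sketch, and a single random draw does not replace the probabilistic statement.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def is_fisher_separable(points):
    # True if (x_i, x_i) > (x_i, x_j) for every ordered pair i != j.
    for i, xi in enumerate(points):
        sq = dot(xi, xi)
        for j, xj in enumerate(points):
            if i != j and dot(xi, xj) >= sq:
                return False
    return True

def gaussian_sample(n, M, rng):
    # M i.i.d. standard normal vectors in R^n.
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(M)]

rng = random.Random(3)
low_dim = is_fisher_separable(gaussian_sample(2, 100, rng))
high_dim = is_fisher_separable(gaussian_sample(100, 100, rng))

print("n = 2:", low_dim)
print("n = 100:", high_dim)
```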

### VI-C Domination

We say that family dominates family if there exists a constant such that

$$P_n(S) \le C \cdot P'_n(S) \tag{23}$$

holds for all and all measurable subsets . In particular, if and have densities and , respectively, then (23) is equivalent to

$$\rho_n(x) \le C \cdot \rho'_n(x), \quad \forall x \in \mathbb{R}^n. \tag{24}$$
###### Theorem 5.

If family has exponential Fisher separability, and dominates , then has exponential Fisher separability.

###### Proof.

For every and , let be a point in with coordinates . Let be the product measure of with itself, that is, for every measurable set , denotes the probability that belongs to , where vectors and are i.i.d. vectors selected with respect to . Similarly, let be the product measure of with itself. Inequality (23) implies that

$$Q_n(S) \le C^2 \cdot Q'_n(S), \quad \forall S \subset \mathbb{R}^{2n}.$$

Let be the set of all such that . Because has exponential Fisher separability,