1 Introduction
Machine Learning (ML) exists in two forms: theoretical and applied. It appears that the two know very little about each other.
Theory can answer important questions about the amount of data sufficient to solve a problem (sample complexity). It even proposes a universal algorithm to solve learning problems ("empirical risk minimization") [11].
Applications avoid using this universal algorithm. The procedures used in applications are not derived from theory and are very different from each other. For example, Nearest Neighbors is a relatively "intuitive", DIY-type recipe, while SVM is formulated as a convoluted optimization problem.
Besides, the data used in applications would usually not be considered sufficient from the theoretical point of view.
Yet, applied ML works. Theory does not explain why and when the empirical algorithms work with the small amounts of available data, what is common between them, and why they are so different.
To answer these questions, I look at ML from the theory-of-knowledge point of view. I describe the philosophical context of ML as an Intelligent Learning (IL) process with a feedback loop. In particular, I argue that no part of this process can be considered induction.
This view from outside of ML itself allows me to propose a conjecture that ML problems can be seen as minimization of two general criteria defined here. I demonstrate that NN and linear SVM, as well as decision trees and Naive Bayes learners, corroborate this conjecture: they optimize these criteria, each with a unique set of parameters.
2 Formalization of the learning problem
What is ML? Most textbooks give a vague, poetic description or explain it through examples [11], [2], [8]. Here is one concise and rather formal description [3]:
Given a set of data, the task is to learn the relationship between the input $x$ and the output $y$ such that, given a new input $x^*$, the predicted output $f(x^*)$ is accurate. The pair $(x^*, y^*)$ is not in the data set but is assumed to be generated by the same unknown process. From a probabilistic modeling perspective, we are therefore concerned primarily with the conditional distribution $p(y^* \mid x^*)$.
This is how the problem is usually understood; however, I will consider only two-class classification here. I rewrite it as a definition of the problem with more convenient notations and terminology.
A case consists of a data point $x$ (its feature values) and a class label $y$. Denote by $X$ the metric space of the data points and by $Y$ the set of class labels. For a sequence $T$ of cases, denote by $T_X$ the data points of these cases and by $T_Y$ the class labels of these cases.
Prediction Problem.
Given: the training set $T$, a sequence of i.i.d. cases drawn from an unknown distribution on $X \times Y$.
To find: a function $f: X \to Y$ such that, for any new case $(x, y)$ drawn from the same distribution, $f(x) = y$.
It is convenient to call the cases in the training set "facts".
The machine which, given an input of the problem, produces a (potential) solution is called a learner. For the purposes of this work, a learner will be defined by its input and output, regardless of its inner workings.
The learner's output is called a "decision rule".
2.1 The Prediction Problem is ill-posed
It is an ill-posed problem because there is no way to derive what needs to be found from what is given in the Prediction Problem. Even if we found a solution, we would have no way of knowing it.
There are two major sources of uncertainty which make the problem ill-posed.
Probabilistic uncertainty: prediction of the class even at the data points of $T$ is, generally, impossible.
Consider a new case $(x, y)$ such that $x = x_i$ for some fact $(x_i, y_i) \in T$. If the distribution is not degenerate, it is possible that $y \neq y_i$.
So, the problem does not allow one to predict with any certainty the class even at the data points of the training sample.
Uncertainty of extrapolation: prediction of the class outside of $T_X$ is impossible.
For any data point not present in the training sample, the problem statement gives no clue about its class. There is no information about the relationship between the facts and cases with different data points.
The problem, as formulated above, does not contain enough information to be solved.
From the point of view of philosophical logic, the Prediction Problem is the problem of induction: generalizing single facts, or "basic statements" [10]. The rational impossibility of induction was convincingly demonstrated by Hume, and the problem was never fully resolved in philosophy. As in the famous swan example: if you have only ever seen white swans, it does not mean the next one will be white as well.
The impossibility of induction does not stop ML practitioners or theoreticians because, as I will show, they actually solve different problems. The theory and the applications found different paths around the absurdity of the formalization of ML as the Prediction Problem.
The next section shows how Statistical Learning reformulates the problem to prove the existence of a solution and to propose a universal algorithm to find it.
3 Statistical Learning Theory Approach
I discuss here the well-developed PAC learning theory [11]. The labels here are assumed to be binary.
The following criteria are used in the theory to evaluate hypotheses.
For a function $h$, its generalization loss with respect to the distribution $D$ is defined by the formula
$$L_D(h) = \Pr_{(x, y) \sim D}[h(x) \neq y].$$
The functional
$$L_T(h) = \frac{1}{|T|}\,\bigl|\{(x_i, y_i) \in T : h(x_i) \neq y_i\}\bigr|$$
is called the empirical loss, or empirical risk, of $h$.
PAC learning theory does not have an explicit definition of the PAC learning problem. Working backward from the results, one may conclude that "PAC learning" means solving the following problem:
PAC Learning Problem.
Given:
- a class of functions $H$ (called the hypothesis class);
- a sequence $T$ of i.i.d. cases from an unknown distribution $D$;
- thresholds $\varepsilon, \delta \in (0, 1)$.
To find:
a function $h \in H$ which "probably" has "approximately" the lowest generalization error in the class $H$ for the distribution $D$: with probability at least $1 - \delta$,
$$L_D(h) \leq \min_{h' \in H} L_D(h') + \varepsilon.$$
On top of this, the theory sets out to solve the following metaproblem.
PAC Learning Metaproblem.
Given:
- a class of functions $H$;
- the thresholds $\varepsilon, \delta$;
- a learner for the PAC learning problem.
To find:
the size of the training set $T$ sufficient for a solution of the PAC learning problem with these thresholds and using this learner, regardless of the underlying distribution $D$.
The PAC learning view of ML differs from the Prediction Problem in important aspects:
- It assumes the hypothesis class is given.
- It does not try to guess the next label or each label. It does not even try to guess the most likely labels most of the time. Instead, it tries to find a function in $H$ which has almost as few (or as many) errors on the distribution as the best function in the class $H$.
- The metaproblem assumes that the learner is chosen upfront as well.
- Unlike the Prediction Problem, the metaproblem assumes that the size of the training set is not fixed, and one may increase it as needed. Otherwise, why would we want to know the sufficient size of the training set?
Essentially, PAC learning assumes that, besides the distribution, the only thing we do not know is the training set. This is exactly the opposite of the Prediction Problem. While the Prediction Problem requires building the learning machine to work with the available data, PAC learning's objective is to determine the need for data based on the selected learning machine and class of functions.
The PAC Learning Problem deals with one learner: Empirical Risk Minimization (ERM). This learner finds the function in the given class which minimizes the empirical risk. It turns out this learner is entirely sufficient for PAC learning.
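The ERM learner can be sketched in a few lines. This is a minimal illustration of my own, not taken from the paper: the hypothesis class (threshold rules on the real line) and the toy training set are invented for the example.

```python
def erm(hypotheses, training_set):
    """Empirical Risk Minimization: return the hypothesis with the
    smallest fraction of errors on the training set."""
    def empirical_risk(h):
        return sum(1 for x, y in training_set if h(x) != y) / len(training_set)
    return min(hypotheses, key=empirical_risk)

# A toy hypothesis class: threshold classifiers on the real line.
hypotheses = [lambda x, t=t: 1 if x >= t else 0
              for t in (0.0, 0.25, 0.5, 0.75, 1.0)]

# A toy training set with one noisy fact, (0.4, 1).
training_set = [(0.1, 0), (0.3, 0), (0.6, 1), (0.9, 1), (0.4, 1)]

rule = erm(hypotheses, training_set)
print(rule(0.7))  # the selected decision rule applied to a new data point
```

The learner is defined purely by its input (a hypothesis class and a training set) and its output (the minimizing function), matching the paper's black-box view of learners.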
3.1 PAC learning resolves the issues of the Prediction Problem
The Fundamental Theorem of Statistical Learning (FTSL) states that, regardless of the distribution $D$, there is a number $m_H(\varepsilon, \delta)$, depending on the VC-dimension of the class $H$ and on $\varepsilon, \delta$, such that a function with the minimal empirical risk solves the PAC problem with these parameters if $|T| \geq m_H(\varepsilon, \delta)$.
In other words, the FTSL states that using ERM on the training sample solves the PAC Learning Problem, provided the training sample is large enough for the given class of functions and the quality thresholds.
PAC learning achieves the ambitious goal of finding a solution for every ML problem, as the theory understands it, regardless of the underlying distribution.
It resolves all the issues plaguing the "predictive learning" formalization of ML:
The "probabilistic uncertainty" becomes irrelevant with the assumption of unlimited available data. If we can have unlimited data, we can estimate the probabilities as accurately as we need.
The issue of "extrapolation uncertainty" is resolved by fixing the class of functions and approximating the best function in the class, rather than the actual distribution.
It is interesting that, from the philosophical point of view, the PAC learning problem is a problem of deduction rather than induction, as the Prediction Problem is. PAC learning goes from general knowledge (the given class of functions and the learner) and the single facts in the training set to specific knowledge (a function from the class with good enough parameters).
3.2 PAC learning is not an answer for applications
The theory not only states that a sufficient size of the training sample exists, it gives an upper bound for it. There are estimates for this sufficient sample size. Even for reasonable accuracy and confidence thresholds and not more than 10 features, the required number of cases in the sample runs into the millions. To appreciate how big this is, let us look at what counts as big data in applications.
The Stanford University skin cancer study [7] is considered a triumph of big data analysis. The authors proudly say in the abstract:
Here we demonstrate classification of skin lesions using a single CNN, trained end-to-end from images directly, using only pixels and disease labels as inputs. We train a CNN using a dataset of 129,450 clinical images — two orders of magnitude larger than previous datasets — consisting of 2,032 different diseases.
The feature vectors in the study represent pixels of images. There have to be far more than 10 pixels, so the VC-dimension of the solution has to be much higher than 10. This means the training set is many orders of magnitude smaller than what is considered sufficient by PAC learning. Nevertheless, the solution found in this study is considered satisfactory [7]:
The CNN achieves performance on par with all tested experts … demonstrating an artificial intelligence capable of classifying skin cancer with a level of competence comparable to dermatologists.
By health care industry standards, the data are already huge. There are not millions of cases of skin cancer in the whole world to study.
Let us notice that, from the business point of view, ML with a theoretically sufficient training set is usually either impossible or counterproductive.
- In many problems, we need to find a dependence existing only in a limited space and time, so it is impossible to wait a long time until more data become available. Often, the general population is finite and relatively small, as in the skin cancer example.
- If obtaining the information about the cases is not free, accumulating a data set of the recommended size would require unlimited resources.
- But even if the general population is sufficient, and the company can afford to gather the amount of facts required by the FTSL, a training sample this large already describes the underlying distribution in very fine detail, making ML pointless. People do ML exactly because it allows one to find important answers without waiting many years for the class labels of millions of cases to become known.
Theory texts usually do not stress the expectations for the size of the training set in statistical learning theory. A rare example where the reliance on an indefinitely increasing sample is stated explicitly with some justification is [6]. Here is how they formalize the learning problem they solve: given an infinite sequence of increasing samples $T_1 \subset T_2 \subset \dots$ from an unknown distribution $D$ on $X \times Y$, to output the sequence of corresponding hypotheses $h_1, h_2, \dots$ such that
$$L_D(h_n) \xrightarrow[n \to \infty]{} L_D(h^*),$$
where $h^*$ is the Bayes classifier for the distribution $D$.
The learner which generates a sequence of hypotheses with this property regardless of the distribution is called "consistent". Thus, their formalization of learning itself is based on the idea that the training sample grows indefinitely.
The authors give three reasons for their formalization. Here are their arguments, with my counterarguments.
- PL: "Consistent rule guarantees us that taking more samples essentially suffices to roughly reconstruct the unknown distribution of (X, Y)."
MS: Consistency only guarantees that taking more and more samples (cases) indefinitely leads to good results.
- PL: "Without this guarantee, we would not be motivated to take more samples."
MS: Consistency can only motivate one to increase the training data to infinity, which is impossible. Regardless, one rarely needs encouragement to get more data when possible.
- PL: "We should be careful and not impose conditions on (X, Y) for the consistency of a rule, because such conditions may not be verifiable."
MS: The conditions on the distribution may be unverifiable, but the fact that the training set does not go to infinity is indisputable. This means the consistency of the learner is of no practical value for applied problems.
Vapnik formulated the reason for the existence of the asymptotic theory of learning in the most direct way [12]:
Why do we need an asymptotic theory if the goal is to construct algorithms from a limited number of observations? The answer is as follows: To construct any theory one has to use some concepts in terms of which the theory is developed.
In other words, one has to work with the tools one has, whether they solve the problem at hand or not. Statistics has the laws of large numbers, and this is what one uses in statistical learning theory. Perhaps the lack of a suitable apparatus to understand the true problem with finite data is the root of this huge divergence between the theory and the applications.
Not following in Vapnik's steps, I want to propose a formalization of the ML problem closer to the practical situation. I start by placing it within the context of the Intelligent Learning Cycle.
4 Intelligent Learning Cycle
The idea mostly follows Hegel, who was especially interested in the logic of notions, the development of thought. Hegel's theory of thought is described by him in a nutshell (in [9], Logic, §83):
In its immediacy, the notion is implicit and in its germ;
Its reflection and mediation, the being-for-itself and show of the notion;
Its return on itself, and its development abiding by itself: the notion in and for itself.
Hegel realized that some notions in some people’s heads may not agree with objective reality exactly. Yet, he did not suggest a path to resolve the issue in his logic.
To close this gap, I add a fourth stage (which Hegel would hate). This fourth stage is empirical testing, when the hypothesis at the pinnacle of its development meets the objective reality it is supposed to reflect, to assess a potential mismatch. With empirical testing, the linear development of the notion is replaced with a reflective loop.
This empirical testing stage follows, with some caveats, Popper's idea of "falsifiability" [10] as a characteristic of a scientific theory. As Popper noticed, empirical testing cannot confirm a theory. Assuming the randomness of reality, finite testing cannot falsify a hypothesis either. Rather, it improves our estimate of the certainty of the hypothesis. For example, if we thought that most swans are white, and the next things we see are 10 black swans, it does not mean that our hypothesis is false. It just means that there ought to be some concerns about its validity.
Popper explained the need for empirical testing from the point of view of practical applications. And this is how it is usually understood. For example, one cannot publish a paper on an application of ML without the results of testing the decision rules on new data.
Similarly to Hegel's logic, there is no concept of empirical testing in statistical learning theory. The theory considers only the inner consistency of the hypothesis, minimizing the error on the training set, which itself was created as a step in the hypothesis development. Inner agreement is what is guaranteed by statistical learning theory. PAC learning promises us some possibility of prediction only if we follow its advice about the sample size.
The linear part of the hypothesis development proceeds by deduction, from the most abstract and general to the most specific. Yet, making the decision about the hypothesis quality and further actions is not a deduction. This stage can be understood as a dialogue between the hypothesis and reality, as well as a dialogue of the hypothesis with itself, including possible critical evaluation and correction of the hypothesis in this dialogue.
Intelligent Learning (IL).
1. Birth of the hypothesis, or original insight:
- an understanding that an essential property of objects of interest needs to be predicted;
- an expectation of the relevant qualities which can be observed;
- an anticipation of a dependence between the essence and the observed qualities;
- an assumption that the dependence is "learnable".
2. Mediation of the hypothesis:
- mediation through the data:
  - refinement of the idea of what to observe, feature engineering;
  - gathering and curating the training sample to represent the whole distribution;
- mediation through selecting a class of functions;
- mediation through the criteria of fit between a decision rule and the facts.
3. Self-realization of the hypothesis, or ML:
- producing the decision rule optimizing the selected criteria of fit between a function from the hypothesis class and the training sample;
- if the inner consistency is not satisfactory, the process has to return to the previous stages.
4. Empirical testing of the hypothesis on new data.
This step includes testing the decision rule on new cases. If the degree of corroboration of the hypothesis is not satisfactory, the process of hypothesis development has to go back to the previous stages, to improve the hypothesis.
The purposes of the ML step in IL are:
- to maximize the inner consistency of the hypothesis, reconciling its mediations;
- to decide if the hypothesis is satisfactory;
- to make the hypothesis explicit and applicable for making predictions.
In other words, ML is used to prepare the hypothesis to face reality.
4.1 Deduction, not induction
Hegel [9], of course, saw the evolution of notions as a deductive process going from the most general and implicit to the most specific and explicit. Hume was very convincing in proving the illogicality of induction. Popper agreed that, contrary to popular belief, the development of scientific thought is not induction but deduction [10]. Some of Popper's ideas turned out to be very influential. However, this idea that learning shall be considered deduction was not influential among ML scientists.
For example, Vapnik [12] tried to find principles of induction to build correct decision rules, which would be the equivalent of finding the philosopher's stone. ML as induction is even popularized in a very funny Udacity video [1].
I want to stress that no part of IL can be considered inductive reasoning.
Some may argue that the original insight comes from observations and experience. First of all, insight is outside of logic, because it is more of an anticipation, a hope, rather than a statement. As such, it does not need a justification.
Second, insight cannot appear as a result of observations, because without this insight in the first place, one would not know what to look for, and so one would observe everything and nothing in particular.
When it concerns the insight leading to a practical application of ML, the sources usually are (a) a desire to predict something important, (b) previously accumulated general knowledge, and (c) a belief in the intelligibility of our world.
The mediation step is the step of refinement and specification of (a) the type of dependence anticipated in the insight, (b) the formalization of the features to observe, and (c) the preparation of the data set. All these processes are parts of the inner development and specification of the original insight.
Consider the ML step now. A learner is a mapping from the Boolean algebra of all finite subsets of cases into the set of decision functions. It may be considered as a collection of implications of the form "if the training set is $T$, then the decision rule is $f_T$". Given a training set $T$, a learner (a) picks the one implication which corresponds to this training set, and (b) outputs the consequent of this implication, $f_T$. The procedure (b) uses modus ponens, which is a formal deductive method.
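The view of a learner as a collection of implications can be made concrete. In the toy sketch below (my own illustration, not a learner from the paper), the implications are realized as a lookup: given a training set, the learner deduces the corresponding decision rule.

```python
def memorizing_learner(training_set):
    """A toy learner: its implicit implication is 'if the training set
    is T, then the decision rule repeats the labels seen in T'.
    Producing the rule is a deductive step (modus ponens), not induction."""
    table = dict(training_set)        # the consequent of the implication
    return lambda x: table.get(x)     # the decision rule f_T

f = memorizing_learner([(1, "a"), (2, "b")])
print(f(2))  # the rule deduced for this particular training set
```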
Of course, the same can be said of the empirical testing step. At this step, a general decision rule is applied to the data point of a new case to obtain a specific class label at this particular data point.
IL explains the "miracle" of success in applied ML despite the fact that the Prediction Problem cannot be solved and induction cannot be justified logically. The issue here is that the Prediction Problem formalization confuses the goal of IL as a whole with the goal of one step in this process, ML. The goal of IL is, indeed, building a dependence which can be used for prediction. But the goal of the ML step is, and could only be, inner consistency.
One may still object that if we get a general statement (a decision rule) out of some assumptions and a limited number of observations in the course of IL, we are still relying on finite, limited experience, as in the case of induction.
There are two critical differences between learning as induction and IL:
- Empirical testing is a built-in feedback loop of IL. IL-type learning is aware of its limitations, is able to see the results critically, expects flaws and knows how to deal with them. This is why I call it "intelligent".
- The original insight and its mediation incorporate prior accumulated subject knowledge, adding an outside justification of the decision rule.
IL works best when knowing the general tendency is beneficial, and some inevitable errors will not lead to catastrophic events.
4.2 Logical meaning of testing on new data
The test can neither confirm nor falsify a probabilistic decision rule. However, it can assess the plausibility of the expected performance of the rule.
In practical terms, the testing can signal overfitting.
Let us consider an example. Suppose the decision rule gave mostly correct answers on training, and we would be satisfied with a similar rate of correct classification in the future. On the 50 test facts, however, we observe a substantially higher frequency of error. This does not mean that we cannot achieve the desired accuracy in the future. But what is the probability of it?
The answer is provided by the Hoeffding inequality [11]:
$$\Pr(\nu - p \geq \varepsilon) \leq e^{-2 \varepsilon^2 n},$$
where $\nu$ is the frequency of error in our test set, $p$ is the expectation of error, $n$ is the test sample size, and $\varepsilon$ is the error in estimating $p$ by the frequency. In this case, we are interested in the possibility that $p$ is at the satisfactory level, which would mean that the deviation of the test estimate is at least $\varepsilon = \nu - p$.
Substituting the numbers into the bound, we get a probability too small to count on: the rule is unlikely to have a satisfactory performance.
This difference between the expected probability of mislabeling and the frequency of mislabeling on the test sample is too large to ignore. It means there is an error somewhere in this IL cycle. One has to go back, find this error and fix it.
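The calculation can be sketched as follows. The numbers here are hypothetical stand-ins of my own (a hoped-for 5% true error against an observed 20% error frequency on 50 test facts), since the original figures are not reproduced above; only the one-sided Hoeffding bound itself is standard.

```python
import math

def hoeffding_bound(nu, p_target, n):
    """One-sided Hoeffding bound on the probability of observing an error
    frequency nu on n test facts when the true error is at most p_target."""
    eps = nu - p_target               # the deviation we are asking about
    return math.exp(-2 * eps ** 2 * n)

# Hypothetical numbers: hoping for a 5% true error, we saw 20% on 50 facts.
bound = hoeffding_bound(nu=0.20, p_target=0.05, n=50)
print(round(bound, 3))  # upper bound on the probability the rule is still fine
```

With these illustrative numbers the bound is around one in ten, too low to count on the rule being satisfactory.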
5 New formalization of ML
Having found the place and role of ML in IL, I am ready to start developing a new formalization of ML. The goal is to define the problem solved by actual, working learners.
I distinguish two types of learners, pointwise and stepwise, defined precisely below. Examples of pointwise learners are SVM, neural networks and such. The stepwise type includes decision trees, Naive Bayes and similar learners.
A pointwise learner can be defined by its class of functions and criteria of fit. Stepwise learners differ in the ways they define the subdomains and in the way they assign the class label on each subdomain.
From observation of the popular pointwise learners, despite obvious differences, their criteria appear to have common features. Generalizing the criteria will formalize the learning problem and allow us to understand the specifics of each learner.
For generality, we assume a pointwise learner goes through a two-step process of producing a decision rule: first, it generates a real-valued decision function on $X$; then it applies a labeling transformation, obtaining the decision rule.
Definition 1 (Labeling).
Given a function $g: X \to \mathbb{R}$, two class labels $l_1, l_2$ and two thresholds $t_1 \leq t_2$, the labeling transformation is defined by the formula
$$f(x) = \begin{cases} l_1, & g(x) \leq t_1 \\ l_2, & g(x) \geq t_2 \\ u, & \text{otherwise,} \end{cases}$$
where $u$ may be any value or undetermined. The values of the thresholds are determined by each algorithm.
For a function $g$, denote by $L(g)$ the result of the labeling transformation.
The function $f = L(g)$ is called the decision rule.
A hypothetical case for a decision rule $f$ is a case $(x, f(x))$ for an arbitrary data point $x$.
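Definition 1 can be sketched directly in code. This is a schematic rendering under my reading of the definition; the labels and thresholds below are arbitrary illustrative values.

```python
def labeling(g, l1, l2, t1, t2, undetermined=None):
    """Labeling transformation: turn a real-valued decision function g
    into a decision rule with two thresholds t1 <= t2."""
    def rule(x):
        v = g(x)
        if v <= t1:
            return l1
        if v >= t2:
            return l2
        return undetermined   # the value in the gap is left undetermined
    return rule

# Illustrative choice: labels -1 / 1, thresholds -0.5 and 0.5.
rule = labeling(lambda x: x, l1=-1, l2=1, t1=-0.5, t2=0.5)
print(rule(-2.0), rule(0.0), rule(2.0))
```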
Definition 2 (Scaling).
Given a class of functions $F$ and a labeling transformation, a function $s: \mathbb{R} \to \mathbb{R}$ is a scaling transformation if it is either nondecreasing or nonincreasing.
Denote by $X_l$ the projected domain of the class label $l$: the set of data points which the decision rule maps to the label $l$.
5.1 Criteria of decision quality
The main part of the original insight in IL is an expectation that the dependence we want to find is "learnable" from a finite training set. The goal of the machine learning step is to find a hypothesis most reconciled with the facts, in a way consistent with the assumed learnability.
This implies that the classes are relatively easily separable: there are not many borderline cases, and function values on the points of each class are close to each other.
I propose two general criteria evaluating the desired qualities of a hypothesis. A hypothetical decision function is evaluated at the data points of the training sample. For a given fact:
- direct loss evaluates how often and how much the decision function misses the threshold of the correct class;
- proximity loss evaluates how close the data points of facts are to the projected domains of the opposite class.
Here are the exact definitions.
Definition 3 (Direct loss).
Given a training set $T$ and labeling thresholds $t_1, t_2$, the direct loss of a decision function $g$ is defined up to learner-specific parameters: a norm $\|\cdot\|$ and a nondecreasing scaling $s$.
- For a fact $c = (x, y)$, the direct loss is $DL(g, c) = s(m(c))$, where $m(c)$ is the amount by which $g(x)$ misses the threshold of the correct class $y$ (zero if the threshold is met).
- For the training sample, $DL(g, T) = \bigl\| \bigl( DL(g, c) \bigr)_{c \in T} \bigr\|$.
Definition 4 (Proximity loss).
Given a training set $T$, the proximity loss of a decision function $g$ is defined up to learner-specific parameters: a norm $\|\cdot\|$, a nonincreasing scaling $s$, and a distance $\rho$ on the domain $X$.
- For a fact $c = (x, y)$, the proximity loss is $PL(g, c) = s(\rho(x, X_{\bar{y}}))$, where $X_{\bar{y}}$ is the projected domain of the opposite class.
- For the training sample, $PL(g, T) = \bigl\| \bigl( PL(g, c) \bigr)_{c \in T} \bigr\|$.
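Both criteria can be sketched for one concrete choice of the learner-specific parameters. Everything below is an illustrative instantiation of my own choosing: labels $-1$/$1$ with thresholds $-1$ and $1$, the $\ell_1$ norm (a plain sum), the identity scaling for the direct loss, the nonincreasing scaling $s(d) = \max(0, 1 - d)$ for the proximity loss, and the projected domain of the opposite class approximated on a finite grid.

```python
def direct_loss(g, facts):
    """Direct loss (l1 norm, identity scaling): sum over facts of how far
    g misses the threshold of the correct class (thresholds -1 and 1)."""
    total = 0.0
    for x, y in facts:
        miss = max(0.0, 1.0 - g(x)) if y == 1 else max(0.0, g(x) + 1.0)
        total += miss
    return total

def proximity_loss(facts, rule, grid, scaling=lambda d: max(0.0, 1.0 - d)):
    """Proximity loss (l1 norm, nonincreasing scaling): sum over facts of
    the scaled distance from the fact's data point to the projected
    domain of the opposite class, approximated on a finite grid."""
    total = 0.0
    for x, y in facts:
        opposite = [z for z in grid if rule(z) not in (None, y)]
        if opposite:
            total += scaling(min(abs(x - z) for z in opposite))
    return total

facts = [(-2.0, -1), (2.0, 1), (0.5, 1)]
print(direct_loss(lambda x: x, facts))   # only the borderline fact is penalized
```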
5.2 Learnability and Robustness
The direct and proximity losses are criteria of learnability: the lower the values of these criteria, the better the separation between the classes. And one has to test learnability, because it was a necessary assumption about the dependence, a prerequisite to start learning.
It is interesting that the criteria have another interpretation.
The direct loss criterion is small on a learner-mislabeled fact if a close function value would make the case correctly classified.
The proximity loss criterion is low on a fact if, for every close data point, the classification is the same.
The proximity and direct losses on the whole training set measure the overall burden of losses on all facts.
This means both criteria indicate robustness of the decision rule, in the sense that small changes in the data will not make the decision rule worse, but can make it better.
Robustness may be considered a necessary component of learnability. If the relationship between the features and the class is not robust, it is not learnable. On the other hand, if the training set is "representative enough", new data would have small differences from the data already in the training set. It would be interesting to investigate the relationship between learnability and robustness formally.
5.3 The Conjecture
The main conjecture of this research is that all learners minimize the direct and/or proximity loss with learner-specific scaling transformations and norms. The criteria are based on the assumption that the data are easily separable: points of each class shall be close to each other, and points of different classes shall not be close to each other. The lower the criteria values for a decision function, the more it agrees with the original learnability and separability assumptions. So, minimization of these criteria may be viewed as optimization of class separation.
Optimal Class Separation Problem.
Given:
- the training sample $T$;
- a function class $F$;
- the class labels and the thresholds of the labeling function;
- the norms and scaling functions of the proximity loss and the direct loss criteria.
To find:
a decision function minimizing the direct loss and/or the proximity loss on the training set $T$.
It is easy to see that the empirical risk is a case of the direct loss with the $\ell_1$ norm and an indicator scaling, which counts each missed threshold as one error.
Therefore, the ERM learner solves the Optimal Class Separation Problem and corroborates the conjecture.
In the rest of the text, I will show that the conjecture is corroborated by such different learners as decision trees, kNN, Naive Bayes, SVM and LASSO.
6 Decision trees
One can distinguish two types of learners by the kind of functions they build:
- a pointwise learner builds a function on the data points of $X$;
- a stepwise learner builds a function on subdomains of $X$.
Consider a typical stepwise learner, the decision tree [8]. The learner starts with the whole domain and splits it into two subdomains by a value of some feature. Then the procedure is repeated for every subdomain until some stopping criterion is reached. At this point, the subdomain is called a "leaf", and a class label is assigned to it. This label is determined by voting of all the facts with data points in the leaf; no other facts participate.
It is convenient to call “leaf” any subdomain where a stepwise learner assigns a label.
Assume the class labels are $-1$ and $1$. The voting decision function on a leaf $S$ is calculated as the average of the labels of the facts with data points in $S$:
$$g(S) = \frac{1}{n_S} \sum_{i:\, x_i \in S} y_i,$$
where $n_S$ is the number of facts with data points in $S$. Another way to present this function: the sign of $g(S)$ coincides with the prevalent sample class in the leaf $S$. This is why the procedure is called voting. On each leaf, the decision function belongs to the class of two constant functions; on the leaves where $g(S) = 0$ (a tie), the learner does not output any answer.
The labeling function has the threshold $0$, and the decision rule on the leaf is $\mathrm{sign}(g(S))$.
The facts outside a leaf do not participate in the calculation of the decision function of a stepwise learner. And for the facts inside the leaf, the locations of the data points do not play any role. Therefore, the proximity loss, which is based on distances, cannot be calculated. The next theorem shows that the voting procedure minimizes a direct loss criterion.
Theorem 1.
A stepwise learner with the voting procedure minimizes the direct loss defined with the $\ell_1$ norm and the indicator scaling in each subdomain where a label is assigned.
Proof.
For a constant decision function $f \equiv l$ on a leaf $S$, the direct loss with the $\ell_1$ norm and the indicator scaling equals the number of facts in the leaf misclassified by $f$:
$$DL(f, S) = |\{i :\, x_i \in S,\ y_i \neq l\}| = n_S - n_l,$$
where $n_S$ is the number of facts with data points in the leaf $S$ and $n_l$ is the number of those with the label $l$. This quantity is minimized by choosing $l$ to be the prevalent sample class in the leaf $S$, which is exactly what the voting procedure does. ∎
Theorem 1 shows that a stepwise learner with voting solves the Optimal Class Separation Problem in each leaf and, therefore, supports the conjecture.
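The claim can be checked numerically on a toy leaf. The sketch below, my own illustration with labels $-1$/$1$, confirms that the vote attains the minimal direct loss among the two constant labels:

```python
from collections import Counter

def leaf_vote(labels):
    """Voting procedure: assign the prevalent sample class of the leaf."""
    return Counter(labels).most_common(1)[0][0]

def leaf_direct_loss(labels, assigned):
    """Direct loss with l1 norm and indicator scaling: the number of
    facts in the leaf whose label differs from the assigned one."""
    return sum(1 for y in labels if y != assigned)

labels = [1, 1, -1, 1, -1]       # labels of the facts falling into one leaf
vote = leaf_vote(labels)
losses = {l: leaf_direct_loss(labels, l) for l in (-1, 1)}
print(vote, losses)              # the vote minimizes the loss
```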
7 kNN
The eponymous "neighbors" of a data point $x$ are the $k$ facts with data points closest (in a selected metric) to $x$ among the facts of the training set.
The neighborhood of $x$ is a minimal subdomain including all these data points, taken from a (nonspecified) class of subdomains of $X$.
There are two classes. The learner:
- defines the class of the neighborhood by voting among the class labels of the $k$ neighbor facts, and
- assigns this class to the data point $x$.
As I demonstrated on the example of decision trees, the voting procedure minimizes the direct loss on the neighborhood.
The two-step process can be explained by the fact that one cannot say anything about the loss at the single data point $x$. So, kNN goes around this by assigning the class to the whole neighborhood instead, as a stepwise algorithm would do. Compared with many other stepwise algorithms, kNN is somewhat more flexible because it finds the neighborhood "surrounding" each new data point.
So, the kNN algorithm does solve the Optimal Class Separation Problem as the main part of its procedure and, therefore, corroborates the main conjecture of this work.
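The kNN procedure described above fits in a few lines. A minimal one-dimensional sketch, with labels $-1$/$1$ and the absolute difference as the metric (all of it my own toy illustration):

```python
def knn_classify(x, facts, k=3):
    """kNN: vote among the class labels of the k facts whose data points
    are closest to x, and assign the winning class to x."""
    neighbors = sorted(facts, key=lambda c: abs(c[0] - x))[:k]
    votes = sum(y for _, y in neighbors)    # labels are -1 / 1
    return 1 if votes > 0 else -1

facts = [(0.0, -1), (0.2, -1), (0.9, 1), (1.1, 1), (1.3, 1)]
print(knn_classify(1.0, facts), knn_classify(0.1, facts))
```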
8 Naive Bayes
A learner has training set as an input and decision function the decision rule as its output.
Besides learners, there are metalearners, or learners aggregation procedures. A metalearner takes as input decision rules or decision functions from some learners and outputs a new decision function (and or) decision rule.
It is assumed that there are two classes: Naive Bayes has both: learners and a metalearner. The learner assumes that every feature is either discrete or discretized, it has a finite number of values. The procedure is accomplished in two steps:

1. For each feature , for each value , the frequency of the class is calculated among the facts with value of the feature .

2. For each data point the algorithm calculates the decision function .

3. The algorithm assigns the class at the data point using the labeling transformation with the thresholds
The function calculated in the first step is the decision function of the stepwise voting procedure, assigning the class to the subdomain defined by the feature being equal to its th value.
Step 2 of the algorithm aggregates the decision functions obtained by the learners in the first step.
I demonstrated that the voting procedure is a minimization of the direct loss. Therefore, as far as learning is concerned, Naive Bayes solves the Optimal Class Separation problem for each subdomain and, therefore, supports the main conjecture.
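The two steps can be sketched concretely. This is a minimal illustration, assuming labels in {0, 1} and a naive product aggregation of the per-feature class frequencies (all names and the exact aggregation are mine, not the paper's):

```python
from collections import defaultdict

def naive_bayes_scores(train, x):
    """Two-step Naive Bayes sketch for labels in {0, 1}.

    Step 1 (learners): for each feature j and value v, count class
    frequencies among the facts whose feature j equals v.
    Step 2 (metalearner): aggregate the per-feature frequencies by a
    naive product into one score per class.
    """
    n_features = len(train[0][0])
    counts = [defaultdict(lambda: [0, 0]) for _ in range(n_features)]
    for point, label in train:                      # step 1
        for j, v in enumerate(point):
            counts[j][v][label] += 1
    score = [1.0, 1.0]
    for c in (0, 1):                                # step 2
        for j, v in enumerate(x):
            total = sum(counts[j][v])
            score[c] *= counts[j][v][c] / total if total else 0.5
    return score      # assign class 1 when score[1] exceeds score[0]

train = [((0, 0), 0), ((0, 1), 0), ((1, 1), 1), ((1, 0), 1)]
naive_bayes_scores(train, (1, 1))   # favors class 1
```

Each per-feature frequency table is exactly a stepwise voting learner on the subdomain where the feature takes a given value; the product is one possible metalearner.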
9 Linear SVM
All the previous learners belong to the “oral prehistory” of machine learning. Their authors are not known or, at least, not famous.
SVM is one of the first learners associated with a known author: it was invented by V. Vapnik. The earliest English works on this subject were published in the early nineties [5], [4].
The algorithms analyzed above used stepwise learning at least as one of the steps. For stepwise learning, the consensus between facts and hypothetical cases with close data points is guaranteed in each subdomain. Splitting the domain into subdomains allows one to ignore the proximity loss.
The SVM was the first classification algorithm I know of which explicitly uses the proximity loss as a criterion for selecting the decision function.
I will deal only with the linear SVM here, for simplicity. The decision function of the linear SVM is found in the class
There are two class labels: and the threshold of the labeling transformation
The learner is defined not by a procedure or a formula for the hypothesis (as NN is, for example) but by the optimization problem it solves.
The problem is formulated as (Linear SVM):
where
9.1 Linear algebra context
Here are some relevant facts from linear algebra and some new terminology for discussing the SVM.

Denote the linear decision function $f(x) = \langle w, x \rangle + b$. Denote $d(x)$ the shortest distance from the hyperplane $f(x) = 0$ to the data point $x$:
$$d(x) = \frac{|f(x)|}{\lVert w \rVert}. \qquad (1)$$
It is important that this fact is true regardless of the norm in the vector space. In SVM, the Euclidean norm is assumed. This distance plays an important role in the SVM concept.

A single hyperplane may be defined with different parameters.
The decision functions with identical separating hyperplanes may be called congruent (). The functions are congruent if and only if there exists a positive scaling coefficient

So, instead of the class it is convenient to use subclasses where there is a one-to-one correspondence between functions, hyperplanes, and decision rules. The functions in these classes may be called normalized.

Denote the set of facts correctly recognized by the decision rule. It is convenient to call these facts accepts.

SVM uses the accepts of each function to normalize it. Denote the class of functions which satisfy the condition

Every function in has one and only one congruent function in
Indeed, take Consider
and Then, if are the parameters of the function the function with parameters will belong to the class . Any other function congruent to will not belong to .

Consider a fact where This is the lowest absolute value of the function on any accept. According to the formula (1), the distance of the data point to the separating hyperplane is
This distance is the lowest among the accepts. Accepts satisfying these conditions are called support vectors.

Among the functions in with identical sets of accepts, the function with the smallest norm has the largest distance between the support vectors and the hyperplane. The original idea of SVM was to find a linear separation of classes, when it exists, that maximizes the distance of the accepts to the separating hyperplane.
This explains why SVM minimizes But where the idea of maximizing the distance to the hyperplane came from was never made clear.

In the class of functions the minimal absolute value of the decision function on any accept is 1. It means, there are effectively two thresholds of the labeling transformation: The hypothetical cases where are considered not labeled. This labeling transformation for the class will be denoted It is defined on the thresholds:
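The facts above can be verified numerically. This is a small sketch (my own naming) using the Euclidean point-to-hyperplane distance |f(x)| / ||w||: a congruent rescaling of the parameters changes the decision function but neither the hyperplane nor the distance, and on a support vector of a normalized function the distance is 1 / ||w||.

```python
import math

def f(w, b, x):
    """Linear decision function f(x) = <w, x> + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def dist_to_hyperplane(w, b, x):
    """Euclidean distance from x to the hyperplane f(x) = 0: |f(x)| / ||w||."""
    return abs(f(w, b, x)) / math.sqrt(sum(wi * wi for wi in w))

w, b = (1.0, 1.0), -1.0
sv = (1.0, 1.0)                    # f(sv) = 1: a support vector of a normalized function
d1 = dist_to_hyperplane(w, b, sv)              # equals 1 / ||w||
d2 = dist_to_hyperplane((2.0, 2.0), -2.0, sv)  # congruent rescaling: same hyperplane
assert abs(d1 - d2) < 1e-12        # the distance does not depend on the scaling
```

This is why minimizing the norm over the normalized class maximizes the margin of the support vectors.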
Taking into account these facts and notations, the problem may be reformulated (Linear SVM):
This definition of the problem can be further simplified: the slack variables, as well as the conditions, can be eliminated.
Theorem 2.
The Linear SVM problem in is equivalent to the following problem (SVM.1):
Proof.
For a function , the component of Linear SVM objective with the slack variables
(2) 
is subject to :
The conditions can be rewritten as
(3) 
or
Therefore
(4) 
Generally
∎
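After the slack variables are eliminated, the objective takes the familiar unconstrained hinge-loss form. The sketch below assumes the standard formulation, min ||w||²/2 + C · Σ max(0, 1 − yᵢ f(xᵢ)); the names and constants are mine and may differ from the paper's exact SVM.1 statement:

```python
def hinge(y, fx):
    """Eliminated slack variable: xi = max(0, 1 - y * f(x))."""
    return max(0.0, 1.0 - y * fx)

def svm_objective(w, b, facts, C=1.0):
    """Unconstrained SVM.1-style objective (standard hinge-loss form):
    ||w||^2 / 2 + C * sum of hinge losses over the training facts."""
    f = lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b
    reg = 0.5 * sum(wi * wi for wi in w)       # the norm term
    return reg + C * sum(hinge(y, f(x)) for x, y in facts)

facts = [((2.0, 2.0), 1), ((0.0, 0.0), -1)]
svm_objective((0.5, 0.5), -1.0, facts)   # both facts lie on the margin: loss term is 0
```

When every fact satisfies the margin conditions, only the norm term remains, which is the case the original constrained formulation describes with all slack variables at zero.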
Now I reverse-engineer the SVM to show that it solves the Optimal Class Separation problem.
9.2 Direct Loss criterion in SVM
The criterion is defined with the norm and a nondecreasing scaling
In general case, the direct loss of a decision rule on a fact is defined by the formula:
Substituting the SVM scaling transformation and the thresholds
of the labeling transformation we get
(5) 
Direct loss of the decision function on the sample is
The next theorem shows that if the first term of the SVM problem is omitted, the SVM problem turns into minimization of the SVM direct loss.
Theorem 3.
For every
Proof.
By definition
So, we just need to prove that for every for every
There are 4 possible options, depending on the class label of and the assigned label :

From the formula (5), In this case, and
∎
The theorem demonstrates that the second component of the SVM.1 objective is the direct loss on the training set.
9.3 Proximity loss criterion for SVM
For this criterion, the distance on is Euclidean, the norm is , and the scaling transformation is
In general case, the proximity loss of a function on an accept is defined through the distance to the domain of the opposite class
The domain of the class 1 is defined by the inequality , and the domain of the opposite class satisfies the condition
From the general definition, for an accept ,
For all other cases,
In the class of functions for an accept , the distance to the domain of the opposite class
where is the distance to the hyperplane , and is the distance from the hyperplane to a hyperplane
So,
The proximity loss on the training set is:
Theorem 4.
Proof.
Since the proximity loss is nonnegative for every fact and is nonzero only for accepts,
In the class , so
∎
Theorem 5.
Linear SVM is equivalent to the Optimal Class Separation problem
Proof.
It follows from Theorems 2, 3, and 4. ∎
So, Linear SVM supports the main conjecture of this work as well.
10 Conclusions
Here is the main takeaway from this work so far.

What is the Intelligent Learning (IL) cycle?
IL is a cycle of hypothesis development that includes ML as a step. IL starts with an insight and ends with testing of the hypothesis.

What is ML?
ML is an automated process reconciling aspects of the hypothesis, including the training set and assumptions about the dependence between features and output. The ML step produces an explicit representation of the hypothesis as a function.

What is common between classification learners?
The main conjecture of this work is that every classification learner solves the Optimal Class Separation problem, minimizing the criteria of direct and/or proximity loss. These criteria serve as a formalization of the concept of learnability, which is the main assumption about the dependence.

Why are learners so different from each other?
There are pointwise and stepwise learners, and every learner has its own parameters of the criteria.

What counts as a success for the ML step?
If the decision rule has low values of the learnability criteria, the ML step is successful.

What counts as a success of the IL cycle as a whole?
The decision rule has low loss on new (test) data.

Why does IL work with relatively small data?
IL works if and only if it starts with correct ideas about reality. IL can work even without data: sometimes, an intelligent agent can simply guess the decision rule. This is what happened with an example of an ML problem extensively studied in [11]: how to predict whether a papaya is tasty. In the end, the authors just give us the rule, without going through the pain of feature specification, data gathering, and ML itself.
On the other hand, if some of the assumptions were wrong, then even with an infinitely large training set the results on the test set will be poor.
The approach toward understanding ML proposed here allows one to formulate many more questions.

From the analysis of learners like NN, Naive Bayes, decision trees, and Linear SVM, it appears that the conjecture is correct, and the Optimal Class Separation problem generalizes all the different problems that classification learners solve. But I am skeptical. Why would there be only two different criteria? It would be interesting to find a (well-working) learner which contradicts the conjecture.

ML works when the properties of the distribution agree with the assumptions at the foundation of the selected learner. For proper learner selection, it is important to discover and state explicitly the assumptions about the distribution for every learner.

The criteria of learnability formulated here (direct and proximity loss) are closely related to the concept of robustness of an approximation problem: small variations in the data shall not lead to large changes in the solution. It would be interesting to explore the relationship between learnability and robustness.

It appears that the proximity loss is a relatively late invention: earlier learners did not use this criterion. What are the advantages of adding the proximity loss?

The empiric loss criterion used in ML theory is a Boolean function for every data point, and it fits classical logic just fine. For both the direct and the proximity loss, the estimate of truthfulness at each point is not binary, and the proximity loss takes into account spatial relationships and distances. It would be interesting to understand the logic with two such criteria of truthfulness.

While every step of the IL cycle may be considered deduction, the philosophical logic behind a feedback loop is not well researched, as far as I know. It would be interesting to understand it better in the context of the theory of knowledge.

I demonstrated with an example that testing can probabilistically invalidate a decision rule. When and how does testing support the decision rule? Generally, what kind of conclusions about the decision rule can we draw from testing it on new data?

The current work concerns only the classification problem. What about other ML problems, like regression, ranking, and clustering? Is it possible to discern learnability criteria that the learners for these problems optimize?
Further research is needed to answer those questions.
References
[1] Induction and Deduction. Georgia Tech, Machine Learning. https://youtu.be/pqXASFHUfhs.
[2] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin. Learning From Data. AMLbook.com, 2012.
[3] D. Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, NY, 2012.
[4] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In COLT '92: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 1992.
[5] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995.
[6] L. Devroye, L. Gyorfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, 1996.
[7] A. Esteva, B. Kuprel, R. Novoa, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542:115–118, 2017.
[8] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2009.
[9] G. W. F. Hegel. Delphi Collected Works of Georg Wilhelm Friedrich Hegel. Delphi Classics, 2019.
[10] K. R. Popper. The Logic of Scientific Discovery. Martino Publishing, CT, 2014.
[11] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning. Cambridge University Press, NY, 2014.
[12] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.