Toward Theory of Applied Learning. What is Machine Learning?

06/16/2020
by   Marina Sapir, et al.

Various existing approaches to formalizing the machine learning (ML) problem are discussed. The concept of Intelligent Learning (IL) as a context of ML is introduced. IL is described following the traditions of Hegel's logic. A general formalization of classification as an Optimal Class Separation problem is proposed. The formalization includes two criteria introduced here, direct loss and proximity loss. It is demonstrated that k-NN, Naive Bayes, decision trees, and linear SVM solve the Optimal Class Separation problem.

1 Introduction

Machine Learning (ML) exists in two forms: theoretical and applied. It appears that they know very little about each other.

Theory can answer important questions about the amount of data sufficient to solve a problem (sample complexity). It even proposes a universal algorithm for solving learning problems (“empiric risk minimization”) [11].

Applications avoid using this universal algorithm. The procedures used in applications are not derived from theory and are very different from each other. For example, Nearest Neighbors is a relatively “intuitive” DIY-type recipe, while SVM is formulated as a convoluted optimization problem.

Besides, the data used in applications usually would not be considered sufficient from the theoretical point of view.

Yet applied ML works. Theory does not explain why and when the empiric algorithms work with the small data available, what they have in common, and why they are so different.

To answer these questions, I look at ML from the point of view of the theory of knowledge. I describe the philosophical context of ML as an Intelligent Learning (IL) process with a feedback loop. In particular, I argue that no part of this process can be considered as induction.

This view, from outside of ML itself, allows me to propose a conjecture that ML problems can be seen as minimization of two general criteria defined here. I demonstrate that both k-NN and linear SVM, as well as decision trees and Naive Bayes learners, corroborate this conjecture: they optimize these criteria, each with a unique set of parameters.

2 Formalization of the learning problem

What is ML? Most textbooks give a vague, poetic description or explain it by examples [11], [2], [8]. Here is one concise and rather formal description [3]:

Given a set of data $D = \{(x^n, y^n), n = 1, \dots, N\}$ the task is to learn the relationship between the input $x$ and the output $y$ such that, given a new input $x^*$ the predicted output $y^*$ is accurate. The pair $(x^*, y^*)$ is not in $D$ but assumed to be generated by the same unknown process that generated $D$. From a probabilistic modeling perspective, we are therefore concerned primarily with the conditional distribution $p(y \mid x, D)$.

This is how the problem is usually understood; however, I will consider only two-class classification here. I rewrite it as a definition of the problem with more convenient notation and terminology.

A case is a pair $(x, y)$ of a data point $x$ (feature values) and a class label $y$. Denote by $X$ the metric space of the data points and by $Y$ the set of class labels. For a sequence of cases $S$, denote by $S_x$ the data points of these cases and by $S_y$ the class labels of these cases.

Prediction Problem

Given:

the training set $S$, a sequence of i.i.d. cases from an unknown distribution $D$ on $X \times Y$.

To find:

A function $f: X \to Y$ such that $f(x) = y$ for any new case $(x, y)$.

It is convenient to call the cases in the training set $S$ “facts”.

The machine which, given an input of the problem, produces a (potential) solution is called a learner. For the purposes of this work, a learner will be defined by its input and output, regardless of its inner workings.

The learner’s output is called a “decision rule”.

2.1 The Prediction Problem is ill posed

It is an ill posed problem because there is no way to get what needs to be found from what is given in the Prediction problem. And even if we found a solution, we have no way of knowing it.

There are two major sources of uncertainty which make the problem ill posed.

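  1. Probabilistic uncertainty: Prediction of the class in the data points $S_x$ of the training sample is, generally, impossible.

    Consider a new case $(x, y)$ such that $x = x_i$ for some fact $(x_i, y_i)$. If the distribution gives both class labels a nonzero probability at this data point, it is possible that $y \neq y_i$.

    So, the problem does not allow one to predict with any certainty the class in data points of the training sample.

  2. Uncertainty of extrapolation: Prediction of the class outside of the training data points $S_x$ is impossible.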

    For any data point, not present in the training sample, the problem statement does not give a clue about its class. There is no information about the relationship between the facts and the cases with different data points.

The problem, as it is formulated above, does not have enough information to solve it.

From the point of view of philosophical logic, the Prediction Problem is the problem of induction: generalizing the single facts or “basic statements” [10]. The rational impossibility of induction was convincingly demonstrated by Hume, and the problem was never fully resolved in philosophy. As in the famous example used to illustrate Hume's argument, if you have always seen only white swans, it does not mean the next one will be white as well.

The impossibility of induction does not stop ML practitioners or theoreticians because, as I will show, they actually solve different problems. The theory and the applications found different paths around the absurdity of the formalization of ML as the Prediction problem.

The next section shows how Statistical Learning reformulates the problem to prove the existence of the solution and to propose a universal algorithm to find it.

3 Statistical Learning Theory Approach

I discuss here the well developed PAC learning theory [11]. The labels here are assumed to be

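The following criteria are used in the theory to evaluate the hypotheses.

For a function $h: X \to Y$, its general loss on the distribution $D$ is defined by the formula

$L_D(h) = P_{(x, y) \sim D}[h(x) \neq y].$

The functional

$L_S(h) = \frac{1}{m} \left| \{ i : h(x_i) \neq y_i \} \right|$, where $S = ((x_1, y_1), \dots, (x_m, y_m))$ is the training sample,

is called the empiric loss or empiric risk of $h$.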

PAC Learning Theory does not have an explicit definition of the PAC learning problem. Going back from the results, one may conclude that “PAC learning” means solving the following problem:

PAC Learning Problem

Given:

  • A class of functions $H$ (called the hypothesis class)

  • A sequence $S$ of $m$ i.i.d. cases from an unknown distribution $D$

  • Thresholds $\epsilon, \delta \in (0, 1)$

To find:

a function $h \in H$ which “probably” has “approximately” the lowest generalization error $L_D$ in the class $H$ for the distribution $D$:

$L_D(h) \le \min_{h' \in H} L_D(h') + \epsilon$

with probability at least $1 - \delta$.

On top of it, the theory is set to solve the following meta-problem.

PAC Learning Meta-problem

Given:

  • A class of functions $H$

  • The thresholds $\epsilon, \delta$

  • A learner for the PAC learning problem

To find:

the size $m$ of the training set sufficient for a solution of the PAC learning problem with these thresholds and using this learner, regardless of the underlying distribution $D$.

PAC Learning view of ML is different from the Prediction problem in important aspects:

  1. It assumes the hypothesis class is given.

  2. It does not try to guess the next label or each label. It does not even try to guess the most likely labels most of the time. Instead, it tries to find a function in the hypothesis class which has almost as few (or as many) errors on the distribution as the best function in the class.

  3. The meta-problem assumes that the learner is chosen upfront as well.

  4. Unlike the Prediction problem, the meta-problem assumes that the size of the training set is not fixed, and one may increase it as needed. Otherwise, why would we want to know the sufficient size of the training set?

Essentially, PAC learning assumes that, besides the distribution, the only thing we do not know is the training set. This is exactly the opposite of the Prediction problem. While the Prediction problem requires building the learning machine to work with the available data, the objective of PAC learning is to determine the need for data, based on the selected learning machine and class of functions.

The PAC Learning problem deals with one learner: Empiric Risk Minimization (ERM). This learner finds the function in the given class which minimizes the empiric risk. It turns out, this learner is entirely sufficient for PAC learning.
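As an illustration of the ERM learner just described, here is a minimal sketch. It assumes labels in {0, 1}, a 0-1 loss, and a small finite hypothesis class of one-feature threshold rules; the helper `threshold_rule`, the toy sample and the candidate thresholds are all invented for this example and do not come from the theory.

```python
from typing import Callable, List, Sequence, Tuple

Case = Tuple[float, int]            # (data point, class label in {0, 1})
Hypothesis = Callable[[float], int]


def empirical_risk(h: Hypothesis, sample: Sequence[Case]) -> float:
    """Fraction of cases in the sample that h mislabels (0-1 loss)."""
    return sum(h(x) != y for x, y in sample) / len(sample)


def erm(hypotheses: Sequence[Hypothesis], sample: Sequence[Case]) -> Hypothesis:
    """Return a hypothesis with the lowest empirical risk on the sample."""
    return min(hypotheses, key=lambda h: empirical_risk(h, sample))


def threshold_rule(t: float) -> Hypothesis:
    """Hypothetical one-feature rule: label 1 when x >= t, otherwise 0."""
    return lambda x: int(x >= t)


# Toy training sample and a small finite hypothesis class (both invented).
sample: List[Case] = [(0.1, 0), (0.4, 0), (0.5, 1), (0.8, 1), (0.9, 1)]
thresholds = [0.0, 0.25, 0.45, 0.7, 1.0]
hypothesis_class = [threshold_rule(t) for t in thresholds]

best = erm(hypothesis_class, sample)
risks = [empirical_risk(h, sample) for h in hypothesis_class]
```

Note that the learner only compares functions inside the given class: it says nothing about how the class itself was chosen, which is the point made in item 1 of the list above.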

3.1 PAC Learning resolves the issues of Prediction problem

The Fundamental Theorem of Statistical Learning (FTSL) states that, regardless of the distribution $D$, there is a number $m_0$, depending on the VC dimension of the class $H$ and on the thresholds $\epsilon, \delta$, such that a function with the minimal empiric risk solves the PAC problem with these parameters if the size $m$ of the training sample satisfies $m \ge m_0$.

In other words, the FTSL states that using ERM on the training sample solves the PAC Learning problem, provided the training sample is large enough for the given class of functions and the quality thresholds.

PAC learning achieves the ambitious goal of finding a solution for every ML problem, as the theory understands it, regardless of the underlying distribution.

It resolves all the issues plaguing the “Predictive learning” formalization of ML:

The “probabilistic uncertainty” becomes irrelevant with the assumption of available unlimited data. If we can have unlimited data, we can evaluate the probabilities as well as we need.

The issue of “extrapolation uncertainty” is resolved by fixing the class of functions and approximating the best function in the class, rather than the actual distribution.

It is interesting that, from the point of view of philosophy, the PAC learning problem is a problem of deduction, rather than induction as the Prediction problem is. PAC Learning goes from general knowledge (the given class of functions and the learner) and single facts in the training set to specific knowledge (a function from the class with good enough parameters).

3.2 PAC learning is not an answer for applications

The theory not only states that a sufficient size of the training sample exists, it also gives an upper bound for it. There are estimates for the values of

. For example, for very reasonable parameters ( accuracy with of certainty and not more than features), you just need cases in the sample.

To appreciate how big it is, let us look at what counts as big data in applications.

The Stanford University skin cancer study [7] is considered a triumph of the big data analysis. The authors proudly say in the abstract:

Here we demonstrate classification of skin lesions using a single CNN, trained end-to-end from images directly, using only pixels and disease labels as inputs. We train a CNN using a dataset of 129,450 clinical images — two orders of magnitude larger than previous datasets — consisting of 2,032 different diseases.

The feature vectors in the study represent pixels on images. There have to be many more than 10 pixels, so the VC dimension of the solution has to be much higher than 10. It means that the training set is many orders of magnitude smaller than what is considered sufficient by PAC learning. Nevertheless, the solution found in this study is considered satisfactory [7]:

The CNN achieves performance on par with all tested experts … demonstrating an artificial intelligence capable of classifying skin cancer with a level of competence comparable to dermatologists.

By health care industry standards, the data are huge as they are. There are not millions of cases of skin cancer available for study in the whole world.

Let us notice that, from the business point of view, ML with a theoretically sufficient training set is usually either impossible or counterproductive.

  • In many problems, we need to find a dependence existing only in limited space and time, so it is impossible to wait a long time until more data become available. Often, the general population is finite and relatively small, as in the example of skin cancer.

  • If obtaining the information about the cases is not free, accumulating a data set of the recommended size would require unlimited resources.

  • But even if the general population is sufficient, and the company can afford to gather the number of facts required by the FTSL, a training sample this large already describes the master distribution in very fine detail, making ML pointless. People do ML exactly because it allows one to find important answers without waiting many years for the class labels of millions of cases to become known.

Theory texts usually do not stress the expectations for the size of the training set in statistical learning theory. A rare example where the reliance on an indefinitely increasing sample is stated explicitly, with some justification, is [6]. Here is how they formalize the learning problem they solve:

Given an infinite sequence of increasing samples $S_1 \subset S_2 \subset \dots$ from an unknown distribution $D$ on $X \times Y$, to output the sequence of corresponding hypotheses $h_1, h_2, \dots$ such that

$L_D(h_n) \to L_D(h^*)$ as $n \to \infty$,

where $L_D$ denotes the generalization error on $D$ and $h^*$ is the Bayes classifier for the distribution $D$.

The learner which generates the sequence of hypotheses with this property regardless of the distribution is called “consistent”. Thus, their formalization of learning itself is based on the idea that $n \to \infty$.

The authors give three reasons for their formalization. Here are their arguments with my counter arguments.

  1. PL: “Consistent rule guarantees us that taking more samples essentially suffices to roughly reconstruct the unknown distribution of (X, Y)”

    MS: Consistency only guarantees that taking ever more and more samples (cases) indefinitely leads to good results.

  2. PL: “Without this guarantee, we would not be motivated to take more samples.”

    MS: Consistency can only motivate one to increase the training data to infinity, which is impossible. Regardless, one rarely needs an encouragement to get more data when possible.

  3. PL: “We should be careful and not impose conditions on (X, Y) for the consistency of a rule, because such conditions may not be verifiable.”

    MS: The conditions on the distribution may be unverifiable, but the fact that the training set does not go to infinity is indisputable. It means, the consistency of the learner is of no practical value for applied problems.

Vapnik formulated the reason for existence of the asymptotic theory of learning in the most direct way [12]:

Why do we need an asymptotic theory if the goal is to construct algorithms from a limited number of observations? The answer is as follows: To construct any theory one has to use some concepts in terms of which the theory is developed

In other words, one has to work with the tools one has, whether they solve the problem at hand or not. Statistics has laws of large numbers, and this is what one uses in statistical learning theory. Perhaps the lack of a suitable apparatus for understanding the true problem with finite data is the root of this huge divergence between the theory and the applications.

Rather than following in Vapnik’s steps, I want to propose a formalization of the ML problem closer to the practical situation. I start by putting it within the context of the Intelligent Learning Cycle.

4 Intelligent Learning Cycle

The idea mostly follows Hegel, who was especially interested in the logic of the notion, the development of thoughts. Hegel’s Theory of Thought is described by him in a nutshell (in [9], Logic, §83):

  1. In its immediacy, the notion is implicit and in its germ

  2. Its reflection and mediation, the being-for-itself and show of the notion;

  3. Its return on itself, and its development abiding by itself - the notion in and for itself

Hegel realized that some notions in some people’s heads may not agree with objective reality exactly. Yet, he did not suggest a path to resolve the issue in his logic.

To fix this gap, I add the fourth stage (which Hegel would hate). This fourth stage is empirical testing, when the hypothesis at the pinnacle of its development meets the objective reality it is supposed to reflect to assess a potential mismatch. With the empiric testing, the linear development of the notion is replaced with the reflective loop.

This empiric testing stage follows, with some caveats, Popper’s idea of “falsifiability” [10] as a characteristic of a scientific theory. As Popper noticed, empirical testing cannot confirm a theory. Assuming the randomness of reality, finite testing cannot falsify a hypothesis either. Rather, it improves our estimate of the certainty of the hypothesis. For example, if we thought that most swans are white, and the next thing we see is 10 black swans, it does not mean that our hypothesis is false. It just means that there have to be some concerns about its validity.

Popper explained the need for empirical testing from the point of view of practical applications. And this is how it is usually understood. For example, one cannot publish any paper with an application of ML without the results of testing the decision rules on new data.

Similarly to Hegel’s logic, there is no concept of empirical testing in statistical learning theory. The theory considers only the inner consistency of the hypothesis, minimizing the error on the training set, which itself was created as a step in the hypothesis development. The inner agreement is what is guaranteed by statistical learning theory. PAC learning promises us some possibility of prediction if we follow the advice about sample size.

The linear part of the hypothesis development goes through the deduction, from the most abstract and general, to the most specific. Yet, making the decision about the hypothesis quality and further actions is not a deduction. This stage can be understood as a dialogue between the hypothesis and reality, as well as the dialogue of the hypothesis with itself, including possible critical evaluation and correction of the hypothesis in this dialogue.

Intelligent Learning (IL)

  1. Birth of the hypothesis, or Original insight:

    • an understanding that an essential property of objects of interest needs to be predicted;

    • an expectation of the relevant qualities which can be observed

    • an anticipation of a dependence between the essence and observed qualities

    • an assumption that the dependence is “learnable”.

  2. Mediation of the hypothesis:

    • mediation through the data:

      • refinement of the idea of what to observe, feature engineering;

      • gathering and curating the training sample to represent the whole distribution;

    • mediation through selecting a class of functions;

    • mediation through the criteria of fit between a decision rule and facts;

  3. Self-realization of the hypothesis, or ML:

    • producing the decision rule maximizing the selected criteria of fit between a function from the hypothesis class and training sample;

    • if the inner consistency is not satisfactory, the process has to return to the previous stages.

  4. Empirical testing of the hypothesis on new data

    This step includes testing the decision rule on new cases. If the degree of corroboration of the hypothesis is not satisfactory, the process of hypothesis development has to go back to the previous stages to improve the hypothesis.

The purposes of the ML step in IL are:

  • to maximize inner consistency of the hypothesis, reconcile its mediations,

  • to decide if the hypothesis is satisfactory,

  • to make the hypothesis explicit and applicable for making predictions.

In other words, ML is used to prepare the hypothesis to face the reality.

4.1 Deduction, not induction

Hegel [9], of course, saw the evolution of notions as a deductive process going from the most general and implicit to the most specific and explicit. Hume was very convincing in proving the illogicality of induction. Popper agreed that, contrary to popular belief, the development of scientific thought is not induction, but deduction [10]. Some of Popper’s ideas turned out to be very influential. However, this idea that learning shall be considered as deduction was not influential among ML scientists.

For example, Vapnik [12] tried to find principles of induction to build correct decision rules, which would be an equivalent of finding a philosopher’s stone. ML as induction is even popularized in a very funny Udacity video [1].

I want to stress that no part of IL can be considered as inductive reasoning.

Some may argue that original insight comes from the observations and experience. First of all, insight is outside of logic, because it is more of an anticipation, hope, rather than a statement. As such, it does not need a justification.

Second, insight cannot appear as a result of observations, because without this insight in the first place, one would not know what to look for, and so one would observe everything and nothing in particular.

When it concerns the insight leading to a practical application of ML, the sources usually are (a) a desire to predict something important, (b) previously accumulated general knowledge, and (c) a belief in the intelligibility of our world.

The mediation step is the step of refinement and specification of (1) the type of dependence anticipated in the insight, (2) the formalization of the features to observe, and (3) the preparation of the data set. All these processes are parts of the inner development and specification of the original insight.

Consider the ML step now. A learner is a mapping $A$ from the Boolean algebra of all finite subsets of cases into the set of decision functions. It may be considered as a collection of implications $S \Rightarrow A(S)$. Given a training set $S$, a learner (a) picks the one implication which corresponds to this training set, and (b) outputs the consequent of this implication, $A(S)$. The procedure (b) uses modus ponens,

$S, \; S \Rightarrow A(S) \;\vdash\; A(S),$

which is a formal deductive method.

Of course, the same can be said of the empirical testing step. On this step, a general decision rule is applied to a data point of a new case to receive a specific class label on this particular data point.

IL explains the “miracle” of success in applied ML despite the fact that the Prediction problem cannot be solved and induction cannot be justified logically. The issue here is that the Prediction problem formalization confuses the goal of IL as a whole with the goal of one step in this process, ML. The goal of IL is, indeed, building a dependence which can be used for prediction. But the goal of the ML step is, and could only be, inner consistency.

One may still object that if we get a general statement (decision rule) out of some assumptions and limited number of observations in course of IL, we are still relying on the finite limited experience as in the case of induction.

There are two critical differences between the learning as induction and the IL:

  1. Empirical testing is a built-in feedback loop of IL. IL-type learning is aware of its limitations, is able to see the results critically, expects flaws and knows how to deal with them. It is why I call it “intelligent”.

  2. The original insight and its mediation incorporate prior accumulated subject knowledge and add an outside justification of the decision rule.

IL works best when knowing general tendency is beneficial, and some inevitable errors will not lead to catastrophic events.

4.2 Logical meaning of testing on new data

The test can neither confirm, nor falsify the probabilistic decision rule. However, it can assess plausibility of expected performance of the rule.

In practical terms, the testing can signal overfitting.

Let us consider an example. Suppose, on training, the decision rule gave correct answers. We would be satisfied with correct classification in the future. On the 50 test facts we have of error. It does not mean that we can not get these accuracy in future. But what is the probability of it?

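The answer is provided by the Hoeffding inequality [11]:

$P(|\nu - \mu| \ge \epsilon) \le 2 \exp(-2 n \epsilon^2),$

where $\nu$ is the frequency of error in our test set, $\mu$ is the expectation of error, $n$ is the test sample size, and $\epsilon$ is the error in estimating $\mu$ by the frequency $\nu$. In this case, we are interested in the possibility that the expected error $\mu$ does not exceed the level we would be satisfied with, which would mean that the error of the estimate on the test sample is at least the difference between the observed frequency $\nu$ and that level.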

Substituting all the numbers, we get

So, the probability that the rule will have a satisfactory performance is too low to count on.

This difference between the expected probability of mislabeling and the frequency of mislabeling on this test sample is too large to ignore. It means there is an error somewhere in this IL. One has to go back, find this error and fix it.
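The calculation in this example can be sketched in a few lines. The sketch uses the two-sided Hoeffding bound for a 0-1 loss, $2 \exp(-2 n \epsilon^2)$; the test sample size of 50 is taken from the example, while the gap of 0.2 is a hypothetical value chosen only for illustration, not a number from the text.

```python
import math


def hoeffding_bound(n: int, eps: float) -> float:
    """Upper bound on P(|test error frequency - expected error| >= eps)
    for n i.i.d. test cases with 0-1 loss (two-sided Hoeffding inequality)."""
    if n <= 0 or eps <= 0:
        raise ValueError("n and eps must be positive")
    # A probability never exceeds 1, so cap the bound for small n or eps.
    return min(1.0, 2.0 * math.exp(-2.0 * n * eps ** 2))


# Hypothetical gap between the observed error frequency on the test set
# and the error level we would have been satisfied with.
n_test = 50
gap = 0.2
bound = hoeffding_bound(n_test, gap)
print(f"P(deviation >= {gap}) <= {bound:.4f}")
```

With these assumed values the bound is roughly 0.037, so a deviation that large would be an unlikely accident. The bound shrinks exponentially in the test sample size, which is why even a modest test set can signal overfitting.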

5 New formalization of ML

Having found the place and role of ML in IL, I am ready to start developing a new formalization of the ML. The goal is to define the problem solved by actual, working learners.

Examples of the point-wise learners are SVM, neural networks and such. The step-wise type includes decision trees, Naive Bayes and similar learners.

The point-wise learner can be defined by its class of functions and criteria of fit. Step-wise learners differ in the ways they define the subdomains and in the way they assign the class label on each subdomain.

Observation of the popular point-wise learners suggests that, despite obvious differences, their criteria have common features. Generalizing the criteria will formalize the learning problem and allow us to understand the specifics of each learner.

For generality, we assume a point-wise learner goes through a two-step process of producing a decision rule: first, it generates a real-valued decision function on the domain of data points; then it applies a labeling transformation, obtaining the decision rule.

Definition 1 (Labeling).

Given a function two class labels and two thresholds labeling transformation is defined by formula

where may be any value or undetermined. The values are determined by each algorithm.

For , denote

The function is called decision rule.

A hypothetical case for a decision rule is a case
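The idea of a labeling transformation with two thresholds can be sketched as follows. The convention used here is an assumption made only for illustration: values at or below the lower threshold get one label, values at or above the upper threshold get the other, and values strictly between the thresholds are left undetermined. The names `make_decision_rule`, `t_low` and `t_high` are hypothetical.

```python
from typing import Callable, Optional

Label = int


def make_decision_rule(
    f: Callable[[float], float],
    low_label: Label,
    high_label: Label,
    t_low: float,
    t_high: float,
) -> Callable[[float], Optional[Label]]:
    """Turn a real-valued decision function f into a decision rule.

    Assumed convention (not taken from the text): values at or below t_low
    get low_label, values at or above t_high get high_label, and values
    strictly between the thresholds are left undetermined (None).
    """
    if t_low > t_high:
        raise ValueError("t_low must not exceed t_high")

    def rule(x: float) -> Optional[Label]:
        value = f(x)
        if value >= t_high:
            return high_label
        if value <= t_low:
            return low_label
        return None  # undetermined zone between the two thresholds

    return rule


# Usage with an invented linear decision function on a one-feature domain.
rule = make_decision_rule(lambda x: 2.0 * x - 1.0, 0, 1, -0.5, 0.5)
labels = [rule(0.0), rule(0.5), rule(1.0)]
```

Setting both thresholds to the same value removes the undetermined zone, which corresponds to a learner that always outputs a label.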

Definition 2 (Scaling).

Given a class of functions and a labeling transformation a function is a scaling transformation if the next conditions are true:

  1. the transformation is either non-decreasing

    or non-increasing:

Denote projected domain of the class

5.1 Criteria of decision quality

The main part of the original insight in IL is an expectation that the dependence we want to find is “learnable” from a finite training set. The goal of the machine learning step is to find the hypothesis best reconciled with the facts in a way consistent with the assumed learnability.

It implies that the classes are relatively easily separable: there are not many border-line cases, and function values on points of each class are close to each other.

I propose two general criteria evaluating the desired qualities of a hypothesis. A hypothetical decision function is evaluated in the data points of the training sample. For a given fact,

  • direct loss evaluates how often and how much the decision function misses the threshold of the correct class;

  • proximity loss evaluates how close the data points of facts are to the projected domains of the opposite class.

Here are the exact definitions.

Definition 3 (Direct loss).

Given a training set , labeling thresholds , direct loss of decision function is defined up to learner-specific parameter of a norm and an non-decreasing scaling

  • For a fact ,

  • For the training sample

Definition 4 (Proximity loss).

Given a training set , proximity loss of decision function is defined up to learner-specific parameter of a norm non-increasing scaling and a distance on the domain

  • For a fact

  • For the training sample

5.2 Learnability and Robustness

The direct and proximity loss are criteria of learnability: the lower the values of these criteria, the better the separation between the classes. And one has to test learnability, because it was a necessary assumption about the dependence, a prerequisite to start learning.

It is interesting that the criteria have another interpretation.

Direct loss criterion is small on a learner-mislabeled fact , , if a close function value would make the case correctly classified

The proximity loss criterion is low on a fact if for a close data point the classification is the same:

Proximity and direct loss on whole training set measure overall burden of losses on all facts.

It means, both criteria indicate robustness of the decision rule in the sense that small changes in data will not make the decision rule worse, but can make it better.

Robustness may be considered as a necessary component of learnability. If the relationship between the features and the class is not robust, it is not learnable. On the other hand, if the training set is “representative enough”, new data would have small differences from the data we already have in the training set. It would be interesting to investigate the relationship between learnability and robustness formally.

5.3 The Conjecture

The main conjecture of this research is that all learners minimize direct and (or) proximity loss with learner-specific scaling transformations and norms. The criteria are based on assumptions that the data are easily separable: points of each class shall be close to each other, points of different classes shall not be close to each other. The lower the criteria values for a decision function, the more it agrees with the original learnability and separability assumptions. So, minimization of these criteria may be viewed as optimization of class separation.

Optimal Class Separation problem

Given:

  • The training sample

  • Function class

  • The class labels and the thresholds of the labeling function ,

  • Norms and scaling functions of the proximity loss and the direct loss criteria.

To find:

a decision function minimizing direct loss and / or proximity loss with the training set .

It is easy to see that the empiric risk is a case of the direct loss with the norm and the non-decreasing scaling transformation

Therefore, ERM learner solves the Optimal Class Separation problem and corroborates the conjecture.

In the rest of the text, I will show that it is corroborated on such different learners as decision trees, k-NN, Naive Bayes, SVM and LASSO.

6 Decision trees

One can distinguish two types of learners by the kind of functions they build:

  1. a point-wise learner builds a function on the data points of

  2. a step-wise learner builds a function on the sub-domains of

Consider a typical step-wise learner, the decision tree [8]. The learner starts with the whole domain and splits it into two subdomains by a value of some feature. Then the procedure is repeated for every subdomain until some stopping criterion is reached. At this point, the subdomain is called a “leaf”, and a class label is assigned to it. This label is determined by a vote of all the facts with data points in the leaf; no other facts participate.

It is convenient to call “leaf” any subdomain where a step-wise learner assigns a label.

The voting decision function is calculated as

Another way to present this function is

the function value coincides with the prevalent sample class in the leaf . It is why is called voting procedure. The function is from the class of two constant functions On the leafs, where the learner does not output any answer.

The labeling function has the threshold The decision rule

The facts outside the leaf do not participate in the calculation of the decision function of a step-wise learner. And for the facts inside the leaf, the locations of the data points do not play any role. Therefore, the proximity loss, which is based on distances, cannot be calculated. The next theorem shows that the voting procedure minimizes a direct loss criterion.

Theorem 1.

A step-wise learner with the voting procedure minimizes the direct loss defined with the norm and scaling function in each subdomain where the label is assigned.

Proof.

where is the number of facts with data points in the leaf . The function coincides with the prevalent sample class in the leaf . Therefore,

where

Therefore, the

Theorem 1 shows that a step-wise learner with voting solves the Optimal Class Separation problem in each leaf and, therefore, supports the conjecture.
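
The voting procedure in a single leaf can be illustrated with a minimal Python sketch. It assumes class labels 0 and 1, and it uses the count of disagreements between a constant label and the facts of the leaf as a simplified stand-in for the direct loss. The names `vote` and `disagreements`, and the choice to return no answer on a tie, are assumptions made for this illustration only.

```python
from collections import Counter

def vote(labels):
    """Return the prevalent label in a leaf, or None for an empty or tied leaf."""
    if not labels:
        return None
    (top, n_top), *rest = Counter(labels).most_common()
    if rest and rest[0][1] == n_top:
        return None  # assumed behavior: a tied vote produces no answer
    return top

def disagreements(constant, labels):
    """Number of facts in the leaf whose label differs from the constant."""
    return sum(1 for y in labels if y != constant)

# Labels of the facts whose data points fall into one leaf (made-up data).
leaf_labels = [1, 1, 0, 1, 0]
winner = vote(leaf_labels)
losses = {c: disagreements(c, leaf_labels) for c in (0, 1)}
```

Here the prevalent class is the constant label with the fewest disagreements in the leaf, which is the sense in which voting minimizes a loss over the two constant functions.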

7 k-NN

The eponymous “neighbors” for a point are facts denoted as with data points closest (in a selected metric) to among

The neighborhood is a minimal sub-domain including all the data points from a (non-specified) class of subdomains on

There are two classes The learner

  1. defines the class of the neighborhood by voting among the class labels of the facts and

  2. assigns this class to the data point

As I demonstrated with the example of decision trees, the voting procedure minimizes the direct loss on the neighborhood

The two-step process can be explained by the fact that one cannot say anything about the loss in the data point So, k-NN goes around it by assigning the class to the whole neighborhood instead, as a step-wise algorithm would do. Compared with many other step-wise algorithms, k-NN is somewhat more flexible because it finds the neighborhood “surrounding” each new data point.

So, the k-NN algorithm does solve the Optimal Class Separation problem as the main part of its procedure and, therefore, corroborates the main conjecture of this work.
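
The two-step k-NN procedure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the metric is Euclidean, the labels are 0 and 1, and the function name `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, x, k=3):
    """Label x by voting among the k training facts closest to it.

    train is a list of (point, label) pairs; Euclidean distance is assumed.
    """
    # Step 1: the neighborhood is formed by the k closest facts, which then vote.
    neighbors = sorted(train, key=lambda fact: math.dist(fact[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    # Step 2: the winning class of the neighborhood is assigned to x.
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), 0), ((0.1, 0.2), 0), ((1.0, 1.0), 1),
         ((0.9, 1.1), 1), ((1.2, 0.8), 1)]
label_near_origin = knn_predict(train, (0.05, 0.1), k=3)
label_near_ones = knn_predict(train, (1.0, 0.9), k=3)
```

Choosing an odd `k` avoids tied votes when there are two classes.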

8 Naive Bayes

A learner takes a training set as input and outputs a decision function and (or) a decision rule.

Besides learners, there are meta-learners, or learner aggregation procedures. A meta-learner takes as input decision rules or decision functions from some learners and outputs a new decision function and (or) decision rule.

It is assumed that there are two classes: Naive Bayes has both: learners and a meta-learner. The learner assumes that every feature is either discrete or discretized, so it has a finite number of values. The procedure is accomplished in three steps:

  1. For each feature , for each value , the frequency of the class is calculated among the facts with value of the feature

  2. For each data point the algorithm calculates decision function

    .

  3. The algorithm assigns class in the data point using labeling transformation with the thresholds

The function calculated in the first step is the decision function of a step-wise voting procedure that assigns the class to the subdomain defined by feature being equal to its -th value.

Step 2 of the algorithm aggregates the decision functions obtained by the learners in the first step.

I demonstrated that the voting procedure is a minimization of the direct loss. Therefore, as far as learning is concerned, Naive Bayes solves the Optimal Class Separation problem for each subdomain and, therefore, confirms the main conjecture.
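
The split between per-feature learners and a meta-learner can be sketched as follows. Step 1 follows the description above: for every feature value, the share of class-1 facts is computed. The aggregation step is an assumption made for illustration (a comparison of products of the per-feature frequencies); it is not claimed to reproduce the aggregation formula of this paper. Labels 0 and 1, the function names, and the toy data are hypothetical.

```python
from collections import defaultdict

def class_frequencies(X, y):
    """Step 1: for each (feature index, value), the share of class-1 facts."""
    counts = defaultdict(lambda: [0, 0])  # (j, value) -> [n_facts, n_class1]
    for row, label in zip(X, y):
        for j, value in enumerate(row):
            counts[(j, value)][0] += 1
            counts[(j, value)][1] += label
    return {key: n1 / n for key, (n, n1) in counts.items()}

def aggregate(freq, row):
    """Assumed meta-learner: compare products of per-feature frequencies."""
    p1 = p0 = 1.0
    for j, value in enumerate(row):
        f = freq.get((j, value), 0.5)  # unseen value: treated as uninformative
        p1 *= f
        p0 *= 1.0 - f
    return 1 if p1 > p0 else 0

X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "hot")]
y = [1, 1, 0, 1]
freq = class_frequencies(X, y)
prediction = aggregate(freq, ("sunny", "mild"))
```

Each entry of `freq` is the output of one voting learner on the subdomain where a feature takes a given value; only `aggregate` combines them.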

9 Linear SVM

All the previous learners belong to the “oral prehistory” of machine learning. Their authors are not known, or, at least, not famous.

SVM is one of the first learners associated with a known author: it was invented by V. Vapnik. The earliest English works on this subject were published in the early nineties [4], [5].

The algorithms analyzed above used step-wise learning at least as one of the steps. For step-wise learning, the consensus between facts and hypothetical cases with close data points is guaranteed in each subdomain. Splitting the domain into subdomains allows one to ignore the proximity loss.

The SVM was the first classification algorithm I know of which explicitly uses the proximity loss as a criterion for selecting the decision function.

For simplicity, I will deal only with linear SVM here. The decision function of the linear SVM is sought in the class

There are two class labels: and the threshold of the labeling transformation

The learner is defined not by a procedure or formula of the hypothesis (as k-NN is, for example), but by the optimization problem it solves.

The problem, called Linear SVM, is formulated as:

where

9.1 Linear algebra context

Here are some relevant facts from linear algebra and some new terminology for discussing the SVM.

  1. The vector

    is orthogonal to the hyperplane

  2. Denote Denote the shortest distance from the hyper-plane to the data point

    (1)

    It is important that this fact is true regardless of the norm in the vector space. In SVM, the Euclidean norm is assumed. This distance plays an important role in the SVM concept.

  3. A single hyperplane may be defined with different parameters.

    The decision functions with identical separating hyperplanes may be called congruent (). The functions if and only if there exists a scaling coefficient

  4. So, instead of the class it is convenient to use sub-classes where there is a one-to-one correspondence between functions, hyperplanes, and decision rules. The functions in these classes may be called normalized.

  5. Denote the set of facts, correctly recognized by the decision rule }. It is convenient to call these facts accepts.

  6. SVM uses accepts of each function to normalize it. Denote class of functions which satisfy the condition

  7. Every function in has one and only one congruent function in

    Indeed, take Consider

    and Then, if are the parameters of the function the function with parameters will belong to the class . Any other function congruent to will not belong to .

  8. Consider a fact where This is the lowest absolute value of the function on any accept. According to the formula (1), the distance of the data point to the separating hyperplane is

    This distance is the lowest among the accepts. Accepts satisfying these conditions are called support vectors.

  9. Among the functions in with identical sets of accepts, the function with the smallest norm has the largest distance between the support vectors and the hyperplane. The original idea of SVM was to find a linear separation of classes, when it exists, that maximizes the distance of the accepts to the separating hyperplane.

    This explains why SVM minimizes But where the idea of maximizing the distance to the hyperplane came from was never made clear.

  10. In the class of functions the minimal absolute value of the decision function on any accept is 1. It means, there are effectively two thresholds of the labeling transformation: The hypothetical cases where are considered not labeled. This labeling transformation for the class will be denoted It is defined on the thresholds:
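
The normalization by accepts and the distance of the support vectors to the hyperplane can be checked numerically with a small Python sketch. It assumes a linear decision function with a weight vector `w` and an offset `b`, class labels -1 and 1, the Euclidean norm, and that a fact is an accept when its label and the sign of the decision function agree. All names and the toy data are hypothetical.

```python
import math

def f(w, b, x):
    """Linear decision function w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def distance_to_hyperplane(w, b, x):
    """Euclidean distance from x to the hyperplane w.x + b = 0."""
    return abs(f(w, b, x)) / math.hypot(*w)

def normalize(w, b, facts):
    """Rescale (w, b) so that the minimum of |f(x)| over the accepts is 1."""
    accepts = [x for x, y in facts if y * f(w, b, x) > 0]  # assumed accept test
    m = min(abs(f(w, b, x)) for x in accepts)
    return [wi / m for wi in w], b / m

facts = [((2.0, 2.0), 1), ((3.0, 2.0), 1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]
w, b = [2.0, 2.0], -4.0
w_n, b_n = normalize(w, b, facts)

# After normalization, the closest accepts lie at distance 1 / ||w_n||.
margin = 1.0 / math.hypot(*w_n)
```

Rescaling changes the parameters but not the hyperplane, so the distances of the data points to it stay the same; only the value of the decision function on the closest accepts is fixed at 1 in absolute value.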

Taking into account these facts and notations, the problem may be reformulated:

Linear SVM

This definition of the problem can be further simplified by eliminating the slack variables as well as the conditions.

Theorem 2.

The Linear SVM problem in is equivalent to the problem SVM.1:

Proof.

For a function , the component of Linear SVM objective with the slack variables

(2)

is subject to :

The conditions can be rewritten as

(3)

or

Therefore

(4)

Generally

Since all the terms of the sum (2) can achieve their minima (4) independently,
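
The elimination of the slack variables can be checked numerically. The sketch below assumes the standard soft-margin constraints, in which the slack of a fact must be non-negative and the product of the label and the decision function must be at least 1 minus the slack; the exact notation of the problem in this paper may differ. Under that assumption, the smallest feasible slack for each fact is the hinge value computed by `min_slack`. All names and data are hypothetical.

```python
def f(w, b, x):
    """Linear decision function w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def min_slack(w, b, x, y):
    """Smallest xi with xi >= 0 and y * f(x) >= 1 - xi (assumed constraints)."""
    return max(0.0, 1.0 - y * f(w, b, x))

def is_feasible(w, b, x, y, xi):
    """Check the two assumed constraints for one fact and one slack value."""
    return xi >= 0 and y * f(w, b, x) >= 1 - xi

w, b = [1.0, 0.0], 0.0
facts = [((2.0, 0.0), 1), ((0.5, 1.0), 1), ((-0.25, 0.0), 1), ((-3.0, 2.0), -1)]
slacks = [min_slack(w, b, x, y) for x, y in facts]
```

Because each slack appears in the constraints of one fact only, every term of the sum can be minimized separately, and the sum of the slacks can be replaced by the sum of these per-fact minima.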

Now, I reverse-engineer the SVM to show that it solves the Optimal Class Separation problem.

9.2 Direct Loss criterion in SVM

The criterion is defined with the norm and a non-decreasing scaling

In the general case, the direct loss of a decision rule on a fact is defined by the formula:

Substituting the SVM scaling transformation and the thresholds

of the labeling transformation we get

(5)

Direct loss of the decision function on the sample is

The next theorem shows that if the first term in the SVM problem is omitted, SVM problem is transformed into minimization of the SVM direct loss.

Theorem 3.

For every

Proof.

By definition

So, we just need to prove that for every for every

There are 4 possible options, depending on the class label of and the assigned label :

  1. From the formula (5), In this case, and

  2. From the formula (5),

    In this case, and

  3. From the formula (5),

    In this case, so

    and

  4. In this case and

    From (5). as well.

The theorem demonstrates that the second component of the SVM.1 objective is the direct loss on the training set.

9.3 Proximity loss criterion for SVM

For this criterion, the distance on is Euclidean; the norm is The scaling transformation

In the general case, the proximity loss of a function on an accept is defined through the distance to the domain of the opposite class

The domain of the class 1 is defined by the inequality , and the points of the domain of the class -1 satisfy the condition

From the general definition, for an accept ,

For all other cases,

In the class of functions for an accept , the distance to the domain of the opposite class

where is the distance to the hyperplane , and is the distance from the hyperplane to a hyperplane

From the formula (1)

So,

The proximity loss on the training set is:

Theorem 4.

Proof.

Since the proximity loss is non-negative for every fact, and is non-zero only for accepts ,

In the class , so

Theorem 5.

Linear SVM is equivalent to the Optimal Class Separation problem

Proof.

It follows from Theorems 2, 3, and 4. ∎

So, Linear SVM supports the main conjecture of this work as well.

10 Conclusions

Here are the main takeaways from this work so far.

  1. What is the Intelligent Learning (IL) cycle?

    IL is a cycle of hypothesis development including ML as a step. IL starts with an insight and ends with testing of the hypothesis.

  2. What is ML?

    ML is an automated process reconciling aspects of the hypothesis mediation, including the training set and assumptions about the dependence between features and output. The ML step produces an explicit representation of the hypothesis as a function.

  3. What is common between classification learners?

    The main conjecture of this work is that every classification learner solves the Optimal Class Separation problem, minimizing the criteria of direct and (or) proximity loss. These criteria serve as a formalization of the concept of learnability, which is the main assumption about the dependence.

  4. Why are learners so different from each other?

    There are point-wise and step-wise learners, and every learner has its own parameters of the criteria.

  5. What counts as a success for the ML step?

    If the decision rule has low values of the learnability criteria, the ML step is successful.

  6. What counts as a success of the IL cycle as a whole?

    The decision rule has low loss on new (test) data.

  7. Why does IL work with relatively small data?

    IL works if and only if it starts with correct ideas about reality. IL can work even without data: sometimes, an intelligent agent can guess the decision rule. This is what happened with an example of an ML problem extensively studied in [11]: how to predict whether a papaya is tasty. In the end, the authors just give us the rule, without going through the pain of feature specification, data gathering, and ML itself.

    On the other hand, if some of the assumptions are wrong, the results on the test set will be poor even with an infinitely large training set.

The approach toward understanding ML proposed here allows one to formulate many more questions.

  1. From the analysis of learners like k-NN, Naive Bayes, decision trees and Linear SVM, it appears that the conjecture is correct, and the Optimal Class Separation problem generalizes all the different problems that classification learners solve. But I am skeptical. Why would there be only two different criteria? It would be interesting to find a (well working) learner which contradicts the conjecture.

  2. ML works when the properties of the distribution agree with the assumptions underlying the selected learner. For proper learner selection, it is important to discover and state explicitly the assumptions about the distribution for every learner.

  3. The criteria of learnability formulated here (direct and proximity loss) are closely related to the concept of robustness of an approximation problem: small variations in the data shall not lead to large changes in the solution. It would be interesting to explore the relationship between learnability and robustness.

  4. It appears that the proximity loss is a relatively late invention: earlier learners did not use this criterion. What are the advantages of adding the proximity loss?

  5. The empirical loss criterion used in ML theory is a Boolean function for every data point, and it fits classical logic just fine. For both direct and proximity loss, the estimate of truthfulness in each point is not binary, and for proximity loss it takes into account spatial relationships and distances. It would be interesting to understand the logic with two such criteria of truthfulness.

  6. While every step of the IL may be considered as deduction, the philosophical logic behind a feedback loop is not well researched, as far as I know. It is interesting to understand it better in the context of knowledge theory.

  7. I demonstrated with an example that testing can probabilistically invalidate a decision rule. When and how does testing support the decision rule? Generally, what kind of conclusions about the decision rule can we draw from testing it on new data?

  8. The current work concerns only the classification problem. What about other ML problems, like regression, ranking, clustering? Is it possible to discern some learnability criteria that the learners for these problems optimize?

Further research is needed to answer those questions.

References

  • [1] Induction and Deduction - Georgia Tech - Machine Learning. https://youtu.be/pqXASFHUfhs.
  • [2] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin. Learning From Data. AMLbook.com, 2012.
  • [3] D. Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, NY, 2012.
  • [4] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In COLT ’92: Proceedings of the fifth annual workshop on Computational learning theory.
  • [5] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273 – 297, 1995.
  • [6] Luc Devroye, Laszlo Gyorfi, and Gabor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag New York, 1996.
  • [7] A. Esteva, B. Kuprel, R. Novoa, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542:115–118, 2017.
  • [8] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2009.
  • [9] G. W. F. Hegel. Delphi Collected Works of Georg Wilhelm Friedrich Hegel. Delphi Classics, 2019.
  • [10] Karl R. Popper. The Logic of Scientific Discovery. Martino Publishing, CT., 2014.
  • [11] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning. Cambridge University Press, NY, 2014.
  • [12] V. N. Vapnik. The nature of statistical learning theory. Springer - Verlag, 1995.