It has recently been discovered [Ben-David et al. 2019] that machine learnability can be undecidable within ZFC (Zermelo-Fraenkel set theory with the Axiom of Choice). For a given degree of approximation and value of residual probability , learnability of a concept over a certain class of probabilities consists of the existence of a learning algorithm, a learner, and a number such that with independent observations from any one of the possible probabilities , the concept is approximated by the learner within an error of at most with probability larger than . [Ben-David et al. 2019] shows that there are situations in which learnability of a concept is independent of the ZFC axioms, even when restricting to finitely supported probabilities. It thus becomes relevant to explore conditions which ensure decidability of learning.
To this end, it is convenient to look at things from the opposite side: given a sample size , an algorithm is not an -learner if there exists a probability , in the class , which violates, with independent observations, the degree of approximation with a probability exceeding the bound on residual probabilities. In this way the problem is turned into that of the existence of a probability, within a certain class, satisfying certain additional conditions besides those required for it to be a probability (and to be in the prescribed class).
To explore decidability, we consider the so-called "discretization trick" ([Shalev-Shwartz and Ben-David 2014], Remark 4.1), according to which in virtually all concrete applications various limitations, such as using a computer to handle the data, introduce an a priori bound on the number of possible states of the system. We then show that in this case learnability can be expressed in terms of polynomial relations for the probabilities of specific events; we can then refer to the Tarski-Seidenberg Theorem for real closed fields [Bochnak, Coste, Roy 1998], which shows that existence of a solution for any finite set of polynomial relations is decidable. This provides an explicit, albeit computationally very expensive, algorithm for the determination of the sample complexity. Very accurate bounds have been developed for the sample complexity [Hanneke 2016], but, in view of [Ben-David et al. 2019], they provide no guarantee of decidability.
It is interesting to realize that the Tarski-Seidenberg Theorem does not indicate how to find a probability violating a given tentative learner , nor could it provide any exact method for it, as in general there is no closed-form method (by radicals) to determine solutions of polynomial equations of degree greater than or equal to 5. It is also interesting to notice that the theories of natural numbers [Gödel 1931] and of rational numbers [Robinson 1949] are not decidable; so, if we insisted on restricting to probabilities taking rational values, it would not be clear whether learnability is decidable. On the other hand, we are not interested in any such restriction, or in the exact determination of the probabilities, but only in their existence, in order to exclude a tentative learner , or in their nonexistence, in order to assess that is a learner; and that is exactly what the Tarski-Seidenberg Theorem guarantees. In other words, decidability is guaranteed if we discretize the inputs, but not the values of the modeling probabilities.
As a related topic, we mention that the issue of existence of a probability satisfying certain requirements can be given an interpretation in terms of Logic. In this context, one develops first the syntax of a logic, i.e. the allowed symbols and formulas; then, the collection of models in which the formulas can be interpreted represents the semantics of a logic. When models are probabilities of a certain class, then a formula is valid if it is true for all probabilities of the class: a logic for finite probabilities with rational coefficients has been developed in [Fagin, Halpern and Megiddo 1990]. Learnability can then be interpreted in terms of validity of the formulas expressing the fact that a certain is a learner. We briefly discuss and exploit this connection further in Section 4.
2 Polynomiality of learnability conditions and sample complexity
Sample complexity of agnostic PAC learning is defined as follows. Let be sets, indicating the set of features and labels, respectively, and let indicate a hypothesis set. To evaluate a hypothesis, we introduce a loss , and a loss function defined by . When a probability is defined on , and are two random variables taking values in and , respectively, with joint distribution , the average loss of a hypothesis is . For , a potential -learner is a function , where we denote . Given a class of probabilities on (possibly suitable subsets of) , and , an -learner with respect to is a potential -learner such that
for all . The class is agnostically PAC learnable with respect to if for every there exists such that for all there exists an -learner . The sample complexity of agnostically PAC learning the class with respect to is the minimum of such ’s.
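For orientation, the definitions above can be written out in the usual notation of [Shalev-Shwartz and Ben-David 2014]. The symbol names below ($\mathcal{X}$, $\mathcal{Y}$, $\mathcal{H}$, $\ell$, $L_P$, $A$, $\mathcal{P}$, $\epsilon$, $\delta$, $m$) and the strictness of the inequalities are assumptions made for this sketch, chosen to agree with the verbal description in the introduction:

```latex
% Assumed notation: features \mathcal{X}, labels \mathcal{Y}, hypotheses \mathcal{H},
% loss \ell, sample S = ((X_1,Y_1),\dots,(X_m,Y_m)) drawn i.i.d. from P.
\begin{align*}
  L_P(h) &= \mathbb{E}_P\big[\ell(h(X), Y)\big], \qquad h \in \mathcal{H}, \\
  A &: (\mathcal{X} \times \mathcal{Y})^m \to \mathcal{H}
      \quad \text{(potential $m$-learner)}, \\
  P^m\Big( L_P(A(S)) - \inf_{h \in \mathcal{H}} L_P(h) > \epsilon \Big) &< \delta
      \quad \text{for all } P \in \mathcal{P}.
\end{align*}
```

Under these assumptions, the sample complexity at $(\epsilon, \delta)$ is the smallest $m$ from which such a learner exists for every larger sample size.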
The sample complexity of agnostic PAC learnability with respect to the class of all probabilities with uniformly bounded support is decidable.
Suppose the support of each probability is bounded by some uniform constant , and consider first fixed finite sets and such that . In this case, the hypothesis class is finite as well. A finite hypothesis class is learnable ([Shalev-Shwartz and Ben-David 2014] Cor 4.6), hence for there exists such that for all (1) holds for some .
Next, for each , potential -learner , and hypothesis , we say that is not a -learner of if there exists a probability on such that
notice that is not a -learner if there is an such that is not a learner of ; we also say that a probability satisfying (2) violates . Condition (2) is polynomial in the probabilities ’s of some events , in the following sense: for , let be a copy of , and consider events , where are the realizations of the -th trial; next, let . Then, given , the existence of a probability satisfying (2) is easily seen to be equivalent to the existence of a solution of
System (3) is written in terms of unknowns ’s using polynomials and indicator functions. In order to apply the results of the next section, we need to eliminate the indicator functions. This can be done by considering subsets , and then observing that there is a solution to (3) if and only if there is a solution to at least one of the systems in the following collection labeled by ,
Since is an expected value, hence a linear condition, all the relations in each of the systems of the form (4) are polynomial in the variables ’s. We then have at most polynomial conditions to check. For each , this is decidable within the theory of real closed fields by Theorem 3.1 below. Hence the sample complexity of agnostic PAC learnability, with respect to the set of all probabilities on , is decidable.
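The polynomial nature of the violation condition can be made concrete on a toy instance. The sketch below is an illustration under stated assumptions, not the construction of systems (3) and (4): it takes binary features and labels, the 0-1 loss, all four hypotheses, and an ERM learner with a fixed tie-break (all names such as `erm`, `failure_probability` and `find_violator` are hypothetical). The failure probability is a sum over samples of products of point probabilities, hence a polynomial of degree equal to the sample size in the unknowns. The convention that a violation means failure probability at least the residual bound is also an assumption.

```python
from fractions import Fraction
from itertools import product

X = (0, 1)                                  # assumed feature set
Y = (0, 1)                                  # assumed label set
Z = tuple(product(X, Y))                    # observations (x, y)
H = tuple(product(Y, repeat=len(X)))        # hypotheses as tuples (h(0), h(1))

def true_loss(h, p):
    """Average 0-1 loss of h under the probability p on Z."""
    return sum(p[z] for z in Z if h[z[0]] != z[1])

def erm(sample):
    """Empirical risk minimizer; ties are broken by the order of H."""
    return min(H, key=lambda h: sum(h[x] != y for x, y in sample))

def failure_probability(learner, p, m, eps):
    """P^m(excess loss of learner(S) > eps), computed exactly by enumeration."""
    best = min(true_loss(h, p) for h in H)
    total = Fraction(0)
    for sample in product(Z, repeat=m):
        weight = Fraction(1)
        for z in sample:
            weight *= p[z]                  # product form: polynomial in the p[z]
        if weight and true_loss(learner(sample), p) - best > eps:
            total += weight
    return total

def grid_distributions(n):
    """Probabilities on Z whose values are multiples of 1/n."""
    for counts in product(range(n + 1), repeat=len(Z)):
        if sum(counts) == n:
            yield {z: Fraction(c, n) for z, c in zip(Z, counts)}

def find_violator(learner, m, eps, delta, n):
    """Search a rational grid for a probability violating the learner."""
    for p in grid_distributions(n):
        if failure_probability(learner, p, m, eps) >= delta:
            return p
    return None
```

A grid search of this kind can only exhibit a violating probability; returning `None` does not certify that the learner is valid, since the violating probabilities need not be rational or lie on the grid. Certifying nonexistence is exactly where a decision procedure for real closed fields is needed.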
Finally, consider any set , a finite set of labels , and one of the finite subsets such that ; is viewed as a possible support of a probability violating a potential -learner . Consider also injective maps from to , and from to , respectively, where and are the fixed sets considered above. The action of on observations from and the existence of a probability on violating are determined by systems of the form (4), which are preserved by the above injective maps (assigning probability zero to all points in not in the image of ). So, whether is a learner or not for given can be determined by whether or not there is a learner on , which we have seen is decidable. Hence, sample complexity of agnostic PAC learnability with respect to the class of all probabilities with support uniformly bounded by is decidable.
In general, the exact value of the sample complexity can only be determined by a systematic examination, which is guaranteed to terminate by the above theorem.
Consider learning the maximum in a binary space , which we can then assume to be . Suppose that the hypothesis class is and we use the ERM learner.
For and we have , while the standard upper bound based on Hoeffding's inequality gives .
For , the standard upper bound gives ; some known lower bounds give , and there is a matching upper bound but with an unknown constant [Hanneke 2016]; the sample complexity turns out to be .
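The "standard upper bound" for a finite hypothesis class can be evaluated directly. The sketch below assumes the form given in [Shalev-Shwartz and Ben-David 2014], Corollary 4.6, for a loss with values in [0, 1]; the function name and the sample inputs are hypothetical and are not the values used in the example above.

```python
import math

def hoeffding_sample_bound(num_hypotheses, eps, delta):
    """Upper bound ceil(2 * log(2|H| / delta) / eps^2) on the agnostic PAC
    sample complexity of a finite class, for a loss bounded in [0, 1]."""
    return math.ceil(2 * math.log(2 * num_hypotheses / delta) / eps ** 2)

# Illustrative call with assumed inputs: two hypotheses, eps = delta = 0.1.
bound = hoeffding_sample_bound(2, 0.1, 0.1)
```

Such a bound is only sufficient; the exact sample complexity obtained by systematic examination can be considerably smaller.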
3 Decidability of finitely many polynomial problems in finite probabilities by Tarski-Seidenberg Theorem
System (4) is a special case of a general situation which occurs often in elementary probability: there are an unknown probability ; a finite number of events, ; and then a finite number of polynomial relations that are to be satisfied by the probabilities of either the ’s or some of their boolean combinations.
Possibly using the disjunctive normal form, one can always reduce these problems to a collection of polynomial relations, equalities and inequalities, in variables which represent the probabilities of a finite set (or, equivalently, in terms of the probabilities of the atoms of the normal form).
Expressed in general terms, we arrive at a system of polynomial relations in the variables of the form
for some , where ’s are polynomials, , and stands for either of .
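The reduction to atoms can be sketched as follows. The code is a minimal illustration with hypothetical names and a made-up example system (independence of two events plus a lower bound on their union), not system (5) itself: the unknowns are the probabilities of the atoms generated by the events, the probability of any boolean combination is a sum of atom probabilities, and each relation becomes a polynomial equality or inequality in those unknowns.

```python
from fractions import Fraction
from itertools import product

def atoms(n):
    """All 2**n atoms of n events; atom[i] tells whether event i occurs."""
    return list(product((False, True), repeat=n))

def prob(event, q):
    """Probability of a boolean combination, as a sum over the atoms it contains."""
    return sum((q[a] for a in q if event(a)), Fraction(0))

def is_probability(q):
    """The constraints required for q to be a probability on the atoms."""
    return all(v >= 0 for v in q.values()) and sum(q.values()) == 1

def satisfies_example_system(q):
    """Example relations: P(A1 and A2) = P(A1) P(A2), P(A1 or A2) >= 3/4."""
    p1 = prob(lambda a: a[0], q)
    p2 = prob(lambda a: a[1], q)
    p12 = prob(lambda a: a[0] and a[1], q)
    p_union = prob(lambda a: a[0] or a[1], q)
    return is_probability(q) and p12 == p1 * p2 and p_union >= Fraction(3, 4)

# The uniform probability on the four atoms solves the example system.
uniform = {a: Fraction(1, 4) for a in atoms(2)}
```

Checking a candidate solution is straightforward; the decidability question concerns whether any real-valued solution exists at all.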
Theorem 3.1. The existence, or nonexistence, of a probability satisfying system (5) is decidable.
System (5) determines a semi-algebraic set, which is nonempty if and only if there are solutions satisfying all the relations. Whether a semi-algebraic set is empty or not is decidable with the following decision procedure. First, by the Tarski-Seidenberg Theorem a semi-algebraic set in is nonempty if and only if its projection on is nonempty (see e.g. [Basu, Pollack, Roy 2006], Theorem ); iterating, this procedure reduces the problem to semi-algebraic sets in . For these, every semi-algebraic set can be decomposed into finitely many basic semi-algebraic sets of the form , where is a polynomial, and is a collection of polynomials. Finally, whether each basic semi-algebraic set is nonempty can be determined by a General Law of Signs, which consists of checking the signs of suitable combinations of the coefficients of the polynomials (see e.g. [Basu, Pollack, Roy 2006], Lemma ). ∎
The last step is similar to the methods in Sturm's Theorem or Descartes' Rule of Signs.
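To illustrate the flavor of the one-variable step, the following sketch implements the classical Sturm sequence, which is the related method named above and not the General Law of Signs of [Basu, Pollack, Roy 2006]. It decides, using only exact sign computations on rational numbers, how many distinct real roots a polynomial has in an interval, without computing the roots. Polynomials are coefficient lists with the lowest degree first; all function names are hypothetical, and the endpoints are assumed not to be roots of the polynomial.

```python
from fractions import Fraction

def trim(p):
    """Drop trailing zero coefficients (coefficients are lowest degree first)."""
    p = list(p)
    while p and p[-1] == 0:
        p.pop()
    return p

def poly_rem(a, b):
    """Remainder of the division of a by the nonzero polynomial b."""
    a = [Fraction(c) for c in trim(a)]
    b = trim(b)
    while len(a) >= len(b):
        factor = a[-1] / b[-1]
        shift = len(a) - len(b)
        for i, c in enumerate(b):
            a[i + shift] -= factor * c
        a = trim(a)                       # the leading term has cancelled
    return a

def derivative(p):
    return [i * c for i, c in enumerate(p)][1:]

def sturm_sequence(p):
    """p, p', then negated remainders until the remainder vanishes."""
    seq = [trim(p), trim(derivative(p))]
    while seq[-1]:
        seq.append([-c for c in poly_rem(seq[-2], seq[-1])])
    return seq[:-1]                       # drop the final zero polynomial

def evaluate(p, x):
    result = Fraction(0)
    for c in reversed(p):                 # Horner's scheme
        result = result * x + c
    return result

def sign_changes(seq, x):
    values = [evaluate(q, x) for q in seq]
    signs = [(v > 0) - (v < 0) for v in values if v != 0]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)

def count_roots(p, a, b):
    """Distinct real roots of p in (a, b], assuming p(a) != 0 and p(b) != 0."""
    seq = sturm_sequence(p)
    return sign_changes(seq, a) - sign_changes(seq, b)
```

For instance, `count_roots([2, -5, 2], 0, 1)` asks whether the relation 2p² − 5p + 2 = 0 admits a solution with p between 0 and 1, which is the kind of existence question the decision procedure answers in the general multivariate case.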
4 Computational complexity
It is shown in [Fagin, Halpern and Megiddo 1990], Theorem that when the coefficients of the polynomials are rational, as they would be in any implementation, there is a procedure for deciding whether a polynomial weight formula is satisfiable in a (finite) probability space that runs in polynomial space. It is then easy to see that each of the systems (4) can be expressed as a polynomial weight formula in the language of [Fagin, Halpern and Megiddo 1990], Chapter : as the set of primitive propositions we take with boolean operations defined as usual for subsets of ; weight terms are , with linear operations and multiplications allowed to make formulas. The semantics of these weight formulas is then given by probabilities on the set , and hence [Fagin, Halpern and Megiddo 1990], Theorem applies to the decision problem of each of the systems (4).
In terms of the number of arithmetic operations, on the other hand, the implementation of Tarski-Seidenberg elimination and the General Law of Signs has very high complexity; a slightly better version is cylindrical decomposition (see e.g. [Basu, Pollack, Roy 2006], Ch. ), which is implemented in various software packages, but remains doubly exponential in the number of variables and of equations: for the System (5) it takes [Basu 2017] operations to decide whether a solution exists. So, the direct calculation of the sample complexity using this method is accessible only for problems with a very small a priori bound.
Expressed in terms of the number of pixels and colors in an image, the number of arithmetic operations needed to determine the sample complexity of learning a hypothesis class would be quadruply exponential; something of the order of for a , -color image and a sample of size .
5 Discussions and conclusions
We now make a partial exploration of the source of the undecidability found in [Ben-David et al. 2019] when learning is seen from the point of view of existence or nonexistence of probabilities satisfying suitable conditions. This examination is hindered in [Ben-David et al. 2019], as in that paper learning is equivalently expressed in terms of compression schemes. The onset of undecidability is partially elucidated by the following.
Let be a -learner of EMX for finitely supported probabilities defined on a model of ZFC satisfying CH; when extended to a model containing and satisfying , determines a system of the form (4), with , which admits a solution.
One example of such extension is obtained from the use of the forcing method [Cohen 1963].
Consider a model of ZFC satisfying CH. It is shown in [Ben-David et al. 2019] that there is an and a learner of EMX over the collection of finitely supported probabilities in . For any given finite collection which could be used as support of a finite probability violating , determines a finite number of systems of the form (4), with , none of which has a solution (since is a learner).
Consider now an extension of satisfying . Suppose the learner is extended to the sequences such that some of the ’s do not belong to M. Then the existence of a probability violating the extension of is also determined by systems of the form (4), with , but now there must be a solution for at least one of such systems, as the extension of cannot be a learner.
To summarize the results of the paper, we have shown that, aside from the very high computational complexity of the decision procedure, the exact determination of the sample complexity in agnostic PAC learning, including EMX, is decidable under the "discretization trick" (i.e. when the probabilities are known to be supported on a finite set with an a priori bounded size). This result contrasts with the undecidability of learning discovered in [Ben-David et al. 2019] for learning the maximum with respect to probabilities supported on a finite set (whose size has no a priori bound). We have also investigated the mechanism by which a learner developed in a model satisfying CH fails in any extension to a model in which CH ceases to hold.
- [Ben-David et al. 2019] Shai Ben-David, Pavel Hrubeš, Shay Moran, Amir Shpilka, and Amir Yehudayoff. Learnability can be undecidable. Nature Machine Intelligence 1, 1 (2019), 44-48.
- [Bochnak, Coste, Roy 1998] Jacek Bochnak, Michel Coste, and Marie-Françoise Roy. Real Algebraic Geometry. Translated from the 1987 French original. Revised by the authors. Ergebnisse der Mathematik und ihrer Grenzgebiete (3) [Results in Mathematics and Related Areas (3)], 36. Springer-Verlag, Berlin, 1998.
- [Basu 2017] Saugata Basu. Algorithms in Real Algebraic Geometry: A Survey. Panoramas & Synthèses 51 (2017), 107-153.
- [Basu, Pollack, Roy 2006] Saugata Basu, Richard Pollack, and Marie-Françoise Roy. Algorithms in Real Algebraic Geometry (Algorithms and Computation in Mathematics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
- [Cohen 1963] Paul J. Cohen. The Independence of the Continuum Hypothesis. Proceedings of the National Academy of Sciences of the United States of America 50, 6 (1963), 1143-1148.
- [E2006] Richard L. Epstein Classical Mathematical Logic: The Semantic Foundations of Logic, Princeton University Press.
- [Fagin, Halpern and Megiddo 1990] R. Fagin, J. Y. Halpern, and N. Megiddo. A Logic for Reasoning about Probabilities. Information and Computation 87, 1/2 (1990).
- [Gödel 1931] Kurt Gödel. Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme, I. Monatshefte für Mathematik und Physik 38 (1931), 173-198.
- [GKP1988] G. Georgakopoulos, D. Kavvadias, and C. H. Papadimitriou. Probabilistic satisfiability. Journal of Complexity, 4:1-11, 1988.
- [Hanneke 2016] S. Hanneke: The Optimal Sample Complexity of PAC Learning. Journal of Machine Learning Research 17 (2016) 1-15.
- [HM2001] Michiel Hazewinkel, ed. (2001). "Disjunctive normal form". Encyclopedia of Mathematics, Springer.
- [Robinson 1949] J. Robinson (1949). Definability and decision problems in arithmetic. The Journal of Symbolic Logic, 14, 98-114.
- [Shalev-Shwartz and Ben-David 2014] Shalev-Shwartz, S. and Ben-David, S. (2014) Understanding Machine Learning: From theory to algorithms. Cambridge: Cambridge University Press.