 # A Formal Proof of PAC Learnability for Decision Stumps

We present a machine-checked, formal proof of PAC learnability of the concept class of decision stumps. A formal proof has every step checked and justified using fundamental axioms of mathematics. We construct and check our proof using the Lean theorem prover. Though such a proof appears simple, a few analytic and measure-theoretic subtleties arise when carrying it out fully formally. We explain how we can cleanly separate out the parts that deal with these subtleties by using Lean features and a category theoretic construction called the Giry monad.


## 1 Introduction

We present a machine-checked, formal proof of PAC learnability of the concept class of decision stumps. In this work, the term formal proof carries a specific meaning that is distinct from a merely rigorous proof. The concept is best explained by quoting Thomas C. Hales from the December 2008 special issue on formal proof of the Notices of the AMS (Hales, 2008):

“A formal proof is a proof in which every logical inference has been checked all the way back to the fundamental axioms of mathematics. All the intermediate logical steps are supplied, without exception. No appeal is made to intuition, even if the translation from intuition to logic is routine.”

Although in principle such proofs could be written in human readable prose and checked manually, the only realistic way to achieve such a level of precision is to use a computer to help construct and check each step. The computer programs that help write and verify such proofs are called proof assistants. Roughly speaking, proof assistants provide a language to express mathematical statements and proofs in some logic, which are then fully checked.

Even with the help of such tools, the task of constructing a full formal proof may seem hopelessly daunting. However, in recent years, formal proofs have been written for challenging theorems such as the four color theorem (Gonthier, 2008), the Kepler conjecture (Hales et al., 2017), and the odd order theorem (Gonthier et al., 2013). Formal proofs can also be used to prove the correctness of computer programs. For example, it is possible to prove that a compiler for a programming language preserves the semantics of the programs being compiled (Leroy, 2009).

We used the Lean theorem prover to write our proof. In Lean, type theory serves as the foundation for mathematics. Lean has an associated mathematics library, mathlib, which develops mathematics from the ground up, abstractly. The mathlib library contains many results in analysis, topology, measure theory, category theory, and more. In section 3, we introduce Lean and mathlib in more detail.

### 1.1 Motivation

Why formalize proofs from statistical learning theory? We have two motivations.

First, formal proofs can help reduce the uncertainty about whether critical machine learning applications are behaving as intended. A machine learning application can fail to generalize for many reasons: maybe the training data is insufficient, maybe there is a flaw in the design of the learning algorithm, or maybe there is an error in the implementation of the algorithm. Such errors can go unnoticed for long periods of time. This is particularly worrisome for machine learning applications for things like loan requests or hiring recommendations. Recently, there has been a great deal of work on enforcing specific measures of fairness for machine learning applications. We believe that formal proofs that software adheres to these standards will become critical, given their importance (Kohli et al., 2019). Recent work on formal proofs of correctness for machine learning software is in its early days (Selsam et al., 2017; Bagnall and Stewart, 2019) and will eventually need to use more advanced results from statistical learning theory. In section 2, we review the existing work on formal proofs for machine learning and randomized algorithms more generally.

Second, we want to formalize theorems from statistical learning theory for the same reasons that some mathematicians have decided to work on the formalization of mathematics. Such reasons range from the simple pleasure of proving a beautiful theorem with absolutely all of its details, to the belief that as mathematics becomes more complex and rich, we will need the help of computers to make progress. When Fields medalist Vladimir Voevodsky was asked whether he thought that all mathematicians would end up using computers to create their proofs, he replied, “I can’t see how else it will go” (Rehmeyer, 2013).

### 1.2 Contribution

The theorem we prove – that the concept class of decision stumps is PAC learnable – may seem rather simple. Indeed, it is even simpler than the problem of learning axis-aligned rectangles, which is used as a motivating example and exercise in many introductory texts on learning theory (Kearns and Vazirani, 1994; Shalev-Shwartz and Ben-David, 2014; Mohri et al., 2018). Despite its simplicity, we argue that there is a lot to learn from formalizing this proof.

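To see why, let us sketch the proof. We define the set of examples to be $\mathcal{X} = \mathbb{R}^{+}$, the non-negative reals. The concept class $\mathcal{C}$ is the subset of $\mathcal{X} \to \{0, 1\}$ defined as $\mathcal{C} = \{\lambda x.\, \mathbb{1}(x \le c) \mid c \in \mathcal{X}\}$. (We use the notation $\lambda x.\, e$ to represent the function which takes an argument $x$ and returns the value $e$.) A function $c \in \mathcal{C}$ is called a decision stump. Because a decision stump is entirely determined by the boundary value it uses for decisions, we will refer to a stump and its boundary value interchangeably.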

We want to prove that the concept class is PAC learnable, which we state informally below.

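Theorem (informal). There exists a learning function $\mathcal{A}$ and a sample complexity function $m$ such that for any distribution $\mu$ over $\mathcal{X}$, any $c \in \mathcal{C}$, any $\epsilon \in (0, 1)$, and any $\delta \in (0, 1)$, when running the learning function $\mathcal{A}$ on $n \ge m(\epsilon, \delta)$ i.i.d. samples from $\mu$ labeled by $c$, $\mathcal{A}$ returns a hypothesis $h$ such that, with probability at least $1 - \delta$,

$$\mu(\{x \in \mathcal{X} \mid h(x) \ne c(x)\}) \le \epsilon.$$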

We will adapt the proof of PAC learnability for axis-aligned rectangles from Kearns and Vazirani (1994). (The proof for learning rectangles repeats the argument for decision stumps four times to establish bounds for the four sides of the rectangle and combines them with the union bound.) Given $\mu$ and $c$, consider labeled samples $(X_1, l_1), \dots, (X_n, l_n)$. The learning function $\mathcal{A}$ returns the hypothesis

$$\lambda x.\, \mathbb{1}(x \le \max\{X_i \mid l_i = 1\}).$$

The key idea in the proof is to find an interval $I$ such that, so long as at least one of the $X_i$ falls into $I$, the hypothesis returned by $\mathcal{A}$ will have error at most $\epsilon$. If $\mu[0, c] < \epsilon$, then the bound is trivial, so assume $\mu[0, c] \ge \epsilon$. Set $I = [\theta, c]$, choosing $\theta$ so that $I$ encloses exactly probability mass $\epsilon$ under $\mu$. If the boundary point selected by $\mathcal{A}$ is in $I$, then the error will be at most $\epsilon$, so for the error to be above $\epsilon$ means that none of our samples came from $I$. The probability of such an event is at most $(1 - \epsilon)^n$. Having bounded the probability that the error will be over $\epsilon$, the rest of the proof follows straightforwardly.

The careful reader may notice that there is one subtle step in the above: how do we choose $\theta$ to ensure that “$I$ encloses exactly probability mass $\epsilon$ under $\mu$”? The phrasing “encloses exactly” comes from Kearns and Vazirani (1994) (page 4), which does not say how to prove that $\theta$ exists, beyond giving some geometric intuition in which we visualize shifting the left edge of $I$ until the measure encloses the specified amount. Shalev-Shwartz and Ben-David (2014) similarly instruct us to select $\theta$ so that the measure “is exactly” $\epsilon$. (The cited references address the more general problem of axis-aligned rectangles instead of stumps, so more specifically they describe shifting the edge of a rectangle until the enclosed measure is $\epsilon/4$.)

Unfortunately, the argument is not correct, because such a $\theta$ may not exist. Since PAC learning is distribution free, we need to account for all distributions $\mu$, including those with a discrete component. Take $\mu$ to be a Bernoulli distribution, which concentrates all of its mass on two points: for suitable choices of its parameter, of the target $c$, and of $\epsilon$, no interval $[\theta, c]$ has mass exactly $\epsilon$, so the desired $\theta$ does not exist. (We are not the first to observe this error. The errata for the first printing of Mohri et al. (2018) point out the issue in the proof of Kearns and Vazirani (1994).)

Instead, the point we are looking for should be defined as

$$\theta = \sup\{x \in \mathcal{X} \mid \mu[x, c] \ge \epsilon\}$$

and we not only need to prove that $\theta$ satisfies $\mu[\theta, c] \ge \epsilon$ but also that $\mu(\theta, c] \le \epsilon$. The original proof of the PAC learnability of the class of rectangles did carefully define such a point (Blumer et al., 1989), as does the textbook by Mohri et al. (2018), although neither gives a proof for why the point defined this way has the desired properties. Indeed, Mohri et al. (2018) say that it is “not hard to see” that these properties hold.
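To see concretely how the “exact mass” requirement can fail while the supremum-based definition still works, consider the following small worked example. The specific values are assumptions chosen here for illustration: a fair Bernoulli distribution on the points $0$ and $1$, with target $c = 1/2$ and $\epsilon = 1/4$.

$$
\mu = \tfrac{1}{2}\delta_0 + \tfrac{1}{2}\delta_1, \qquad
\mu[\theta, c] =
\begin{cases}
\tfrac{1}{2} & \text{if } \theta = 0, \\
0 & \text{if } 0 < \theta \le c,
\end{cases}
$$

so no $\theta$ gives $\mu[\theta, c] = \tfrac{1}{4}$. With the supremum-based definition, however,

$$
\theta = \sup\{x \mid \mu[x, c] \ge \tfrac{1}{4}\} = \sup\{0\} = 0, \qquad
\mu[\theta, c] = \tfrac{1}{2} \ge \epsilon, \qquad
\mu(\theta, c] = 0 \le \epsilon .
$$

Both required properties hold even though the closed and half-open intervals have different mass, which is exactly the situation that an atom at $\theta$ creates.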

In fact, this turned out to be the most difficult part of the whole proof to formalize. While it only requires some basic results in measure theory and topology, it is nevertheless the most technical step of the argument. There were two other parts of the proof that seemed obvious on paper but turned out to be much more technically challenging than expected, having to do with showing that various functions are measurable. Often, details about measurability are elided in pencil-and-paper proofs. This is understandable, because these measurability concerns can be tedious and trivial, and checking that everything is measurable can clutter an otherwise insightful proof. However, many important results in statistical learning theory do not hold without certain measurability assumptions, as discussed by Blumer et al. (1989) and Dudley (2014, chapter 5).

In discussing the above errors, we are not trying to exaggerate their importance or suggest that they are serious. Indeed, in all cases, the proofs can be fixed, and the results follow from more general theorems about VC-dimension discussed later in the referenced books. Our point is rather to emphasize that even simple proofs in SLT can touch on subtle analytic issues. Formalizing these proofs is worthwhile to ensure there are no gaps or implicit assumptions, especially in light of the increasingly important applications mentioned in subsection 1.1.

A key component of our work is that we structure our formal proof in a manner that lets us separate out the high-level reasoning described in the sketch above from the low-level details about things like measurability. For example, in the above, we described $\mathcal{A}$ as a function from samples to a hypothesis. However, to analyze its probabilistic behavior, we actually need to consider the push-forward of the measure on samples that is induced by this function. Constantly having to lift functions to this push-forward measure can become tedious, especially when we need to compose several such functions together. Instead, we use a construction called the Giry monad (Giry, 1982), which lets us concisely describe learning algorithms in a pseudocode-like manner as a sequence of steps. Similarly, we use a Lean feature called typeclasses to be able to discuss operations on measure spaces without having to repeatedly specify which sigma-algebra structure we are using on the associated sets. Instead, these structures are automatically inferred by Lean based on context. The end result is that we are able to present a formal proof that captures the important high-level details, while still ensuring that all measure-theoretic subtleties are separately checked.

## 2 Related Work

As mentioned in the introduction, machine-checked proofs have been carried out for a wide range of mathematical results. Here, we mention some formalizations of results from probability theory or randomized algorithms.

Classic results about the average case behavior of quicksort and binary search trees have been formalized by a number of authors using different proof assistants (van der Weegen and McKinna, 2008; Eberl et al., 2018; Tassarotti and Harper, 2018). In each case, the authors write down the algorithm to be analyzed using a variant of the monadic style we discuss in section 4. For the most part, these formalizations only use discrete probability theory, with the exception of Eberl et al. (2018)’s analysis of treaps, which requires general measure theoretic probability. They report that dealing with measurability issues adds some overhead compared to pencil-and-paper reasoning, though they are able to automate many of these proofs.

Several projects have formalized results from cryptography, which also involves probabilistic reasoning (Petcher and Morrisett, 2015; Barthe et al., 2013, 2009; Blanchet, 2006). A challenge in formalizing such proofs lies in the need to establish a relation between the behavior of two different randomized algorithms, as part of the game-playing approach to cryptographic security proofs. Because cryptographic proofs generally only use discrete probability theory, these libraries do not formalize measure theoretic results.

There have been general formalizations of measure theoretic probability theory in a few proof assistants. Hurd (2003) formalized basic measure theory in the HOL proof assistant, including a proof of Carathéodory’s extension theorem. Hölzl and Heller (2011) developed a more substantial library in the Isabelle theorem prover, which has since been extended further. Avigad et al. (2014) used this library to formalize a proof of the Central Limit Theorem.

More recent work has formalized theoretical machine learning results. Selsam et al. (2017) use Lean to prove the correctness of an optimization procedure for stochastic computation graphs. They prove that the random gradients used in their stochastic backpropagation implementation are unbiased. In their proof, they add axioms to the system for various basic mathematical facts. They argue that even if there are errors in these axioms that could potentially lead to inconsistency, the process of constructing formal proofs for the rest of the algorithm still helps eliminate mistakes.

Bagnall and Stewart (2019) give machine-checked proofs of bounds on generalization errors. They use Hoeffding’s inequality to obtain bounds when the hypothesis space is finite or there is a separate test set. They apply this result to bound the generalization error of ReLU neural networks with quantized weights. Their proof is restricted to discrete distributions and adds some results from probability theory as axioms (Pinsker’s inequality and Gibbs’ inequality).

Bentkamp et al. (2019) formalize a result by Cohen et al. (2016), which shows that deep convolutional arithmetic circuits are more expressive than shallow ones, in the sense that shallow networks must be exponentially larger in order to express the same function. Although convolutional arithmetic circuits are not widely used in practice compared to other artificial neural networks, this result is part of an effort to understand theoretically the success of deep learning.

Bentkamp et al. (2019) report that they actually proved a stronger version of the original result, and doing so allowed them to structure the formal proof in a more modular way. The formalization was completed only 14 months after the original arXiv posting by Cohen et al., suggesting that once the right libraries are available for a theorem prover, it is feasible to mechanize state of the art results in some areas of theoretical machine learning in a relatively brief period of time.

A related but distinct line of work applies machine learning techniques to automatically construct formal proofs of theorems. Traditional approaches to automated theorem proving rely on a mixture of heuristics and specialized algorithms for decidable sub-problems. By using a pre-existing corpus of formal proofs, supervised learning algorithms can be trained to select hypotheses and construct proofs in a formal system (Bansal et al., 2019; Huang et al., 2019; Kaliszyk et al., 2017; Selsam and Bjørner, 2019).

## 3 Background

### 3.1 The Lean Proof Assistant

The Lean theorem prover can be viewed as both a functional programming language (like Haskell) and a foundation for mathematics, based on dependent type theory. Type theories are an alternative to Zermelo-Fraenkel set theory in which types are associated with mathematical expressions, in the same way that types can be used in programming languages, but with much stronger guarantees. Before we introduce the concept of dependent types, it is useful to consider a simple example of mathematical formalization in Lean.

Using Lean as a programming language, we can define a function double that takes a natural number as input and multiplies it by 2.

```lean
def double(n: nat): nat := 2 * n
```


This definition is similar to what one would find in any modern functional programming language. However, there is one significant difference between programming in Lean and those languages: in order to ensure that Lean is a consistent foundation for mathematics, functions cannot have side effects (printing on the screen, reading a file) and they must be proven to always terminate.

Next, we can define a predicate that formalizes the concept of an even number.

```lean
def isEven(n: nat): Prop := exists k: nat, n = 2 * k
```


This example clearly shows how Lean differs from a programming language. The function we define does not return simple data like a number or string, but instead a logical proposition that states that a natural $n$ is even if there exists a natural $k$ such that $n = 2k$.

Finally, Lean lets us specify mathematical properties and prove them. For example, the following states and proves a lemma called doubleIsEven that says that the result of double is always even:

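```lean
lemma doubleIsEven: forall n: nat, isEven (double (n)) :=
begin
  intros,          -- introduce a fixed but arbitrary n
  unfold isEven,   -- expand the definition of isEven
  unfold double,   -- expand the definition of double
  existsi n,       -- choose the witness k := n
  trivial,         -- close 2 * n = 2 * n by reflexivity
end
```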


The first line is the mathematical statement we wish to prove. What follows the “:=”, enclosed by the keywords “begin” and “end”, is a sequence of commands, called tactics, that describes the proof in a manner that Lean can check. The programmer constructs this tactic proof interactively: their IDE displays a list of current assumptions and what remains to be proved. This is represented by a sequent, which is a pair of the form $\Gamma \vdash P$, where $\Gamma$ is the list of hypotheses and variables (called the context) and $P$ is a proposition (called the target). When the proof starts, the sequent is $\vdash \forall n : \texttt{nat},\ \texttt{isEven}\ (\texttt{double}\ n)$. Executing the tactic “intros” transforms the sequent into $n : \texttt{nat} \vdash \texttt{isEven}\ (\texttt{double}\ n)$, where $n$ is now a fixed but arbitrary natural number. Executing the tactic “unfold” applied to isEven unfolds the definition of isEven to give the sequent $n : \texttt{nat} \vdash \exists k,\ \texttt{double}\ n = 2 * k$. Likewise, by unfolding double, we obtain the sequent $n : \texttt{nat} \vdash \exists k,\ 2 * n = 2 * k$. At that point, we need to exhibit a choice for $k$ that satisfies the property, which we can do using the tactic “existsi” applied to $n$, which is in the context. This gives the sequent $n : \texttt{nat} \vdash 2 * n = 2 * n$, which we can prove with the tactic called “trivial”, which checks that we have indeed reached a basic axiom (namely, that equality is reflexive).
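A tactic script is one way of producing a proof; the same statement can also be proved by writing the proof term directly. The block below is a minimal, self-contained sketch in Lean 3 that repeats the two definitions and gives such a term-mode proof. The primed lemma name is a hypothetical addition for illustration and is not part of the original development.

```lean
def double (n : nat) : nat := 2 * n

def isEven (n : nat) : Prop := exists k : nat, n = 2 * k

-- Hypothetical term-mode variant: the witness is `n` itself, and the
-- remaining equation `double n = 2 * n` holds by unfolding `double`,
-- so reflexivity (`rfl`) is enough.
lemma doubleIsEven' : forall n : nat, isEven (double n) :=
λ n, exists.intro n rfl

-- A lemma can be applied like a function to obtain a specific instance.
example : isEven (double 21) := doubleIsEven' 21
```

The term `exists.intro n rfl` plays the role of the two tactics “existsi” and “trivial”: it supplies the witness and then the proof that the witness satisfies the property.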

Dependent types allow a function’s type to be parameterized by its input. For example, assume that you have a function $\mathrm{sum}$ that sums the values of a vector of reals. That is, it takes a value $v \in \mathbb{R}^n$ for some $n$ and returns a value in $\mathbb{R}$. Informally, we might write the type of such a function as $\mathbb{R}^n \to \mathbb{R}$, but this is somewhat problematic since $n$ is a free variable. Dependent types allow us to represent the fact that the vector’s size depends on $n$, by stating that $\mathrm{sum}$ has type $\Pi\,(n : \mathbb{N}),\ \mathbb{R}^n \to \mathbb{R}$. The first variable is introduced here with the notation $\Pi$ to indicate that it is a value (a natural number) and that the rest of the type can refer to it. We use this to represent learning algorithms that take in a vector of training examples, to capture the fact that the size of the vector can vary.
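As a minimal sketch of such a dependent type in Lean 3, the block below sums a vector of naturals rather than reals, so that it needs nothing beyond core Lean. The names vec and vec_sum are illustrative assumptions; the sketch follows the convention, used later in the text, that a vector indexed by n holds n+1 values.

```lean
-- Illustrative sketch (assumed names): `vec A n` is an iterated product
-- holding n+1 values of type A.
def vec (A : Type) : ℕ → Type
| 0            := A
| (nat.succ n) := A × vec n

-- The type of `vec_sum` mentions its first argument `n`, so it is a
-- dependent function type, written with Π.
def vec_sum : Π (n : ℕ), vec ℕ n → ℕ
| 0            := λ (x : ℕ), x
| (nat.succ n) := λ (v : ℕ × vec ℕ n), v.fst + vec_sum n v.snd
```

Because the length is part of the type, applying vec_sum with the wrong index for a given vector is rejected at type-checking time rather than failing at run time.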

### 3.2 The mathlib Mathematics Library

The mathlib library is a large library of mathematical results formalized in Lean, written by a number of collaborators. Many results are stated in a highly abstract form, with special cases derived as a consequence. For example, many results about real numbers are deduced from more general facts about Archimedean fields.

The mathlib library contains most of what one needs to formalize statistical learning theory. It contains a formalization of the reals using Cauchy sequences and a significant number of results in real and complex analysis, topology, and basic set theory. More importantly, it contains a formalization of measure theory, based on Hölzl and Heller (2011)’s library from the Isabelle theorem prover.

Unfortunately, mathlib does not have a probability theory library, and in order to formalize our result, we had to develop one, as a special case of measure theory. This development accounts for about 2,500 lines of Lean formalization.

## 4 Formalizing Statistical Learning Theory with the Giry Monad

The Giry monad lets us rigorously formalize certain common informal arguments in probability theory. Before explaining the monad in more detail, let us give an example of this kind of informal argument.

In probability theory, it is common to treat a random variable, which formally is a function on the sample space, as if it were an element of its codomain. For example, let $\mu$ be a distribution over $\mathbb{N}$ and let $f$ be the function from $\mathbb{N}$ to $\mathbb{N}$ that multiplies its input by 2, that is $f(x) = 2x$. If $X$ is a sample from $\mu$, then $f(X)$ is an even number. This is trivial to prove: for any $n \in \mathbb{N}$, $2n$ is even. Here, $X$ plays two roles. As a sample, it is technically a random variable, that is, a measurable function. But when we consider $f(X)$ we are acting as if it is a natural number. However, what we really mean is to consider the random variable $f \circ X$, and then observe that for all $\omega$ in the underlying sample space, $f(X(\omega))$ is even. This is similar to what happens when considering the addition of two random variables $X$ and $Y$, which we traditionally write as just $X + Y$.

Of course, this convention is well-understood by humans, but formally, Lean would reject the expression $f(X)$ as ill-typed, because the type of $X$ is not $\mathbb{N}$. Although replacing expressions like $f(X)$ or $X + Y$ with $f \circ X$ or $\lambda \omega.\, X(\omega) + Y(\omega)$ is not so difficult, for more complicated expressions this becomes tedious and clutters the argument with technicalities.

The Giry monad solves this problem, by providing a kind of “embedded programming language” for describing stochastic procedures.

The Giry monad is a triple $(\mathcal{P}, \beta, \rho)$. For any measurable space $X$, $\mathcal{P}(X)$ is the space of probability measures over $X$. The function $\beta$, often called “bind”, is of type $\mathcal{P}(X) \times (X \to \mathcal{P}(Y)) \to \mathcal{P}(Y)$. That is, it takes a probability measure on $X$, a function that transforms values from $X$ into probability measures over $Y$, and returns a probability measure on $Y$. The function $\rho$, often called “return”, is of type $X \to \mathcal{P}(X)$. It takes a value from $X$ and returns a probability measure on $X$. These functions are defined as follows.

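$$
\begin{aligned}
\beta(\mu, f)(A) &= \int_{x \in X} f(x)(A)\, d\mu \quad \text{for } \mu \text{ a distribution over } X && (1) \\
\rho(x)(A) &= \chi_A(x) && (2)
\end{aligned}
$$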

Therefore, $\rho(x)$ is simply the Dirac delta distribution at $x$. To understand $\beta$, consider the following example. Let $\mu$ be a distribution over $X$ and consider the random variables $U \sim \mu$ and $V \sim f(U)$. For example, $f$ could be a function that, for an input $x$, returns a distribution parameterized by $x$. What is the distribution of $V$? By the sum rule of probability we have $P(V \in A) = \int_{x \in X} P(V \in A \mid U = x)\, d\mu = \int_{x \in X} f(x)(A)\, d\mu$. Therefore, $P(V \in A) = \beta(\mu, f)(A)$. That is, $\beta$ is simply computing the distribution that results from applying $f$ while marginalizing over $U$.

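Finally, note that $\beta$ and $\rho$ satisfy the following properties (known as the monad laws):

$$
\begin{aligned}
\beta(\rho(x), f) &= f(x) && \text{left identity} && (3) \\
\beta(\mu, \lambda x.\, \rho(x)) &= \mu && \text{right identity} && (4) \\
\beta(\beta(\mu, f), g) &= \beta(\mu, \lambda x.\, \beta(f(x), g)) && \text{associativity} && (5)
\end{aligned}
$$

when the functions $f$ and $g$ are measurable.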
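The two identity laws can be checked directly by unfolding definitions (1) and (2). The following is an informal derivation, written here as a sanity check rather than taken from the formalization. For the left identity, integrating against the Dirac measure $\rho(x)$ simply evaluates the integrand at $x$:

$$
\beta(\rho(x), f)(A) = \int_{y \in X} f(y)(A)\, d\rho(x) = f(x)(A).
$$

For the right identity, the integrand is the indicator function of $A$:

$$
\beta(\mu, \lambda x.\, \rho(x))(A) = \int_{x \in X} \rho(x)(A)\, d\mu = \int_{x \in X} \chi_A(x)\, d\mu = \mu(A).
$$

Both computations need the integrands to be measurable, which is where the measurability side condition enters.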

It is common to use the notation $x \leftarrow \mu;\ f(x)$ for $\beta(\mu, f)$. With this notation, the intuition is that we should think of this expression as a process that first draws a sample from $\mu$ and then applies $f$ to the sample, called $x$. Similarly, we use the notation $\mathrm{ret}\ x$ for $\rho(x)$, thinking of this Dirac distribution as the process that always returns $x$.

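As an example, let $\mu$ be a probability measure over a measurable space $X$ and let $f$ be a measurable function on $X$. We can use the Giry monad to define the pushforward of $\mu$ through $f$, denoted $f_*(\mu)$, as $x \leftarrow \mu;\ \mathrm{ret}\ f(x)$, and we can verify that

$$
\begin{aligned}
f_*(\mu)(A) &= \beta(\mu, \lambda x.\, \rho(f(x)))(A) && (6) \\
&= \int_{x \in X} \rho(f(x))(A)\, d\mu && (7) \\
&= \int_{x \in X} \chi_A(f(x))\, d\mu && (8) \\
&= \mu(f^{-1}(A)) && (9)
\end{aligned}
$$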

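More formally, given two measurable spaces $(X, \Sigma_X)$ and $(Y, \Sigma_Y)$, a probability measure $\mu$ on $X$, and a measurable function $f$ from $X$ to $Y$, $f_*(\mu)$ is a probability measure on $Y$ with the following property: for all $A \in \Sigma_Y$,

$$f_*(\mu)(A) = \mu(f^{-1}(A)).$$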

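We will make use of another important property of the Giry monad. Let $(X, \Sigma_X)$ be a measurable space and $(X^n, \Sigma_X^n)$ be the measurable space where $X^n$ is the cartesian product of $X$ ($n$ times) and $\Sigma_X^n$ is the tensor product of $\Sigma_X$ ($n$ times). Let $\mu$ be a probability measure on $X$. Then, the measure $\mu^n$ defined as

$$
\begin{aligned}
\mu^1 &= \mu && (10) \\
\mu^n &= v \leftarrow \mu^{n-1};\ \omega \leftarrow \mu;\ \mathrm{ret}\,(\omega, v) && (11)
\end{aligned}
$$

is a probability measure and a product measure on $X^n$, that is

$$\mu^n\Big(\prod_{i=1}^{n} E_i\Big) = \prod_{i=1}^{n} \mu(E_i).$$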
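For $n = 2$ the product property can be checked by hand. The following informal computation unfolds the monadic definition using the integral formulas for bind and return given earlier; it is a sketch for intuition, not the argument used in the formalization. Writing $\mu^2 = \beta(\mu, \lambda v.\, \beta(\mu, \lambda \omega.\, \rho(\omega, v)))$, we get

$$
\begin{aligned}
\mu^2(E_1 \times E_2)
&= \int_{v \in X} \Big( \int_{\omega \in X} \chi_{E_1 \times E_2}(\omega, v)\, d\mu \Big)\, d\mu \\
&= \int_{v \in X} \chi_{E_2}(v)\, \mu(E_1)\, d\mu \\
&= \mu(E_1)\, \mu(E_2).
\end{aligned}
$$

The second step uses $\chi_{E_1 \times E_2}(\omega, v) = \chi_{E_1}(\omega)\, \chi_{E_2}(v)$, so the inner integral over $\omega$ is $\mu(E_1)$ when $v \in E_2$ and $0$ otherwise.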

### 4.2 Formalizing Decision Stumps with the Giry Monad

With the Giry monad, it is easier to state formally that decision stumps are PAC learnable. We assume that the support $\mathbb{R}^{+}$ is equipped with the standard topology, and we write $\mathcal{B}$ for its Borel sets.

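Let $\mathcal{H}$ be the class of decision stumps. There exists a measurable function $\mathcal{A}$, called the learning function, and a sample complexity function $m$ such that for any probability measure $\mu$ on the measurable space $(\mathbb{R}^{+}, \mathcal{B})$, for any target function $c \in \mathcal{H}$, for any $\epsilon, \delta \in (0, 1)$, and for any $n \ge m(\epsilon, \delta)$,

$$\mathcal{A}_*(c_*(\mu^n))\,\{h \in \mathcal{H} \mid \mu\{x \in \mathbb{R}^{+} \mid h(x) \ne c(x)\} \le \epsilon\} \ge 1 - \delta.$$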

## 5 Walkthrough of the Formal Proof

We now describe the proof at a high level. The definitions and lemmas presented in this section are directly taken from the Lean formalization. We include some informal proofs of theorems when they are interesting or required unexpected effort to formalize. The complete proof can be found online at https://github.com/jtristan/stump-learnable.

### 5.1 Definitions

We use the following notation throughout the formalization to refer to the non-negative real numbers.

```lean
notation `ℍ` := nnreal
```


This set is both the set of all examples and our hypothesis class.

In normal mathematical writing, one often associates a particular mathematical structure, such as a topology or sigma-algebra, with a given set, with the convention that that structure should be used throughout. For example, when talking about continuous functions from $\mathbb{R}$ to $\mathbb{R}$, we do not constantly clarify that we mean continuous functions with respect to the topology generated by the Euclidean metric on $\mathbb{R}$.

To mimic this style of mathematical writing, Lean has a feature called “typeclasses”. A typeclass is a kind of mathematical structure (like a topology or sigma-algebra). After defining a typeclass, the user can declare instances of that typeclass, which associate a default structure with a given type. This mechanism is used throughout mathlib to associate default topologies, ring structures, and so on with particular types. For example, the command below declares an instance of the measurable_space typeclass for the type ℍ:

```lean
instance meas_ℍ: measurable_space ℍ := ...
```


where we have omitted the definition after the := sign. After this instance is declared, any time we refer to ℍ in a context where we need a sigma-algebra, this instance will be used. mathlib comes with lemmas to automatically derive instances of measurable_space from other instances. For example, if there is a topology associated with a type, we can automatically derive the Borel sigma-algebra as an instance of measurable_space for that type. We use this Borel sigma-algebra on ℍ above. Similarly, we can derive a product sigma-algebra on the product of two types A × B from existing instances for A and B, as in the following example:

```lean
instance meas_lbl: measurable_space (ℍ × bool)
```
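The way an instance for a product is derived from the instances of its factors can be illustrated without any measure theory. The sketch below uses a hypothetical typeclass, has_origin, invented purely for this illustration (it is not part of Lean or mathlib), and needs only core Lean 3.

```lean
-- Hypothetical typeclass, for illustration only: a type with a chosen "origin".
class has_origin (α : Type) := (origin : α)

-- A default instance for the naturals.
instance nat_origin : has_origin ℕ := ⟨0⟩

-- An instance for products, derived from the instances of the two factors.
instance prod_origin {α β : Type} [has_origin α] [has_origin β] :
  has_origin (α × β) :=
⟨((has_origin.origin : α), (has_origin.origin : β))⟩

-- Lean finds `prod_origin`, and `nat_origin` for each factor, by instance
-- resolution; no structure has to be named explicitly at the use site.
example : (has_origin.origin : ℕ × ℕ) = (0, 0) := rfl
```

The instance for measurable_space on a product type is found by the same kind of search: given instances for both factors, Lean assembles the product structure automatically.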


Another common part of mathematical writing is to declare at the beginning of a section of text that throughout the rest of the section, a certain variable will represent some mathematical object. For example, we might write “in this chapter, $V$ will be a real vector space”. The formal meaning of this is that we should read every subsequent result in that chapter which mentions $V$ as universally quantifying over $V$. Lean has a similar sectioning mechanism and a way to declare such variables. In our formalization, the line below declares a probability measure over the class of examples, and an arbitrary target threshold value for labeling samples:

```lean
variables (μ: probability_measure ℍ) (target: ℍ)
```


The following function labels a sample according to the target.

```lean
def label (target: ℍ): ℍ → ℍ × bool :=
λ x: ℍ, (x, rle x target)
```


where rle x target returns true if x ≤ target and false otherwise.

Finally, we can define the error event and its corresponding function.

```lean
def error_set (h: ℍ) := {x: ℍ | label h x ≠ label target x}

def error := λ h, μ (error_set h target)
```


The learning function will take a vector of labeled examples as input and output a hypothesis. Given a type A and natural n, the type vec A n represents vectors of size n+1 of values of type A. The function vec_map takes a function as an argument and applies it pointwise to the elements of the vector. We use this to label the inputs to the learning function:

```lean
def label_sample := vec_map (label target)
```


Our learning function starts by transforming any negative example to 0 and then stripping off the labels, with the following function:

```lean
def filter := vec_map (λ p: (ℍ × bool), if p.snd then p.fst else 0)
```


This is safe since, if there were no positive examples, the learning function would return 0, and it makes the formalization slightly simpler. Finally, we can define our learning function, choose, which picks the biggest positive example.

```lean
def choose (n: ℕ): vec (ℍ × bool) n → ℍ :=
λ data: (vec (ℍ × bool) n), max n (filter n data)
```


Note that the learning function is not computable and has to be declared as such in the development because mathlib’s reals are formalized as Cauchy sequences.
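To experiment with the shape of these definitions without the non-computable reals, one can write a computable analogue over the natural numbers. The block below is a sketch under that assumption: ℕ stands in for nnreal, to_bool (x ≤ target) stands in for rle, and the helper names vec, vec_map, and vec_max are written here for illustration and may differ from the definitions in the actual development.

```lean
-- Sketch only: ℕ stands in for nnreal so that everything is computable.
-- `vec A n` is an iterated product holding n+1 values of type A.
def vec (A : Type) : ℕ → Type
| 0            := A
| (nat.succ n) := A × vec n

-- Apply a function pointwise to every element of a vector.
def vec_map {A B : Type} (f : A → B) : Π (n : ℕ), vec A n → vec B n
| 0            := λ (x : A), f x
| (nat.succ n) := λ (v : A × vec A n), (f v.fst, vec_map n v.snd)

-- Largest element of a non-empty vector of naturals.
def vec_max : Π (n : ℕ), vec ℕ n → ℕ
| 0            := λ (x : ℕ), x
| (nat.succ n) := λ (v : ℕ × vec ℕ n), max v.fst (vec_max n v.snd)

-- Label a sample: true exactly when it is at most the target threshold.
def label (target : ℕ) : ℕ → ℕ × bool :=
λ x, (x, to_bool (x ≤ target))

-- Replace negative examples by 0 and strip the labels.
def filter (n : ℕ) : vec (ℕ × bool) n → vec ℕ n :=
vec_map (λ p : ℕ × bool, if p.snd then p.fst else 0) n

-- The learner: the largest positively labeled example (0 if there is none).
def choose (n : ℕ) : vec (ℕ × bool) n → ℕ :=
λ data, vec_max n (filter n data)
```

For instance, with target 5 and the samples 3, 7, 4, the positively labeled examples are 3 and 4, so filtering yields 3, 0, 4 and the learner returns the threshold 4.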

### 5.2 Measurability

About a quarter of the formalization consists in proving that various sets and functions are measurable. The predicate is_measurable S states that the set S is a measurable set, while measurable f states that the function f is measurable. These proofs can be long but are generally routine, with a few notable exceptions. The first is the proof that the function that computes the error rate of a hypothesis is measurable:

```lean
lemma error_measurable: measurable (error μ target)
```


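First, note that if h ≤ target then error μ target h = $\mu(h, \mathrm{target}]$. Likewise, if h > target then error μ target h = $\mu(\mathrm{target}, h]$. Therefore, proving that error μ target is a measurable function amounts to proving that functions mapping an endpoint to the measure of an interval, such as $h \mapsto \mu(h, \mathrm{target}]$, are measurable. This is a standard result in measure theory, but it was missing from mathlib.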

Next, one must show that the learning function is measurable, after fixing the number of input examples:

```lean
lemma choose_measurable: measurable (choose n)
```


To prove that choose is a measurable function, we must prove that max is a measurable function. Because max is continuous, it is Borel measurable.

Although the previous proof is straightforward, it hinges on the fact that the sigma-algebra structure we associate with vec nnreal n is the Borel sigma-algebra. But, because we define a vector as an iterated product, another possible sigma-algebra structure for vec nnreal n is the n-ary product sigma algebra.

Recall from the previous section that our development uses Lean’s typeclass mechanism to automatically associate product sigma algebras with product spaces, and Borel sigma-algebras with topological spaces. As the preceding paragraph explains, for vec nnreal n there are two possible choices. Which choice should be used? In programming languages with typeclasses, the problem of having to select between two potentially different instances of a typeclass is called a coherence problem. Of course, this same potential ambiguity arises in normal mathematical writing, when we omit mentioning the associated sigma-algebra.

Fortunately, in the case of vec nnreal n, these two sigma-algebras happen to be the same. In general, if $X$ and $Y$ are second-countable topological spaces, then the Borel sigma-algebra on $X \times Y$ is equal to the product of the Borel sigma-algebras on $X$ and $Y$. Thus, although the proof of measurability for max can be simple, it uses a subtle fact that resolves the ambiguity involved in referring to sets without constantly mentioning their sigma-algebras.

### 5.3 Decision Stumps are PAC Learnable

With these preliminaries dealt with, we are now ready to formalize a (corrected) form of the proof of PAC learnability that we sketched in the introduction. We begin by defining a function for the sample complexity that we will establish:

def complexity (ε: ℝ) (δ: ℝ) : ℝ := (log(δ) / log(1 - ε)) - (1: nat)
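
As a rough worked example of this bound, assume the two arguments are the accuracy $\epsilon$ and the confidence $\delta$, so that the definition reads $\log(\delta)/\log(1-\epsilon) - 1$. The values $\epsilon = 0.1$ and $\delta = 0.05$ below are illustrative assumptions, not taken from the formalization, and natural logarithms are used (the ratio does not depend on the base):

```latex
\begin{align*}
\frac{\log \delta}{\log(1-\epsilon)} - 1
  &= \frac{\log 0.05}{\log 0.9} - 1 \\
  &\approx \frac{-2.9957}{-0.10536} - 1 \\
  &\approx 27.43
\end{align*}
% Any n > 27.43, that is n >= 28, satisfies the hypothesis on n.
% Sanity check with n = 28: (0.9)^{29} \approx 0.0471 \le 0.05,
% whereas n = 27 gives (0.9)^{28} \approx 0.0523 > 0.05.
```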


We then use the Giry monad to describe the measure on hypotheses that results from running the algorithm:

def denot: probability_measure ℍ :=
let η := vec.prob_measure n μ in
let ν := map (label_sample target n) η in
let γ := map (choose n) ν in
γ
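
Read as ordinary measure theory, this definition is an iterated pushforward. The following is a sketch under two assumptions that are not spelled out in this section: that map applied to a function and a measure denotes the pushforward measure, and that vec.prob_measure n applied to the underlying distribution $\mu$ is the product measure on vec nnreal n. The symbols $S$, $f_{*}$ and $\mu^{\otimes}$ are introduced only for this illustration.

```latex
% Pushforward of a measure \nu along a measurable function f:
\[
  (f_{*}\nu)(A) = \nu\bigl(f^{-1}(A)\bigr)
\]
% Unfolding the two uses of map in denot, for a measurable set A of hypotheses:
\[
  \mathrm{denot}(A)
    = \mu^{\otimes}\bigl(\{\, S \mid
        \mathrm{choose}_n(\mathrm{label\_sample}_{\mathit{target},n}(S)) \in A \,\}\bigr)
\]
% Here \mu^{\otimes} stands for the (assumed) product measure on samples S.
```

Seen this way, the measurability lemmas for choose and for the labeling function are what make each preimage a measurable set.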



Our goal then is to prove the following theorem:

theorem choose_PAC:
∀ ε: nnreal, ∀ δ: nnreal, ∀ n: ℕ,
ε > 0 → ε < 1 → δ > 0 → δ < 1 →
(n: ℝ) > (complexity ε δ) →
(denot μ target n) {h: ℍ | @error μ target h ≤ ε} ≥ 1 - δ


As our proof sketch suggests, we will need to divide up the sample space into various intervals. If $a$ and $b$ are two reals, then we write Ioo a b, Ioc a b, Ico a b, Icc a b in Lean to refer to the intervals $(a, b)$, $(a, b]$, $[a, b)$, and $[a, b]$, respectively. To start the proof, we first observe that the error probability of a hypothesis is the measure of the interval between it and the target.

lemma error_interval_1:
∀ h, h ≤ target → error μ target h = μ (Ioc h target)
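
To see why an identity of this shape holds, here is a short derivation. It assumes (consistently with the description of choose, but not stated in this section) that a stump with threshold $t$ labels a point $x$ positive exactly when $x \le t$, and that the error of $h$ is the probability under $\mu$ that $h$ and the target disagree:

```latex
\begin{align*}
\mathrm{error}(h)
  &= \mu\{\, x \mid (x \le h) \ne (x \le \mathit{target}) \,\} \\
  &= \mu\{\, x \mid h < x \le \mathit{target} \,\}
     && \text{since } h \le \mathit{target} \\
  &= \mu\bigl((h, \mathit{target}]\bigr)
\end{align*}
```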


We then show that our complexity function has the following property:

lemma complexity_enough:
∀ ε: nnreal, ∀ δ: nnreal, ∀ n: ℕ,
ε > (0: nnreal) → ε < (1: nnreal) → δ > (0: nnreal) → δ < (1: nnreal) →
(n: ℝ) > (complexity ε δ) → ((1 - ε)^(n+1)) ≤ δ
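
The reason a bound of this form suffices can be seen from a short calculation. This is a pencil-and-paper sketch, not the Lean proof. It assumes $0 < \epsilon < 1$ and $0 < \delta < 1$, so that $\log(1-\epsilon) < 0$, and it reads the complexity function as $\log(\delta)/\log(1-\epsilon) - 1$:

```latex
\begin{align*}
n > \frac{\log \delta}{\log(1-\epsilon)} - 1
  &\iff n + 1 > \frac{\log \delta}{\log(1-\epsilon)} \\
  &\iff (n+1)\,\log(1-\epsilon) < \log \delta
     && \text{(multiplying by } \log(1-\epsilon) < 0 \text{ flips the inequality)} \\
  &\iff (1-\epsilon)^{n+1} < \delta
     && \text{(exponentiating both sides)}
\end{align*}
```

The strict inequality obtained this way implies the non-strict bound $(1-\epsilon)^{n+1} \le \delta$.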


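Recall from the informal sketch in the introduction that the proof hinged around constructing a point $\theta$ such that $\mu((\theta, \mathit{target}]) = \epsilon$. As we discussed there, such a $\theta$ may not exist. Instead, we must construct $\theta$ so that $\mu([\theta, \mathit{target}]) \ge \epsilon$ and $\mu((\theta, \mathit{target}]) \le \epsilon$: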

theorem extend_to_epsilon_1:
∀ ε: nnreal, ε > 0 →
μ (Ioc 0 target) > ε →
∃ θ: nnreal, μ (Icc θ target) ≥ ε ∧ μ (Ioc θ target) ≤ ε


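We take $\theta$ to be $\sup \{x \mid \mu([x, \mathit{target}]) \ge \epsilon\}$. The supremum exists because the set in question is bounded and contains $0$, so it is non-empty. To see that $\mu([\theta, \mathit{target}]) \ge \epsilon$, we can construct an increasing sequence of points $x_n$ such that $\lim_{n \to \infty} x_n = \theta$, with the property that $\mu([x_n, \mathit{target}]) \ge \epsilon$ for every $n$. We have then that: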

$$\bigcap_{i} [x_i, \mathit{target}] = [\theta, \mathit{target}]$$

Because measures are continuous from above, it follows that

$$\mu([\theta, \mathit{target}]) = \lim_{n \to \infty} \mu([x_n, \mathit{target}]) \ge \epsilon$$

The proof that $\mu((\theta, \mathit{target}]) \le \epsilon$ is similar, using continuity from below.
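
For concreteness, the continuity-from-below argument can be sketched as follows. This is a reconstruction, not the Lean proof. It assumes that $\theta$ is the supremum of the points $x$ with $\mu([x, \mathit{target}]) \ge \epsilon$, that $\theta < \mathit{target}$, and that $(y_n)$ is a decreasing sequence with $y_n > \theta$ and $y_n \to \theta$. The name $y_n$ is introduced here only for illustration.

```latex
% Each y_n lies strictly above the supremum \theta, so
\[
  \mu\bigl([y_n, \mathit{target}]\bigr) < \epsilon \quad \text{for every } n.
\]
% The closed intervals increase to the half-open interval:
\[
  \bigcup_{n} [y_n, \mathit{target}] = (\theta, \mathit{target}]
\]
% Continuity from below then gives
\[
  \mu\bigl((\theta, \mathit{target}]\bigr)
    = \lim_{n \to \infty} \mu\bigl([y_n, \mathit{target}]\bigr) \le \epsilon .
\]
```

When $\theta = \mathit{target}$, the interval $(\theta, \mathit{target}]$ is empty and the bound holds trivially.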

With this lemma, we can finish the proof:

Proof of choose_PAC.

Since choose selects a hypothesis h which is the largest of the positive examples, by error_interval_1 we just have to show that at least one of these examples will be close enough to target. Here, we split into two cases. When $\mu((0, \mathit{target}]) \le \epsilon$, the error rate bound is trivial to establish, since choose has to return something in the interval $[0, \mathit{target}]$. For the other case, when $\mu((0, \mathit{target}]) > \epsilon$, we can apply extend_to_epsilon_1 to get a suitable $\theta$. Because $\mu((\theta, \mathit{target}]) \le \epsilon$, for the error rate to be $> \epsilon$, all positive training examples must be less than $\theta$. But since $\mu([\theta, \mathit{target}]) \ge \epsilon$, the probability that an example is negatively labeled or less than $\theta$ is at most $1 - \epsilon$. Recall that the training input type, vec nnreal n, is a vector of length $n + 1$. Therefore, the probability that all examples are negatively labeled or less than $\theta$ is at most $(1 - \epsilon)^{n+1}$. Applying complexity_enough, we have that $(1 - \epsilon)^{n+1} \le \delta$, so we are done.
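
The counting argument in the second case can be summarized as a chain of inequalities. This is a sketch of the pencil-and-paper reasoning, not the Lean proof. It assumes that the $n+1$ examples $x_0, \dots, x_n$ are drawn independently from $\mu$, and that $\theta$ is the point provided by extend_to_epsilon_1, so that $\mu([\theta, \mathit{target}]) \ge \epsilon$ and $\mu((\theta, \mathit{target}]) \le \epsilon$. The names $x_i$ and $\Pr$ are introduced here only for illustration.

```latex
\begin{align*}
\Pr\bigl[\mathrm{error}(h) > \epsilon\bigr]
  &\le \Pr\bigl[\forall i,\; x_i \notin [\theta, \mathit{target}]\bigr] \\
  &= \bigl(1 - \mu([\theta, \mathit{target}])\bigr)^{n+1} \\
  &\le (1 - \epsilon)^{n+1} \\
  &\le \delta
\end{align*}
```

The first step holds because any example in $[\theta, \mathit{target}]$ is labeled positive and forces $h \ge \theta$, which gives $\mathrm{error}(h) \le \mu((\theta, \mathit{target}]) \le \epsilon$. Taking complements yields the bound $1 - \delta$ on the probability that the error is at most $\epsilon$.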

## 6 Conclusion

We have presented a machine-checked, formal proof of PAC learnability of the concept class of decision stumps. The proof is formalized using the Lean theorem prover. We explained and used the Giry monad to keep the formalization simple and close to a pencil and paper proof. To formalize this proof, we specialized the measure theory formalization of the mathlib library to the necessary basic probability theory. As expected, the formalization is at times subtle when we must consider topological or measurability results, mostly to prove that the learning algorithm and the generalization error are measurable functions. The most technical part of the proof has to do with proving the existence of an interval with the appropriate measure, a detail that previous proofs either ignore or get wrong.

Our work shows that the Lean prover and the mathlib library are mature enough to tackle a simple but classic result in statistical learning theory. A next step would be to formally prove more general results from VC theory.

We would like to acknowledge support for this project from the National Science Foundation (NSF grant IIS-9988642) and the Multidisciplinary Research Program of the Department of Defense (MURI N00014-00-1-0637).

## References

• Avigad et al. (2014) Jeremy Avigad, Johannes Hölzl, and Luke Serafin. A formally verified proof of the Central Limit Theorem. CoRR, abs/1405.7012, 2014. URL http://arxiv.org/abs/1405.7012.
• Bagnall and Stewart (2019) Alexander Bagnall and Gordon Stewart. Certifying the true error: Machine learning in Coq with verified generalization guarantees. In AAAI’19: The Thirty-Third AAAI Conference on Artificial Intelligence, 2019.
• Bansal et al. (2019) Kshitij Bansal, Sarah Loos, Markus Rabe, Christian Szegedy, and Stewart James Wilcox. HOList: An environment for machine learning of higher order logic theorem proving. In Thirty-sixth International Conference on Machine Learning (ICML), 2019.
• Barthe et al. (2009) Gilles Barthe, Benjamin Grégoire, and Santiago Zanella Béguelin. Formal certification of code-based cryptographic proofs. In POPL, pages 90–101, 2009.
• Barthe et al. (2013) Gilles Barthe, François Dupressoir, Benjamin Grégoire, César Kunz, Benedikt Schmidt, and Pierre-Yves Strub. Easycrypt: A tutorial. In Foundations of Security Analysis and Design VII - FOSAD 2012/2013 Tutorial Lectures, pages 146–166, 2013.
• Bentkamp et al. (2019) Alexander Bentkamp, Jasmin Christian Blanchette, and Dietrich Klakow. A formal proof of the expressiveness of deep learning. Journal of Automated Reasoning, 63(2):347–368, 2019.
• Blanchet (2006) Bruno Blanchet. A computationally sound mechanized prover for security protocols. In 2006 IEEE Symposium on Security and Privacy (S&P 2006), 21-24 May 2006, Berkeley, California, USA, pages 140–154, 2006.
• Blumer et al. (1989) Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965, 1989.
• Cohen et al. (2016) Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016, pages 698–728, 2016.
• Dudley (2014) R. M. Dudley. Uniform Central Limit Theorems. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2nd edition, 2014.
• Eberl et al. (2018) Manuel Eberl, Max W. Haslbeck, and Tobias Nipkow. Verified analysis of random trees. In ITP, 2018.
• Giry (1982) Michèle Giry. A categorical approach to probability theory. In B. Banaschewski, editor, Categorical Aspects of Topology and Analysis, volume 915 of Lecture Notes in Mathematics, pages 68–85, 1982.
• Gonthier (2008) Georges Gonthier. Formal proof–the four-color theorem. Notices of the AMS, 55(11):1382–1393, December 2008.
• Gonthier et al. (2013) Georges Gonthier, Andrea Asperti, Jeremy Avigad, Yves Bertot, Cyril Cohen, François Garillot, Stéphane Le Roux, Assia Mahboubi, Russell O’Connor, Sidi Ould Biha, et al. A machine-checked proof of the odd order theorem. In International Conference on Interactive Theorem Proving, pages 163–179. Springer, 2013.
• Hales et al. (2017) Thomas Hales, Mark Adams, Gertrud Bauer, Tat Dat Dang, John Harrison, Hoang Le Truong, Cezary Kaliszyk, Victor Magron, Sean McLaughlin, Tat Thang Nguyen, et al. A formal proof of the Kepler conjecture. Forum of Mathematics, Pi, 5, 2017.
• Hales (2008) Thomas C Hales. Formal proof. Notices of the AMS, 55(11):1370–1380, December 2008.
• Hölzl and Heller (2011) Johannes Hölzl and Armin Heller. Three chapters of measure theory in Isabelle/HOL. In ITP, pages 135–151, 2011.
• Huang et al. (2019) Daniel Huang, Prafulla Dhariwal, Dawn Song, and Ilya Sutskever. Gamepad: A learning environment for theorem proving. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019.
• Hurd (2003) Joe Hurd. Formal Verification of Probabilistic Algorithms. PhD thesis, Cambridge University, May 2003.
• Kaliszyk et al. (2017) Cezary Kaliszyk, François Chollet, and Christian Szegedy. HolStep: A machine learning dataset for higher-order logic theorem proving. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
• Kearns and Vazirani (1994) Michael J Kearns and Umesh Virkumar Vazirani. An introduction to computational learning theory. MIT press, 1994.
• Kohli et al. (2019) Pushmeet Kohli, Krishnamurthy Dvijotham, Jonathan Uesato, and Sven Gowal. Identifying and eliminating bugs in learned predictive models, 2019.
• Leroy (2009) Xavier Leroy. Formal verification of a realistic compiler. Communications of the ACM, 52(7):107–115, 2009.
• Mohri et al. (2018) Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2018.
• Petcher and Morrisett (2015) Adam Petcher and Greg Morrisett. The foundational cryptography framework. In POST, pages 53–72, 2015.
• Rehmeyer (2013) Julie Rehmeyer. Voevodsky’s mathematical revolution. Blog of Scientific American, October 2013.
• Selsam and Bjørner (2019) Daniel Selsam and Nikolaj Bjørner. Guiding high-performance SAT solvers with unsat-core predictions. In Theory and Applications of Satisfiability Testing - SAT 2019 - 22nd International Conference, SAT 2019, Lisbon, Portugal, July 9-12, 2019, Proceedings, pages 336–353, 2019.
• Selsam et al. (2017) Daniel Selsam, Percy Liang, and David Dill. Developing bug-free machine learning systems with formal mathematics. In International Conference on Machine Learning (ICML), 2017.
• Shalev-Shwartz and Ben-David (2014) Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
• Tassarotti and Harper (2018) Joseph Tassarotti and Robert Harper. Verified tail bounds for randomized programs. In ITP, 2018.
• van der Weegen and McKinna (2008) Eelis van der Weegen and James McKinna. A machine-checked proof of the average-case complexity of quicksort in Coq. In TYPES, pages 256–271, 2008.