A Constructive, Type-Theoretic Approach to Regression via Global Optimisation

06/23/2020 · by Dan R. Ghica, et al.

We examine the connections between deterministic, complete, and general global optimisation of continuous functions and a general concept of regression from the perspective of constructive type theory via the concept of 'searchability'. We see how the property of convergence of global optimisation is a straightforward consequence of searchability. The abstract setting allows us to generalise searchability and continuity to higher-order functions, so that we can formulate novel convergence criteria for regression, derived from the convergence of global optimisation. All the theory and the motivating examples are fully formalised in the proof assistant Agda.

1 Introduction

For a given objective function and a set of constraints (equalities, inequalities, or arbitrary conditions), the central goal of global optimisation is to compute, with mathematical guarantees, the global minimum of the objective subject to the constraints. Global optimisation has numerous obvious applications in all areas of engineering and computational sciences, as it gives a general recipe for solving problems of arbitrary complexity. As an area of research, the study of global optimisation algorithms is mature, with a recent survey indicating more than twenty textbooks and research monographs in the last few decades [14].

Global optimisation algorithms fall under several categories, but in this paper we will focus on algorithms that are:

General:

Algorithms may take into account information about the shape of the function. For example, the minimisation of functions with convex envelopes is intensively studied [30]. In contrast, we will make minimal assumptions of this nature.

Complete:

An incomplete algorithm makes no guarantees regarding the quality of the solution it arrives at, focussing on efficiency via sophisticated heuristics rather than correctness. The typical example of an incomplete algorithm is gradient descent, which will only find a local minimum of a function [25]. In contrast, we provide mathematical guarantees that a solution is indeed optimal within some margin of error.

Deterministic:

A randomized algorithm can offer an asymptotic guarantee that the optimum is reached, with probability one, without actually knowing when it has been reached [29]. In contrast, we will give strong termination guarantees for the algorithm.

Continuous:

Many global optimisation algorithms deal with discrete problems, such as branch-and-bound [16]. In contrast, we will focus on the minimisation of continuous functions.

To summarise, in this paper we will concentrate on general, complete, continuous, deterministic global search, which finds one guaranteed optimal-within-epsilon global minimum of a continuous function [20]. In the sequel, this is precisely what we mean by 'optimisation'.

The first important results in the area relevant to our work appeared in the 1960s and 70s: the optimisation of rational functions using interval arithmetic by Moore and Young [18], which was then generalised to Lipschitz-continuous functions by Piyavskii [23]. The idea of the algorithm is rather simple. By splitting the domain of the function into intervals, we impose a certain degree of precision on the horizontal axis. The Lipschitz constant will then bound the growth of the function on each interval, thus allowing us to calculate a precision on the vertical axis. In effect, we can 'discretise' the function with known precision along both input and output, which makes the problem decidable. It also allows the application of efficient discrete algorithms such as branch-and-bound to continuous optimisation [28].

One of the important and immediate applications of optimisation is regression, broadly construed: finding some parameters for a model so that a target error (loss) function is minimised. This connection is so intuitive and obvious that it is rather surprising that it is not expressed more emphatically in the literature. This broad formulation of regression captures not just conventional regression problems (linear regression, polynomial regression, etc.) but virtually all machine learning algorithms that are sometimes referred to as 'curve fitting' [21].

A new approach: searchable types

The inspiration for our new approach to global search and regression is in earlier work on 'searchability' [9, 10], concerning the construction of algorithms ('selection functions') for finding elements in compact spaces satisfying a (computable) predicate. Finite sets are trivially compact, and so are trivially searchable. However, certain infinite sets are also searchable by Tychonoff's theorem, which states that the product space of any set of compact spaces is itself compact. The countably infinite product of a set is its space of infinitary sequences, so these infinitary sequences are, in a certain sense, searchable, which is somewhat surprising. This development is particularly interesting in the context of constructive real numbers, as the computable elements of compact real intervals can be represented as infinitary sequences of digits drawn from a finite set. In this work, a constructive Tychonoff-style theorem is utilised to search these representation spaces of constructive real numbers relative to certain explicit continuity conditions.
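
To convey the flavour of searchability, the following is an illustrative Haskell sketch of a selection function on infinite binary sequences, in the style of Escardó's searchable sets. The paper's own development is in Agda with explicit continuity conditions; here laziness plays that role, and the names are ours, not the paper's.

```haskell
-- A sketch, not the paper's code: search over infinite binary sequences.
type Cantor = [Bool]   -- used as infinite, lazily produced sequences

-- 'find p' returns a sequence satisfying p whenever one exists; it terminates
-- for predicates that only inspect finitely many digits of their argument.
find :: (Cantor -> Bool) -> Cantor
find p = b : find (\rest -> p (b : rest))
  where
    b = p (True : find (\rest -> p (True : rest)))

-- Decide whether a (continuous, decidable) predicate has a witness at all.
exists :: (Cantor -> Bool) -> Bool
exists p = p (find p)

-- Example: is there a sequence whose 3rd and 7th digits agree?  (Trivially yes.)
example :: Bool
example = exists (\a -> a !! 3 == a !! 7)
```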

Contributions

Our paper establishes new connections between several areas: global optimisation, regression, searchable types, and constructive real numbers. This is the most important contribution.

Our paper also makes a technical contribution to the study of searchable types by adding an explicit requirement of continuity to the key theorems, which allows us to formulate our key proofs in a way that is compatible with a proof assistant based on constructive type theory, namely Agda. This means that our proofs are fully formalised in their entirety.

Another significant contribution is a more general methodological perspective on global optimisation and especially regression. In fact, the bulk of our paper is spent on regression, as formulated in our type-theoretic framework for searchability. The advantage of the type-theoretic framework is that we can generalise the formulation of convergence of global search to the more general concept of searching on -types, our own version of searchable types.

Our first result is straightforward (Thm. 2): regression can be formulated as a global minimisation problem, which has a deterministic, optimal-within-epsilon solution. However, we note that this is not an actual convergence property for regression, of the kind the Weierstrass theorem provides for interpolation (see [22] for an informal survey of this issue). Regression, unlike interpolation, relies on a prior assumption about the model which, if wrong, prevents convergence no matter how precisely we calculate the parameters. So Thm. 2 only states that a solution converges on a 'best guess'.

Thus, what we give is a theorem which states, in a general setting, what it means for a regression algorithm to converge absolutely. We distinguish between ‘perfect’ models, which are the same as the function we aim to model (the ‘oracle’), provided some parameters are given the right values, and ‘imperfect’ models in which that is not the case. One of the challenges here is to formulate the right notions of approximation between models, not just between parameters. The requisite functions, namely a loss function between models and a distortion function from models to models, are higher order. The abstract type-theoretic setting is essential here in formulating the right notions of continuity which make the theorems true.

The most general versions of convergence are Thm. 4 and Thm. 5, which characterise the convergence of regression for an 'imperfect' model. Informally, the former says that whenever the imperfect model and the oracle are 'approximately equal' the parameters of the model can be computed so that the error between the model and the oracle is approximately the same as the error introduced by the distortion function. The latter says that if the loss between a distorted oracle and the oracle is below some bound then so is the loss between the regressed model and the distorted oracle. Both theorems capture the same idea: the error introduced by a 'bad guess' of a model bounds the error between the regressed model and the oracle. As an immediate consequence (Thm. 3), if the model is perfect (i.e. the distortion function is the identity) then the loss between the oracle and the model converges on zero.

We give some examples, mainly to show that the definitions we provide (-types and continuity) can accommodate standard cases.

The framework that we have built for this perspective is formalised in the Agda programming language, which allows us to give computable (but practically inefficient) algorithms for our version of optimisation.

2 Technical preliminaries

2.1 Formal proofs

To maintain a high assurance of correctness, all our main results and most of our examples are proved formally using Agda [4]. The proofs can be found online at https://github.com/tnttodda/RegressionInTypeArxiv. We use certain options to ensure a high standard of consistency and compatibility. The 'safe' option of Agda disables features that may lead to possible inconsistencies, such as type-in-type or experimental or exotic constructs. This option also prevents the local disabling of termination checking. It is our explicit requirement of continuity conditions that allows all proofs to go through without violating termination, unlike prior proofs in the literature [13]. We also turn off the K axiom to ensure compatibility with type theories that are incompatible with 'uniqueness of identity' proofs, such as homotopy type theory. Finally, using the 'exact split' clause we force the type-checker to require that all clauses in a definition hold as definitional equalities. Our proofs require several basic types and related properties found in Escardó's TypeTopology library, https://github.com/martinescardo/TypeTopology.

The bulk of the proofs of this section are in the SearchableTypes module, which contains annotations cross-referenced against this text. To make the presentation accessible to readers without a background in Agda the mathematical statements in our paper are formulated in a conventional, informal yet rigorous, mathematical vernacular. To aid the readers who are interested in formal proof details each mathematical proof is labelled with the Agda function formalising it.

2.2 -types

This section concerns the definition and properties of '-types', which develop Escardó's concept of searchable types. These types define the spaces in which regression can take place.

Definition 1 (SearchableTypes.ST-Type).

An -type is defined inductively as a finite non-empty type, the product of two -types, or the type of infinitary sequences over an -type.

The key technical challenge of our approach is to define a notion of (uniform) continuity for -types, where continuity of a function is broadly understood as 'finite amounts of output only require finite amounts of input'. In this context, whenever we deal with infinite data the precision of our observation comes into play. In the case of -types, infinite data comes from the sequence types, which leads to a natural notion of precision: observing only a finite prefix of a sequence. This notion of equivalence induces the usual ultrametric on such sequences, from which we can derive a reliable definition of uniform continuity.

We generalise this intuitive notion of precision to -types as follows. First, the way we measure precision depends on the type at which we measure it; we call the type of precisions for a given type its exactness type (the elements of this type are precisions). For finite data we do not afford degrees of precision, so its exactness type is the unit type. For product types we take the product of the two exactness types point-wise. Finally, for sequence types we record a natural number, which is the prefix length observed at that level, paired with a precision for the underlying -type.

Definition 2 (SearchableTypes.ST-Moduli).

The exactness type of an -type is defined inductively as:

  • The exactness type of a finite set is the unit type.

  • The exactness type of a finite product of -types is the product of their exactness types.

  • The exactness type of a sequence type over an -type is the product of the natural numbers with the exactness type of that -type.

Precision as defined above can be used to qualify equality between elements of -types. For finite data equality is not qualified by precision, and for products it is taken component-wise. For sequences, equality with a given precision is interpreted as observing only a prefix of the sequence, of the length recorded in the precision, with each element of the prefix observed up to the precision given for the underlying -type.

Definition 3 (SearchableTypes.ST-).
  • Two elements of a finite type are said to be equal with precision just if they are equal, for any precision.

  • Two elements of a product of -types are equal with a given precision if their respective projections are equal with the corresponding component precisions.

  • Two sequences over an -type are equal with a given precision if all elements in the prefixes of the recorded length are equal with the recorded element-wise precision.

Note that in the definition above the type of the precision depends on the -type, as spelled out in Def. 2. When two elements are equal with a given precision we say they are approximately equal at that precision.
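
The following is an illustrative Haskell sketch of Defs. 2 and 3; the actual definitions are the Agda ones in the SearchableTypes module, and the encoding and names below are ours.

```haskell
-- A toy encoding of -type values and their exactness types (precisions).
data Val  = Fin Int | Pair Val Val | Seq [Val]        -- Seq holds a lazy, infinite list
data Prec = Triv | PPair Prec Prec | PSeq Int Prec    -- Def. 2: unit, product, nat x precision

-- Def. 3: equality with precision; sequences are only inspected on a prefix.
approxEq :: Prec -> Val -> Val -> Bool
approxEq Triv        (Fin x)    (Fin y)    = x == y
approxEq (PPair p q) (Pair a b) (Pair c d) = approxEq p a c && approxEq q b d
approxEq (PSeq n p)  (Seq xs)   (Seq ys)   =
  and (zipWith (approxEq p) (take n xs) (take n ys))
approxEq _ _ _ = False   -- mismatched shapes; ruled out by typing in the Agda version
```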

The concept of 'equality with precision' can be adapted to predicates, as logical equivalence with precision, in the obvious way (formally, SearchableTypes.ST-).

The following properties are immediate.

Proposition 1 (SearchableTypes.ST--EquivRel).

Equality with precision is an equivalence relation.

So, immediately, equality implies equality with precision, for any precision.

We are now in a position to introduce continuity for predicates on -types. A predicate is said to be continuous if its argument only needs to be examined up to some precision in order to yield an answer. Obviously, the type of that precision is the exactness type of the argument.

Definition 4 (SearchableTypes.continuous).

We say that a predicate on an -type is continuous if there exists a precision in the exactness type of that -type such that, whenever two arguments are equal with that precision, the predicate holds of one if and only if it holds of the other.

We call this precision the modulus of continuity (MoC) of the predicate.

The same intuition applies to functions.

Definition 5 (SearchableTypes.continuous).

A function between -types is said to be continuous if for any precision in the exactness type of the codomain there exists a precision in the exactness type of the domain such that arguments equal with the latter precision are mapped to results equal with the former.

We call that domain precision the MoC of the function for the given codomain precision.

Note that types of functions between -types are not themselves -types, but they are an important class of types, which we shall call oracle types.

Certain helpful properties of continuity are immediate:

Proposition 2 (SearchableTypes.all--preds-continuous).

All predicates and functions on finite types are continuous.

Proposition 3 (SearchableTypes.-continuous).

If two composable functions are continuous then so is their composition.

We are now ready to introduce the concept of searchability.

A predicate is said to be detachable if it is always decidable, i.e. either it or its negation holds. Note that a detachable predicate is essentially a function to a two-element type, i.e. Booleans.

Definition 6 (SearchableTypes.searcher).

A searcher on an -type is a function which, given a detachable and continuous predicate on that type, returns a witness element for which the predicate holds, if such an element exists.

Since the searcher is a total, well-defined function, it will always return an element of the type even if a witness, i.e. an element satisfying the predicate, does not exist. In that case the searcher will just return some arbitrary element.

Remark 1.

In the Agda code the definition above has two parts, also involving SearchableTypes.search-condition, which spells out what it means for a witness to satisfy the predicate.

We will usually denote a searcher by .

Definition 7 (SearchableTypes.continuous-searcher).

A searcher on an -type is said to be continuous if, whenever given two predicates which are equivalent with a given precision, it returns witnesses which are equal with the corresponding precision.

An -type is said to be continuously searchable if any continuous and detachable predicate on it has a continuous searcher.

We are now building towards the main theorem of this section, that all -types are in fact continuously searchable.

Lemma 1 (SearchableTypes.finite-ST-searchable).

All finite non-empty types are continuously searchable.

Proof.

In the case of finite (non-empty) types we use induction on the size of the type. For singletons the proof is immediate, with the searcher always returning the unique element. The continuity of this searcher and the fact that it is a proper searcher are immediate. In the inductive case, given a searcher for a type of a given size and some predicate, we construct a new searcher for the type extended with one additional element, which behaves like the old searcher if that finds a witness and returns the additional element otherwise.

Checking that this is a continuous searcher is laborious but routine. ∎
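
The following is an illustrative Haskell sketch of the searcher of Lemma 1, with a finite non-empty type modelled as a non-empty list of its elements; it is not the Agda proof term.

```haskell
-- Searcher for a finite non-empty type given as a list of candidates.
searchFinite :: [a] -> (a -> Bool) -> a
searchFinite xs p = case filter p xs of
  (w : _) -> w         -- a witness exists: return the first one found
  []      -> head xs   -- no witness: any element will do (list assumed non-empty)
```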

Lemma 2 (SearchableTypes.product-ST-searchable).

The product of two continuously searchable -types is continuously searchable.

Proof.

In the case of the product of two searchable -types, we need to construct a searcher which, for a given predicate on the product, returns a pair as witness. Let each component type come with its own searcher. The first component of the witness is computed by searching for a value such that the predicate can be satisfied by some choice of second component, and the second component is then computed by searching against that fixed first component.

These computations are obviously continuous, and the formal proof is straightforward. Verifying that these values satisfy the conditions of a correct searcher is laborious but routine. ∎
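
The following is an illustrative Haskell sketch of this product construction (the classic 'product of selection functions'), with names of our own choosing.

```haskell
type Searcher a = (a -> Bool) -> a

searchPair :: Searcher a -> Searcher b -> Searcher (a, b)
searchPair sa sb p = (x0, y0)
  where
    -- choose the first component assuming the second will then be chosen optimally
    x0 = sa (\x -> p (x, sb (\y -> p (x, y))))
    -- then choose the second component for that first component
    y0 = sb (\y -> p (x0, y))
```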

Remark 2.

In the previous two lemmas the details of checking that the defined searchers meet the required conditions are intricate, but they are also routine in a way that our proof assistant (Agda) makes easy. Because of this, our reliance on a proof assistant is not onerous but in fact beneficial, improving the productivity of the mathematics.

The previous two lemmas are perhaps unsurprising, since finite types and binary products can be searched exhaustively and component-wise, respectively. The surprising fact is that the type of infinitary sequences satisfies the same property.

Before we proceed to the main result, we note the following.

Lemma 3 (SearchableTypes.tail-decrease-mod).

If a predicate over sequences of an -type has a modulus of continuity with a non-zero prefix length then, for any fixed head element, the predicate on tails obtained by prepending that head has a modulus of continuity whose prefix length is one less.

Lemma 4 (SearchableTypes.tychonoff).

Sequences of continuously searchable -types are continuously searchable.

Proof.

In this case, we need to construct a searcher for a predicate on sequences which returns a sequence as witness. Let there be a searcher for the underlying -type. We proceed by induction on the first projection (the prefix length) of the modulus of continuity of the predicate.

When the prefix length is zero, we can return any sequence, as it will vacuously satisfy the predicate.

For the inductive step we construct the witness from a head and a tail: the head is obtained from the searcher of the underlying type, applied to the predicate asking whether the original predicate can be satisfied by some sequence starting with that head; the tail is constructed using the inductive hypothesis, since by Lem. 3 the prefix length of the MoC of the predicate with the head fixed is one less. It is laborious but routine to show that the two predicates searched here are detachable and continuous.

While the formal proof may look daunting, proving this witness satisfies the predicate is intuitively straightforward. Verifying that the constructed searcher is continuous is somewhat complex, but follows from the continuity of by induction on the modulus of continuity of the predicates involved. ∎
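
The following is an illustrative Haskell sketch of this sequence searcher. In the Agda proof termination is justified by induction on the modulus of continuity; in this sketch laziness plays that role instead, and the names are ours.

```haskell
type Searcher a = (a -> Bool) -> a

searchSeq :: Searcher a -> Searcher [a]
searchSeq s p = x0 : rest
  where
    -- head: a value for which the predicate can be satisfied by some tail
    x0   = s (\x -> p (x : searchSeq s (\xs -> p (x : xs))))
    -- tail: search the remaining sequence against that fixed head
    rest = searchSeq s (\xs -> p (x0 : xs))
```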

From Lem. 1-4 the key result of this section follows immediately:

Theorem 1 (SearchableTypes.all-ST-searchable).

All -types are continuously searchable.

This theorem is a Tychonoff-style theorem since -types are closely related to compact types, and the definition of -types can be interpreted as all types that can be built from finite types using products, finite or countable. The theorem guarantees that the collection of types that can be used in regression is rich enough to cover many interesting examples.

3 Generalised parametric regression

In this preamble to our main technical results we give a semi-formal presentation of the key ideas to aid understanding and explain the method we are following.

Consider the most common form of regression, linear regression. It involves a 'model' of the form f(x) = a·x + b, with real parameters a and b. The regression task involves computing the parameters such that a measure of loss, or error, between the model and a data set of points (x_i, y_i) is minimised. A common, but not unique, formula for such a loss function is 'least squares', defined as the sum of the squared residuals, Σ_i (f(x_i) − y_i)².

This is essentially an optimisation problem: finding the parameters a and b that minimise the function above.
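
As a concrete illustration, here is a Haskell sketch of such a least-squares loss for a parameterised model over a finite data set, using plain Doubles rather than the paper's codes of real numbers; the names are ours.

```haskell
leastSquares :: (p -> Double -> Double)    -- parameterised model
             -> [(Double, Double)]         -- data set of (x, y) observations
             -> p                          -- candidate parameters
             -> Double
leastSquares model dataset params =
  sum [ (model params x - y) ^ (2 :: Int) | (x, y) <- dataset ]

-- The linear model of the example, with parameters (a, b) for y = a*x + b.
linearModel :: (Double, Double) -> Double -> Double
linearModel (a, b) x = a * x + b
```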

Note that the regression problem has an identical formulation for polynomial regression, where the model is a polynomial of some fixed degree, except that the problem now is finding its coefficients. We work towards generalising the concepts, offering the following informal definitions first:

Definition 8.

We say that an oracle is a continuous function between two fixed types.

We say that a parameterised model is a continuous function from a type of parameters to the type of the oracle.

We define a loss function as any continuous function assigning a loss value to a pair of such functions, such that the loss of any function against itself is zero.

These definitions are still informal in the sense that we are not saying anything yet about what the underlying types are. The obvious candidates for such types are computable representations of (compact subsets of) real numbers. However, as we shall see, any -types can be used, which leads to a generalisation of existing notions of regression.

Note that a loss function is a generalisation of a metric, dropping the requirement for it to be sub-additive or even symmetric. It is convenient, without loss of generality, to normalise it to the unit interval, which will be represented as a specific -type.

For readability we may write the instantiation of a model at a given parameter in applied form, and the loss function in curried form, so that the quantity to minimise can be written compactly.

Our perspective on regression, succinctly expressed, is the following:
The regression problem consists of finding a parameter such that for a given oracle and model , the value of the loss function is minimised.
For instance, in the case of linear regression we may (naively) take the parameters to be a pair of reals and the oracle to be a real function; for polynomial regression the parameters are a tuple of reals of some fixed size, with the type of the oracle as before. The loss function, least squares (or rather a normalised version thereof), takes values in the unit interval.

However, the reals cannot be represented as an -type. In the sequel we see how to work with computable representations of certain (compact) subsets of the reals which are -types and lead to interesting examples, according to our motivation discussed earlier.

3.1 Real numbers and their representations

Remark 3.

Before we proceed we need to make some important distinctions. The real numbers are a well understood mathematical concept. In our formal perspective we are required to work with a representation, or an encoding, of the real numbers into entities that can be defined type-theoretically. This leads to a foundational tension between the mathematical concepts and their formal representations. The most significant potential problem arises from the fact that mathematical functions operate on real numbers, whereas our functions work on encodings of real numbers (codes). If a function defined in our representational domain corresponds to a genuine mathematical function it is called its realiser. However, we can define functions on codes which are more 'intensional' in nature than mathematical functions because they have access to the internal representation of the numbers in a way that mathematical functions do not. Such functions are not realisers of any genuine mathematical function. Yet, such functions are interesting from the point of view of computer science, data science, or machine learning insofar as we see these disciplines as intrinsically algorithmic rather than purely mathematical, thus restricted to operating on codes. Thus, resolving this foundational tension by ensuring that all 'representational' functions are genuine realisers is not something that we are concerned with in this paper, although it is an important and well-studied topic in computable real number arithmetic [27].

As motivated by the considerations above and our leading target examples, we now need to consider real numbers. In our constructive setting we clearly need to restrict ourselves to representations of some 'computable' reals. More precisely, we require representations of the reals for which our desired operations (at least comparison, addition and multiplication) can be defined and are continuous.

Real numbers are used in two ways: in the general setting, as part of defining the concept of 'loss function', and in examples. Because of this distinction we can conveniently use several types which serve different purposes. For the loss function we can represent the unit interval as binary sequences, which clearly form an -type. For this type we can define families of strict and total order relations, each of which is detachable and continuous. Each element is an encoding of a real number in the unit interval; the interpretation is the standard one for binary numbers, reading the digits as the fractional digits of a binary expansion.

Definition 9 (UIOrder.).

For any precision n, a sequence is said to be less-than with precision n another sequence if there is some position among the first n digits such that their prefixes up to that position are equal and, at that position, the digit of the first sequence is strictly smaller than that of the second.

Definition 10 (UIOrder.).

For any precision n, a sequence is said to be less-than-or-equal-to with precision n another sequence if it is either less-than with precision n or equal with precision n.

It is straightforward to prove that, for any precision, the strict relation is a strict partial order and the non-strict one is a total order and that, with one argument fixed, these relations are decidable and continuous predicates.
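
The following is an illustrative Haskell sketch of Defs. 9 and 10 on binary codes (most significant digit first); the names are ours and plain lists stand in for the paper's sequence types.

```haskell
-- Strict order 'with precision n': the first disagreement within the
-- n-prefix must be a 0 in the left sequence against a 1 in the right one.
ltAt :: Int -> [Bool] -> [Bool] -> Bool
ltAt n xs ys = go (take n xs) (take n ys)
  where
    go (a : as) (b : bs)
      | a == b    = go as bs        -- prefixes agree so far: keep scanning
      | otherwise = not a && b      -- first disagreement: need digit 0 vs 1
    go _ _ = False                  -- prefixes exhausted while equal

leqAt :: Int -> [Bool] -> [Bool] -> Bool
leqAt n xs ys = ltAt n xs ys || take n xs == take n ys
```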

With these considerations in place we can revisit and spell out the informal parts of Def. 8, the general formulation of regression. To cast it in type theory, we will always take the domain and codomain of the oracle to be -types, and we will use the representation of the unit interval just given as the codomain of the loss function. The type of the oracle is thus a function type between -types, i.e. an oracle type.

3.2 Continuity of the loss function

The loss function takes two arguments of oracle type, which are not -types. This means that the standard definition of function continuity (Def. 5) does not apply. In this section we define a notion of 'continuity' for loss functions.

First we introduce a notion of approximate equality for functions.

Definition 11 (TheoremsBase.ST-).

Two functions between the same -types are said to be equal with a precision taken from the exactness type of the codomain if, at every point of the domain, their values are equal with that precision.

This is an extensional definition in which all points in the domain are evaluated, but the results are compared only with the given precision, which needs to be of the exactness type of the codomain.

With this, we can define a weaker notion of continuity for model functions.

Definition 12 (TheoremsBase.continuous).

A model function, from parameters in an -type to functions of oracle type, is said to be weakly continuous if for every precision in the exactness type of the oracle's codomain there exists a precision in the exactness type of the parameter type such that parameters equal with the latter precision yield model instances equal, in the sense of Def. 11, with the former precision.

Note that the two precisions above belong to the exactness type of the parameter type and to that of the oracle's codomain, respectively.

It is straightforward to show that

Lemma 5 (TheoremsBase.strongweak-continuity).

Any (model) function that is continuous is also weakly continuous.

With this, we can define (weak) continuity for the loss function.

Definition 13 (TheoremsBase.continuous).

A loss function into the representation of the unit interval is said to be (weakly) continuous if for any precision in the exactness type of that representation there exists a precision in the exactness type of the oracle's codomain such that, whenever two first arguments are equal (in the sense of Def. 11) with the latter precision, their losses against any common second argument are equal with the former precision.

We call this the MoC of the loss function for the given precision.

The definition above can be generalised so that the loss function is continuous in both arguments. However, only this more restricted continuity of the loss function is required by the theorems below.

3.3 Global optimisation and the convergence of regression

We now turn our attention to a general characterisation of algorithms for regression: in what circumstances they exist and what it means for them to be correct. The standard property of regression is that a 'best guess' parameter can always be produced.

Theorem 2.

Let the parameters form an -type, let the oracle have an oracle type, and fix a precision in the exactness type of the loss codomain. For any weakly continuous model, oracle, and continuous loss function we can construct a parameter such that, up to the fixed precision, the loss it incurs against the oracle is less than or equal to the loss incurred by any other parameter.

Proof.

We prove this as a corollary of the more general theorem that any continuous function into the representation of the unit interval has a minimum argument, i.e. one whose value is, up to the given precision, less than or equal to the value at any other argument. The corollary follows because, due to the continuity conditions on the model and the loss function, the function taking a parameter to the loss of the instantiated model against the oracle is continuous.

We use induction on the structure of the domain as an -type. In each case we wish to construct the argmin of the function with the given precision.

In the finite case, we proceed by induction on the number of elements of the type. If the type is the unit type, its single element is clearly the argmin. If the type extends a smaller -type with one additional element, then we proceed by inductively computing the argmin over the smaller type and casting it into the extended type. As that element is the argmin over the smaller type with the given precision, we simply need to decide whether the function's value there or its value at the additional element is smaller. This is decidable because the order with precision is decidable and total by Def. 10.

In the product case, we proceed similarly to the construction of Lem. 2: the argmin pair is constructed component-wise, the first component assuming the second will be chosen optimally, and the second against that fixed first component. From these inductive constructions we obtain the required inequality for each component and, by transitivity of the order with precision (Def. 10), the required inequality for the pair.

In the sequence case, we proceed similarly to the above and by the structure of Lem. 4, i.e. by induction on the first projection of the MoC of the function at the given precision. When that projection is zero the case is vacuous. In the inductive step, the argmin sequence is built from a head, obtained as in the previous cases, and a tail constructed by the inductive hypothesis on the MoC, because the MoC of the function with the head fixed is one lower. The required inequalities follow once more by transitivity of the order with precision, together with an additional lemma showing that the outputs of a continuous function agree, at the required precision, on inputs that agree up to its MoC. ∎

This theorem seems to give a definitive constructive, type-theoretic characterisation of regression. However, the computational content of the proof is, on closer inspection, not satisfactory. We can understand that more easily by instantiating the theorem at particular types, such as the representation of the unit interval. Informally speaking, the proof requires finding the argmin of a function with some fixed precision. The argmin is computed by partitioning the interval into a finite number of sub-intervals determined by the precision. The continuity condition allows us to compute a size for these sub-intervals small enough that their images through the function are smaller than the precision. In other words, for the given output precision we do not need more than a certain precision of the input. And, since there is a finite number of partitions, we can simply examine the value of the function on each of them and select the one on which this value is minimal.
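
The following is an illustrative Haskell sketch of this computational content, specialised to a function on [0,1] over Doubles: exhaustive search on a grid whose step plays the role of the modulus of continuity (here simply supplied by the caller). It is a sketch of the idea, not the extracted Agda algorithm.

```haskell
argminGrid :: Double               -- grid step, standing in for the MoC
           -> (Double -> Double)   -- continuous objective on [0,1]
           -> Double               -- an argmin up to the induced precision
argminGrid step f = snd (minimum [ (f x, x) | x <- grid ])
  where
    grid = [ fromIntegral k * step | k <- [0 .. n] ]
    n    = floor (1 / step) :: Integer
```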

There are two inter-related problems here. The first one is obvious: the algorithm that is extracted out of the proof is an always-exhaustive search of the domain, up to the desired level of precision. The second one is more subtle and has to do with the 'stability' of the algorithm. Suppose that there are two distinct arguments at which the function attains the same minimal value. In this situation, as we run the algorithm with different precisions, sometimes we may get an approximation around one of them as a result and sometimes an approximation around the other. As the precision gets finer the algorithm is not guaranteed to converge on either of them.

The misbehaviour is not entirely surprising considering that we are attempting to compute a function, argmin, which is known not to be computable [31]. The reason we manage to compute anything at all is because our algorithm has access to the codes of the numbers involved, so it is a function which is not a realiser of any mathematical function (also see Remark 3).

3.4 Regressing a perfect model

The regression Thm. 2 gives a conventional characterisation of regression, but it has certain shortcomings, as discussed. It also does not tell the whole story. While it states the circumstances in which the loss value can be minimised, it makes no absolute statement regarding the loss itself. We therefore desire a statement which says something about the situation in which the error can not only be minimised, but also made vanishingly small. In other words, a convergence theorem guaranteeing that the regressed model is arbitrarily close, as measured by the loss function, to the oracle.

In parametric regression we are epistemologically committed to a model; we just do not know its parameters and we want to calculate them from observations. The minimisation algorithm in Thm. 2 is always guaranteed to produce a 'best guess' in terms of minimising loss, but if our bet on a particular model is the correct one then this 'best guess' should be such that the loss can be made vanishingly small. To represent this situation, instead of taking an arbitrary oracle we take an arbitrary parameter and create a synthetic oracle by instantiating the model at it. The synthetic oracle has the 'same shape' as the model, therefore it can be approximated with arbitrarily small loss.

For this theorem we will rely on the concept of searchability, which did not come into play in the minimisation theorem Thm. 2. We will call a regression algorithm a regressor.

Theorem 3 (LossTheorems.perfect-theorem).

Let there be given an -type of parameters, an oracle type, a precision, a strictly positive loss value, and a continuous loss function.

There exists a regressor such that, given a parameter and a weakly continuous model, we can construct a regressed parameter whose loss against the synthetic oracle determined by the given parameter is below the given loss value.

Proof.

This theorem is an immediate corollary of the more general Thm. 5 in the next sub-section. ∎

3.5 Regressing an imperfect model

Thm. 2 states that parametric regression eventually converges on the 'best possible' solution, whereas Thm. 3 proves that if we 'guess' the model correctly then the regression converges on the 'absolutely best' solution. But what if we don't guess the right model? Consider the data in Fig. 1, produced by some oracle.

Figure 1: Regression to imperfect model

Parametric regression requires us to commit to a model, and the model can be imperfect. For instance, trying to regress a linear model for the oracle could give a ‘pretty good’ approximation, depending on the desired precision. We will aim to quantify this using another convergence theorem which essentially says that the better the guessed model the higher the precision of the approximation.

To formulate the theorem we will again use a synthetic oracle, built from an unknown parameter, but we will distort it using a distortion function, so that the regression will try to reconstruct the distorted oracle while wrongly assuming it has the shape of the model. The distortion function can represent either measurement noise or a lack of perfect knowledge about the oracle. To quantify this lack of knowledge, or how powerful the distortion is, we use two approaches.

The first theorem for regressing an unreliable model uses equality with precision to compare how 'equal' the original and the distorted oracle are, and shows that the loss between the correct and the distorted oracle is 'just as equal' (with the same precision) as the loss between the correct and the regressed oracle. It utilises the following definition of a 'continuous' distortion function:

Definition 14 (FunEquivTheorem.continuous).

A distortion function on oracles of a given type is called continuous if for any oracle and any output precision there exists an input precision such that any oracle equal to the given one with the input precision has a distortion equal, with the output precision, to the distortion of the given oracle.

Theorem 4 (FunEquivTheorem.imperfect-corollary-with-).

Let there be given an -type of parameters, an oracle type, a precision in the exactness type of the loss codomain, and a continuous loss function. Given a parameter, a continuous model, and any continuous distortion function, there exists a regressor such that, whenever the synthetic oracle and the distorted synthetic oracle are equal with the appropriate precision, the loss between the regressed model and the synthetic oracle is equal, with the given precision, to the loss between the distorted synthetic oracle and the synthetic oracle,

where the synthetic oracle is the model instantiated at the given parameter, the distorted synthetic oracle is its image under the distortion, and the 'appropriate precision' is the MoC of the loss function for the given precision.

Proof.

The parameter type is an -type, therefore it comes equipped with a searcher. The regressor computes the parameter by applying this searcher to the predicate stating that the instantiated model is equal, with the appropriate precision, to the distorted synthetic oracle. It turns out that, due to the searchability of the -type and the continuity conditions on the model and distortion functions, this predicate is in fact detachable and continuous.

Because there exists some parameter satisfying this predicate (the given parameter, by the assumption that the synthetic oracle and its distortion are appropriately equal), the condition on the searcher guarantees that the returned parameter satisfies it too. By transitivity of equality with precision (Prop. 1), we arrive at the conclusion that the regressed model is appropriately equal to the synthetic oracle as well.

Finally, a routine calculation from the continuity of the loss function gives us the result. ∎

This theorem gives a convergence property of sorts, but it is not very useful in practice. It only applies when the distortion is small enough for the original and distorted oracles to be 'almost equal' (with the required precision). This means that if the distorted model differs from the true model even rarely, but by a large enough amount, the theorem does not apply.

For this reason we also give a more practically relevant convergence theorem which uses the loss function itself to measure the degree of distortion, rather than approximate equality, and only requires a weakly continuous model. The second imperfect-model regression theorem states that if the loss between the distorted synthetic oracle and the true oracle is small, then so is the loss between the distorted synthetic oracle and the regressed model. To emphasise, this is even though the model is regressed using the distorted oracle as a source of data.

Theorem 5 (LossTheorems.imperfect-theorem-with-).

Let there be given an -type of parameters, an oracle type, a precision, a loss value, and a continuous loss function.

There exists a regressor such that, given a parameter, a weakly continuous model, and a distortion function, if the loss between the distorted synthetic oracle and the synthetic oracle is below the given loss value then so is the loss between the distorted synthetic oracle and the model instantiated at the regressed parameter,

where the synthetic oracle is the model instantiated at the given parameter and the distorted synthetic oracle is its image under the distortion.

Proof.

The proof follows the same ‘recipe’ as that of Thm. 4, effectively constructing a regressor which has the desired property.

The regressor uses the searcher of the searchable parameter type, applied to the predicate stating that the loss between the candidate instantiation of the model and the distorted synthetic oracle is below the given loss value, to produce the model parameter. We need to show that this predicate is continuous and detachable and that its witnesses satisfy the desired property, which follows from routine calculations. ∎
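
The following is an illustrative Haskell sketch of the regressor built in Thms. 3-5: apply the searcher of the parameter type to the predicate 'the loss of this candidate against the (possibly distorted) oracle is within the target'. Types are simplified, detachability and continuity are not tracked, and the names are ours.

```haskell
type Searcher p = (p -> Bool) -> p

regressor :: Searcher p             -- searcher for the parameter space
          -> (f -> f -> Double)     -- loss function
          -> (p -> f)               -- parameterised model
          -> f                      -- oracle, possibly distorted
          -> Double                 -- target loss epsilon
          -> p
regressor search loss model oracle eps =
  search (\k -> loss (model k) oracle <= eps)
```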

It is easy to see now that the perfect-model convergence theorem (Thm. 3) is an immediate consequence of the imperfect-model convergence theorem (Thm. 5): using the identity distortion makes the distorted synthetic oracle coincide with the synthetic oracle, so that the precondition is trivially true.

We prefer this final formulation of the theorem, in contrast to the previous one, and we will take it as the defining property of regression, rather than the conventional minimisation one expressed in Thm. 2.

Compared to the global minimisation approach, Thm. 5 has the potential to serve as a basis for more efficient algorithms. This is because the regressor uses a searcher, which does not need to explore the search space exhaustively, unlike Thm. 2. The searcher can stop and return the parameter as soon as the predicate is satisfied. In other words, it will provide a ‘good enough’, up to the specified target loss value, solution instead of searching for the ‘best’ solution. The ‘worst case’ behaviour of exploring the entire space can still happen, especially if there is no witness to the predicate.

We also need to understand that the regressor is guaranteed to return a good enough parameter only when our model is a good enough guess of the oracle. If our model is bad then the regressed parameter will not be very good either. This is a problem in practical applications, since we may not know what the true model is. That means we cannot know whether the precondition, that the loss between the distorted and the true oracle is small, holds. Therefore, for the computed parameter, we need to check separately whether the loss between the regressed model and the (distorted) oracle is indeed below the target. Fortunately, the latter is computable; that check could be considered a separate 'validation of regression' step. This matches accepted practice in machine learning and data science where 'learning' or 'inference' is always followed by 'validation' or 'testing'. What Thm. 5 guarantees is that the regression algorithm is valid, in the sense that good models will always be inferred accurately.

The imperfect-model regression theorem also saves us from relying too much on our small methodological innovation as discussed in the Introduction. Regression as broadly practised is 'from data' and not 'from oracle'. In other words it is 'off-line' rather than 'on-line', with all data pre-sampled in advance. But we can think of off-line regression as regression to an imperfect model, with the distortion function formed by the composition of a sampling function followed by an interpolation function, noting that interpolation can easily be defined so as to be continuous. Thm. 5 guarantees that if the reconstruction via sampling and interpolation is 'almost perfect' then so is the regressed model. What is left unsaid is that it is indeed possible to reconstruct a function via sampling and interpolation with arbitrary precision, in other words, that the Stone–Weierstrass theorem can be recast in this setting. This is the subject of further research.

4 Examples and applications

The framework described above is rather abstract. In this section we will show that it is applicable in a common scenario in which regression is used: polynomial regression with a loss function in the style of least-squares. As a warm-up example we will also show a ‘degenerate’ form of regression, which is simply searching for the argmin of a function. This example is interesting because it gives a deterministic version of the well-known random search theorem [29]. Finally, we show and discuss the practical implications of regression to a model described by an infinite Taylor series, which is normally outside the scope of existing regression methods.

4.1 Real number arithmetic

For the examples we focus on the interval [-1,1], which is represented by the type of ternary sequences over the signed digits -1, 0, 1, a version of the 'signed digit representation' [2]. Sequences are encodings of real numbers in [-1,1], using the standard binary positional interpretation of the digits. This representation is particularly well suited for the definition of multiplication and normalised addition (taking the midpoint), but is inconvenient for defining an order, as the same number can have too many encodings. In contrast, the binary representation of the unit interval used for the loss function is suitable for ordering but not for arithmetic. This highlights the convenience of being able to use different representations of the reals for different purposes.

The midpoint algorithm is closely inspired by Ciaffaglione and Di Gianantonio [6], and multiplication by Escardó [11]. Both of these have been proved formally correct in loc. cit. but not in a way that can be easily reused (or recycled) in our setting. However, we face an additional burden of proof by being required to show they are all continuous functions in the specific sense of Sec. 2.2. This is what we focus on.

Practical applications may require operating with representations of larger sets of reals than just this interval. Arbitrary closed intervals can be obtained from it using scaling and shifting by constant values, which introduces some, not insurmountable, complications. To deal with still larger sets of reals we must always be careful that the representation is an -type. For instance, a 'mantissa and exponent' representation, where the mantissa is a representation of a real and the exponent a natural number, is not an -type. A good rule of thumb is that compact sets are good candidates for having such representations. We leave these issues for further work.

The operations below are a minimum set which will allow us to formulate examples. The implementations are meant to be easy to reason about rather than efficient – they are in fact not practically usable. To scale up to realistic regression examples as used for example in machine learning the operations need to be much more efficiently implemented and, perhaps, extracted out of Agda into a more performance-oriented language. However, there is no reason to believe that the recipe we follow below cannot be applied to more, and more efficiently implemented, operations.

Midpoint

(Details in module IAddition)

Let a sequence be given by its head digit and its tail; we also use integers and addition on integers. Following loc. cit. we define the midpoint operator using a small number of auxiliary operations on digits and carries, whose case analysis we do not reproduce here.

The midpoint operator is then defined by recursion from these auxiliary operations.

Full-blown addition can be defined using the midpoint and a global scaling factor via elementary algebraic manipulations; for example, a sum of several terms can be recovered from iterated midpoints with a global scaling factor of 4.

Lemma 6.

The midpoint operator is continuous.

Proof.

This is so because agreement of the arguments on sufficiently long prefixes forces agreement of the midpoints on the prefix required by the given precision. ∎

Negation

In the signed-digit representation this operation is simply reversing the sign of each digit. The continuity is immediate.

Multiplication

(Details in module IMultiplication)

Let there be a multiplication on the set of digits, defined in the obvious way, and a multiplication of a (code of a) real by a single digit, defined by mapping that digit multiplication over the sequence of digits. Full multiplication is then defined from these using two auxiliary operations defined by mutual recursion; we do not reproduce the digit-level case analysis here and refer to the module IMultiplication for the details.

Lemma 7.

Multiplication is continuous.

Proof.

This amounts to proving that the constituent operators are continuous. The question is whether at every precision in the exactness type of the output there exists some MoC in the exactness type of the input. In the cases where the output is a simple arithmetic operation that relies upon zero or one digits of the input, the MoC is clear and easily constructed. In all other cases, the output is the result of composing the operator with other operators that have already been proved continuous, so an MoC can be constructed in these cases too. The most difficult case is the 'otherwise' case of the recursive helper, which relies upon constructing an MoC from the continuity of the other operators and of the helper itself. However, as the relevant index decreases, we can construct the MoC from an inductive hypothesis on the continuity of the helper at smaller indices.

The formalisation seems forbidding but the intuition is clear. ∎

Positive truncation

The normalised loss function must take its values in the representation of the unit interval, whereas arithmetic happens in [-1,1], for example when computing least-square-like loss functions. Since we only require continuity and the vanishing property of the loss function, rather than a precise measure of loss, the simplest way to create a well-typed loss function is to use a 'truncation' function which changes all negative digits to 0 and keeps the rest of the digits. This operation preserves continuity and the key property of the loss function, namely to be vanishing (zero loss on equal arguments).

Coming back to our discussion in Remark 3, the truncation is a perfect example of a function that operates strictly at the level of codes and is not the realiser of a real function. This is somewhat unsatisfactory from a foundational perspective, but from an algorithmic (and somewhat pragmatic) point of view it raises no serious issues in our setting. More meaningful loss functions, which are realisers of real functions and have additional desirable properties (e.g. they are monotonic), can be defined, but at the cost of extra complexity.
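
The following is an illustrative Haskell sketch of this truncation on signed-digit codes, under the reading that every negative digit becomes 0; the encoding and names are ours.

```haskell
type SignedDigit = Int        -- intended values: -1, 0, 1
type SignedCode  = [SignedDigit]

-- Replace negative digits with 0, sending codes of [-1,1] to codes of [0,1]
-- digit by digit; clearly continuous, and zero stays zero.
truncatePos :: SignedCode -> SignedCode
truncatePos = map (\d -> if d < 0 then 0 else d)
```
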
The operations above, together with Prop. 3 which states that continuity is preserved by composition, will allow us to construct arbitrary multi-variate polynomial functions. Going beyond that would require extra operations (division, square root, logarithm, trigonometric functions, etc.), for which algorithms in real number computation exist, but they are beyond the scope of the present work.

4.2 Search as degenerate regression

Using the minimisation algorithm for regression (Thm. 2) we can compute, to an arbitrary precision, a solution of any equation between continuous functions, simply by considering the variable(s) as an unknown degenerate model parameter and least squares as the loss function.

Concretely, let us illustrate this with solving a non-linear system of equations:

This equation is expressed in terms of real numbers, and we can use the minimisation algorithm of Thm. 2 to look for approximate solutions in the represented interval with some given precision. As it happens, both solutions of this equation lie in that interval.

In the notation of Thm. 2, the parameter type is the represented interval, which is an -type, and the oracle is in effect just a value of that interval, a 'function' over the unit type. This is why we call this 'degenerate' regression, because the oracle type is not really a function type.

The model 'function' is now a constant expression in the parameter.

The true (degenerate) oracle is a fixed constant, and the loss function compares the model constant with it in the least-squares style described earlier.

Since all the types involved are -types and all the functions are continuous, being compositions of continuous operations, it is an immediate consequence that the 'parameter' can be computed for whatever precision we desire. The minimiser used in the theorem is one possible such algorithm.

Two caveats are required. The first one is that regression will compute the 'argmin' of the function, so it will return just one of the solutions if any exist. This has already been discussed in the general setting in Remark 3. In this example both real solutions are in the represented interval, and the algorithm does not control which one will be returned. The second one is that in the case of no solution the minimisation algorithm will still return some value for the argmin, so the model itself must be used to test whether the loss value is close enough to zero to be considered a solution. Whether a returned value is an exact solution is, of course, not decidable, because equality of real numbers is not decidable.

4.3 Polynomial regression

This is the 'meat and potatoes' motivating example. Consider a set of data points, and suppose that we want to 'best fit' a polynomial of a fixed degree through this data set, i.e. find values for its coefficients which minimise a loss function such as least squares.

One apparent obstacle is that all the convergence theorems require an oracle in order to compute the parameter, whereas we only have a set of points. An important observation is that the least-squares loss function computes the loss only at the given data points and ignores the behaviour of its arguments elsewhere. So any continuous 'oracle' constructed from the points would ultimately lead to the same result.

To construct such an oracle we can use interpolation. There are many interpolation algorithms but for our purpose we might as well take the simplest one: piece-wise constant interpolation.

Let there be some fixed precision and an arbitrary default value. We define a (distorted) oracle from the data points by piece-wise constant interpolation, testing the argument against the sample points in turn, up to the fixed precision, and returning the corresponding sampled value (or the default value past the last point).

The definition assumes that the data points are sorted by their first component. The function is defined by cases, noting that the order in which the conditions are tested is fixed, top-to-bottom. This makes the function well defined, computable, and, perhaps surprisingly, continuous. (In fact Thm. 5 does not require the oracle to be continuous, only Thm. 4 does.) The real issue is not continuity but why the function is well defined. The function is defined piecemeal, but if the argument is closer to some sample point than the precision can distinguish, then we cannot say for sure whether it is to the left or to the right of it. In this situation the fact that the side-conditions are checked in a defined order means the argument will be treated as if it were to the left of that sample point, which makes the function well defined. Note that this also means that this function is not a realiser of a continuous real function, an issue which we discussed before (Remark 3).
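
The following is an illustrative Haskell sketch of such a piece-wise constant interpolation, over plain Doubles rather than signed-digit codes, so the precision-bounded comparisons of the paper's definition are replaced by exact ones; data points are assumed sorted by their x component.

```haskell
interpConst :: [(Double, Double)] -> Double -> Double
interpConst [] _  = error "interpConst: empty data set"
interpConst pts x = go pts
  where
    go [(_, y)]           = y      -- past the last sample point: keep its value
    go ((x1, y1) : rest)
      | x <= x1           = y1     -- conditions tested top-to-bottom, as in the text
      | otherwise         = go rest
    go []                 = error "unreachable: pts is non-empty"
```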

The general property of regression (Thm. 2) guarantees that parameters can be computed so that the regressed model minimises the least-squares error at the data points. It is interesting to also consider what this means from the point of view of convergence. The perfect-model convergence theorem (Thm. 3) is not applicable since the general form of the oracle (line segments) and of the model (polynomial) are not the same.

However, the general imperfect-model convergence theorem (Thm. 5) says that if the loss between a distorted model and the true model vanishes then so does the loss between the true model and the regressed model. In this case the true oracle would be a polynomial of the same (or lower) degree from which the data points are sampled and then interpolated, resulting in the distorted oracle. Since the least-squares loss function only considers the behaviour at the sample points, it will be zero when applied to the true and distorted oracles. This means that Thm. 5 guarantees that in this situation the loss between the true oracle and the model can also be made arbitrarily small, from which we can conclude that polynomial regression, as performed in practice, has good convergence properties.

The possibly problematic aspect of this is not the use of a polynomial as a model but the fact that we are working 'offline' (from data) as opposed to 'online' (from the oracle). But the correctness of 'offline' regression is an immediate corollary of Thm. 5.

Proposition 4 (Examples.offline-regression).

Let there be given an -type of parameters, a precision, a loss value, a finite set of data points, a least-squares loss function, and a constant interpolation function, both defined at those points.

There exists a regressor such that, given a weakly continuous model and a parameter, if the loss between the interpolation of the data points and the synthetic oracle determined by that parameter is below the given loss value, then so is the loss between the interpolation of the data points and the model instantiated at the regressed parameter,

where the synthetic oracle is the model instantiated at the given parameter.

From this, the convergence of off-line polynomial regression follows immediately, as any model defined by a polynomial is continuous.

4.4 Universal approximators

In applications, particularly to machine learning, we may not know the general form of the oracle. In such a situation we may want to consider a more general kind of model, which is expressed as an infinite series, such as a power series or a trigonometric series. Many such series can be written as an infinite sum of fixed basis functions, each scaled by a parameter, with an infinite set of parameters; in the case of a power series the basis functions are the powers of the argument. Such series can serve as 'universal approximators' for classes of functions. For example, analytic functions equal to their Taylor series at all points form a class known as 'integral' functions. The polynomials, the exponential, and certain trigonometric (sine and cosine) functions are examples of integral functions.

These models are intriguing because they can be given types in which the parameters form an infinite sequence, which is an -type. This means that, provided the continuity of the model is proved, the entire set of parameters can be computed to any degree of precision. In the case of the power series we know that using addition, multiplication, and composition always leads to continuous functions. The problem is computing the infinite series. Provided that the series converges, it can be computed in general [19] or approximated [7], but this is beyond the scope of our paper. From the point of view of regression analysis this may seem surprising, but it is a known result using searchable sets [12].

For example, consider a power-series model which converges for all values of its argument. It can be used to regress some oracle: using the regressor of Thm. 3, the infinite sequence of parameters is produced by the searcher, and the model can then be instantiated at that sequence.

Note that the solution above involves an infinite set of parameters, so it obviously cannot be computed other than lazily. The model, after instantiation, is an infinite sum, which is computable but could be expensive to compute.
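
The following is an illustrative Haskell sketch of approximating such a power-series model by summing finitely many terms of a lazily supplied parameter stream; how many terms are needed for a given output precision depends on the convergence of the series, which the paper defers to the cited literature, and the names are ours.

```haskell
evalSeries :: Int         -- number of terms to sum
           -> [Double]    -- lazily produced parameter stream a0, a1, a2, ...
           -> Double      -- argument x
           -> Double
evalSeries n as x =
  sum [ a * x ^ i | (a, i) <- zip (take n as) [0 :: Int ..] ]
```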

A problem of practical importance in this setting is the 'truncation' of the series defining the model to only a finite number of terms. However, such a model takes the number of terms, a natural number, as one of its parameters, and the type of natural numbers is not an -type; it is also clearly not a searchable type. So this problem cannot be solved by our method.

A broader consequence is that some ‘hyper-parameters’ of neural networks (the number of layers, the number of neurons per layer, etc.) also cannot be computed using our approach.

Remark 4.

This class of more speculative examples, in particular summing infinite series, is not formalised in Agda.

5 Related work

This paper has been inspired by and relies extensively on a significant body of work by Escardó, starting with searchable infinite sets [9]. The properties of regression established here can be equally formulated in that setting, or in related settings such as compact sets [12] and compact types [13]. What makes our approach distinct is that the formulations above are not synthetic, in the sense of [8]. Whereas in synthetic topology all functions are assumed to be continuous, we work with an explicit condition of continuity. This makes proofs more difficult but has the advantage that our regression theorems hold in more models of type theory, including those that manipulate non-continuous functions, while still allowing formalisation in a proof assistant based on dependent type theory (namely Agda).

We are interested in establishing an alternative framework for a better mathematical understanding of data science, machine learning, etc., based on type theory and constructive real numbers. It is worth drawing an analogy with the established mathematical framework for machine learning, probably approximately correct (PAC) learning [32, 15]. We first introduce its basic concepts.

Let there be a set and an unknown function on it (in our terminology, the 'oracle'). A sample is drawn from the set according to some (unknown) distribution and is correctly classified according to the unknown function. Can we learn that function? Note that this is a particular instance of regression as discussed here.

The function is not usually guessed out of nothing, but from a known class of possible functions, dubbed the inductive bias. Our counterpart is, of course, the class of models. The working assumption is that the unknown function belongs to this class, which is mirrored in our approach, in the convergence theorems, by the fact that the oracle is the model instantiated at some unknown parameter.

Suppose that a learning procedure (which we call a 'regressor') produces a new hypothesis based on the sample. This is what we call a 'regressed model'. The basic question is: how good is this new hypothesis? It should be good for the sample, but also for new examples.

The error is defined as the probability that, under the given distribution, the unknown function and the hypothesis differ. The problem statement is: given a target error, can the error of the hypothesis be brought below it? This cannot be guaranteed outright, only with probability at least one minus some fixed confidence parameter.

A hypothesis class is PAC-learnable if there is an algorithm such that for every target error, confidence, and unknown function there is a natural number such that, when running the algorithm on a sample of that many independent examples drawn according to some distribution, we obtain a hypothesis whose error is below the target with probability at least the required confidence.

The size of the sample, given as a function of the target error and confidence, is called the sample complexity. Finite hypothesis classes are an unsurprising example of PAC-learnable classes and their sample complexity bounds are known. But certain infinite classes are also PAC-learnable, with sample complexity determined by the so-called Vapnik–Chervonenkis (VC) dimension [33].

Our approach is complementary to PAC, having certain strengths and weaknesses (leaving aside the obvious fact that PAC theory is a mature and well explored area of research). The setting of the problem is similar, up to differences in vocabulary, but both the learning procedure and the validation procedure vary significantly. PAC requires a prior sampling of the oracle with a given distribution, which makes it intrinsically 'off-line', whereas our learning procedure assumes access to the oracle, 'on-line'. (The two are related by Prop. 4, but more about this in the next section.) It also means that the learning procedure and the testing criterion in PAC are necessarily probabilistic. In contrast, our approach is deterministic and quantitative in a different way: instead of measuring the probability of the learned outcome being different from the desired outcome we measure the definite amount by which the two outcomes differ. For finite sets, which can be searched trivially, our approach is trivial whereas the PAC one is interesting. But for infinite sets both our approach and the PAC approach give interesting and non-trivial characterisations.

6 Conclusions and further work

The main contribution of the paper is to offer a range of convergence criteria for parametric regression, formalised in type theory, and proved formally in Agda. The main convergence theorem (Thm. 5) states that a large class of oracles, namely continuous functions of oracle type with unknown parameters drawn from an -type, can be regressed up to any desired precision, even in the presence of distortions, so long as the distortions are small. The regressors used in the theorem can be considered as correct, albeit inefficient, reference implementations that satisfy the conditions of the convergence theorem.

The next part of this work will require us to turn our attention to off-line learning. The starting point is Prop. 4, which gives a convergence criterion for off-line regression. The interesting part is the precondition on the loss between the interpolated data and the true oracle. We conjecture that if the sample is large enough then this precondition is always true. The reason is that the distorted oracle constructed by interpolation should become close enough to the true oracle as the sample grows, which is a version of the Stone–Weierstrass interpolation theorem. Our simple interpolator (piece-wise constant) may not be suitable for such a theorem, but we strongly believe that suitable interpolators exist in our setting. Interpolation, as mentioned above, is closely related to sampling, which could open the door to dealing with probabilistic sampling and formulating convergence results more closely related to PAC learning, including estimating or bounding the sample size. The fact that probability distributions over discrete sets are -types is encouraging. In the longer term we also wish to find (synthetic) topological or type-theoretic characterisations of other PAC concepts, such as VC dimension.

Interpolation is in itself very important, especially in the presence of distortions (noise), as it forms the basis of non-parametric regression: the learning of models without committing to a particular shape of a model.

A better class of interpolation functions should also resolve the foundational rough edges discussed in Sec. 4.3, namely the fact that the interpolated functions are not realisers of real functions. We do not believe these issues have any profound consequences but are best avoided. In contrast, the same issues in the context of the minimisation theorem (Thm. 2) cannot be solved — but this theorem is a ‘dead end’ for us.

In parallel we aim to consider more realistic implementations, either extracted from the Agda regressors or implemented directly in other more performance-oriented languages. The key requirement is fast (enough) arbitrary precision arithmetic over real numbers, a field intensely studied with multiple libraries available for various languages [24, 17, 5].

References

  • [1] (2007) 22nd IEEE Symposium on Logic in Computer Science (LICS 2007), 10-12 July 2007, Wroclaw, Poland, Proceedings. IEEE Computer Society. ISBN 0-7695-2908-9. Cited by: 9.
  • [2] A. Avizienis (1961) Signed-digit number representations for fast parallel arithmetic. IRE Trans. Electronic Computers 10 (3), pp. 389–400. Cited by: §4.1.
  • [3] S. Berghofer, T. Nipkow, C. Urban, and M. Wenzel (Eds.) (2009) Theorem Proving in Higher Order Logics, 22nd International Conference, TPHOLs 2009, Munich, Germany, August 17-20, 2009, Proceedings. Lecture Notes in Computer Science, Vol. 5674, Springer. Cited by: 4.
  • [4] A. Bove, P. Dybjer, and U. Norell (2009) A brief overview of Agda - a functional language with dependent types. See