'Almost Sure' Chaotic Properties of Machine Learning Methods

07/28/2014
by   Nabarun Mondal, et al.
Microsoft
D. E. Shaw & Co., L.P.

It has been demonstrated earlier that universal computation is 'almost surely' chaotic. Machine learning is a form of computational fixed point iteration, iterating over the computable function space. We showcase some properties of this iteration, and establish in general that the iteration is 'almost surely' chaotic in nature. This theory explains the observed counterintuitive properties of deep learning methods, and the paper demonstrates that these properties are universal to any learning method.



1. Motivation

The motivation of the current paper is twofold.

One of the authors of the current paper was using iterative machine learning to crack cipher codes in the late 1990s. While doing so he came to the astounding realisation that the resulting learned function was not convergent (definition A.3) at all. What is really meant is this: data points which were accepted up to a given iteration (points in the accepted set) would eventually fly away after some further iterations:

In other words neither the lim sup nor the lim inf exists, and therefore:

In essence the iteration was showing converging and then suddenly diverging behaviour. When he communicated this finding to one of the pioneers of the image processing domain, the latter was simply baffled; in fact, the trivia is that he simply remarked: "I have no idea what to respond!". While this behaviour was never fully understood by the author at the time, a recent theory by the same author(s) [11] seems to offer an explanation for it.

A very recent (yet to be published) paper on arXiv [12] discusses interesting properties of deep learning. It became immediately apparent to the authors that these two phenomena are essentially connected.

This paper tries to explain these phenomena in the light of the discovery that chaos almost surely occurs in computation (theorem A.2) [11].

2. The Nature of Learning Theory

Learning theory is really about classification. As Vapnik [1] pointed out, it is all about finding a classifier function so as to isolate positive samples (accept) from negative ones (reject), aided by a training set: a set of ordered pairs of inputs and their labels. Given an input, if the classifier accepts it the input belongs to the accept set; otherwise it does not.

2.1. Learned Function as Indicator of Accept Set

We can say that to learn, one needs to isolate the accept set using the training data set. We need to learn a function such that:

Clearly then, the function we want to learn is the indicator function [2] of the accept set, that is:

Then we can simply suggest that machine learning is all about finding the indicator function of the positive-sample (accept) set. This formulation has a problem, because we really do not know whether or not this indicator function is computable. (The set of computable functions is designated by its own symbol in this paper.) We therefore assume the existence of a sequence of computable functions which converges [2] to the indicator function.

This is indeed a very strong axiom, and in the very next section (3) we shall see that it almost surely does not hold.
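
To make the indicator-function view concrete, here is a minimal Python sketch in which a trained classifier is used as a 0/1 indicator of the accept set; the data and the nearest-neighbour "learner" are invented for illustration and are not the authors' construction.

```python
# Minimal sketch: a learned classifier viewed as an indicator function
# of the accept set A.  Data and the 1-nearest-neighbour "learner" are
# illustrative assumptions only.

def train(samples):
    """Return a 0/1 indicator function learned from (x, label) pairs."""
    def indicator(x):
        # 1-nearest-neighbour rule: copy the label of the closest training sample.
        nearest = min(samples, key=lambda s: abs(s[0] - x))
        return nearest[1]
    return indicator

T = [(0.1, 1), (0.2, 1), (0.8, 0), (0.9, 0)]   # ordered pairs (input, label)

chi_A = train(T)          # learned approximation of the indicator of the accept set A
print(chi_A(0.15))        # 1 -> the input is placed in the accept set
print(chi_A(0.85))        # 0 -> the input is rejected
```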

2.2. Learning theory as Iterative Maps

We need to generate this sequence of functions in a computable way. That immediately gives :

the guiding equation of the learning. There has to be a computable fixed-point iteration function which takes a computable function and returns another one, such that the orbit (definition A.5) eventually converges to the target function.

In effect what we seek is :

where the target function is the fixed point of the iteration (definition A.1). All ML and deep learning methods can be abstracted in this form; it is in effect a simple tautology.

Note, however, that the iteration might not be able to reach the target at all, because in general the target need not lie in the set of computable functions; this will be discussed in the next section.
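
As a deliberately tiny illustration of this fixed-point view (not the paper's algorithm), the sketch below iterates a training operator over a one-parameter "function" (a threshold) until the operator maps it to itself; the data and the update rule are assumptions made for the example.

```python
# Sketch of learning as a fixed-point iteration over (encodings of) functions.
# Here a "function" is just a 1-D threshold t; T maps the current threshold
# to the next one using the training set D.  Illustrative only.

D = [(0.1, 1), (0.3, 1), (0.7, 0), (0.9, 0)]   # (input, accept/reject label)

def f(t, x):
    """Candidate classifier: accept x iff it falls below the threshold t."""
    return 1 if x < t else 0

def T(t):
    """One learning step: nudge the threshold towards correcting mistakes on D."""
    for x, y in D:
        t += 0.25 * (y - f(t, x))
    return t

t = 0.0                              # initial function f_0 (threshold 0)
for n in range(100):
    t_next = T(t)
    if abs(t_next - t) < 1e-12:      # fixed point reached: T(t*) = t*
        break
    t = t_next
print(n, t)                          # e.g. 1 0.5 -> the learned threshold
```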

2.3. The Cut Function Approximation of

It is a binary decision whether or not an input belongs to the accept set: the indicator takes the value 1 when it does and 0 when it does not. Therefore, the indicator function clearly partitions the input space. The question of a mechanism for computable partitioning requires a definition.

Definition 2.1.

Binary Decision Tree.

A binary tree in which each node has two decider Turing Machines (definition A.11) from a finite set of decider machines, such that:

Given that x is the input to the current node, the two deciders are evaluated on x. Depending on their outputs: the input is rejected and the system halts; or the input is accepted and the system halts; or the input is passed to the left child (if no left child exists, reject input x and halt); or the input is passed to the right child (if no right child exists, reject input x and halt).

This structure (definition 2.1) partitions the input space into a finite number of equivalence classes such that:

each class is either accepted or rejected, but cannot be both. We also note that distinct classes do not intersect. Clearly, then, the input space is a disjoint union of these sets:
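
A rough sketch of definition 2.1 in code: each node carries two decider functions, one that may halt with accept/reject and one that routes the input to a child, so the tree partitions the input space into finitely many cells. The deciders and the tree shape here are hypothetical.

```python
# Sketch of a finite binary decision tree in the spirit of definition 2.1.
# Each node holds two "decider" functions: `halt` may accept (True), reject
# (False) or defer (None); `go_left` routes deferred inputs.  All deciders
# and the tree shape here are hypothetical.

class Node:
    def __init__(self, halt, go_left, left=None, right=None):
        self.halt, self.go_left = halt, go_left
        self.left, self.right = left, right

    def decide(self, x):
        verdict = self.halt(x)
        if verdict is not None:                     # accept or reject and halt
            return verdict
        child = self.left if self.go_left(x) else self.right
        return child.decide(x) if child else False  # missing child -> reject and halt

left_leaf = Node(lambda x: x < 0.2,        lambda x: True)
mid_leaf  = Node(lambda x: 0.4 <= x < 0.6, lambda x: True)
root = Node(lambda x: False if x >= 0.8 else None,  # reject the tail outright
            lambda x: x < 0.5, left=left_leaf, right=mid_leaf)

for x in (0.1, 0.3, 0.55, 0.9):
    print(x, root.decide(x))   # the tree partitions [0, 1) into finitely many cells
```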

3. Probability, Chaos and Learning Iterations

We demonstrate that a general iterative learning system will be 'almost surely' [6][7] chaotic [8][11].

3.1. Properties of the Learning Iteration

We need some definitions to formally present the ideas discussed before.

Definition 3.1.

Rational Mapping of a Computable Function.

Given a universal Turing machine (definition A.13), any computable function can be encoded using the symbols on the tape of the machine. The rationalisation (definition A.16) of the tape then serves as the rationalisation of the computable function.

where is the set of computable functions. This makes .

Definition 3.2.

The Computable Functional.

A computable function is called a functional iff the range of the function can be interpreted as an encoding of a computable function. That is :

Definition 3.3.

The Learning Iteration.

Given a computable functional and an input (training set), the learning iteration is defined as:

with a suitable initial computable function. Here each iterate is identified with its rationalisation (definition 3.1).
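
A rough sketch of definition 3.3: each iterate (again just a threshold, for illustration) is serialised to a digit string and mapped to a rational in [0, 1) in the spirit of definitions A.15 and A.16, so the learning process induces an orbit of rationals. The serialisation, the update rule and the base-10 alphabet are assumptions made for this example, not the construction of [11].

```python
# Sketch of the learning iteration (definition 3.3) viewed as an orbit of
# rationals: encode each iterate as a decimal string and rationalise it
# (definitions A.15-A.16, with base-10 digits as the assumed alphabet).

from fractions import Fraction

def rationalise(digits, base=10):
    """Map a digit string to a rational in [0, 1): 0.d1d2...dn in the given base."""
    r = Fraction(0)
    for i, d in enumerate(digits, start=1):
        r += Fraction(int(d), base ** i)
    return r

D = [(0.1, 1), (0.3, 1), (0.7, 0), (0.9, 0)]

def step(t):
    """Illustrative stand-in for the functional applied to the iterate f_n."""
    for x, y in D:
        t += 0.25 * (y - (1 if x < t else 0))
    return round(t, 6)

t, orbit = 0.0, []
for _ in range(5):
    t = step(t)
    encoding = f"{t:.6f}".replace(".", "").replace("-", "")   # crude symbol string for f_n
    orbit.append(rationalise(encoding))                        # x_n = r(encoding of f_n)

print(orbit)   # the rational orbit induced by the learning iteration
```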

Theorem 3.1.

Learning Iteration is Almost Surely Chaotic.

The learning iteration, as defined in definition (3.3), is almost surely chaotic.

Proof.

We note, from the theorem on chaos in computation (theorem A.1), that almost every bounded rational sequence is chaotic; since the rationalised iterates of the learning iteration form such a sequence, the result is immediate. ∎

Theorem 3.2.

Almost Sure Non-Computability of the Limit Function.

The iteration of definition (3.3), when it converges, almost surely converges to a non-computable function.

Proof.

This is trivial from real analysis. The iterates form a sequence of rationals, so to complete the space (definition A.4) we need the reals as the embedding space. Clearly, then, almost all limit points would be irrational (actually transcendental), with cardinality equal to that of the continuum, which is a well-known theorem of real analysis [2]. Irrational numbers take infinitely many symbols to encode; therefore the limit function cannot be encoded on a Turing machine tape, and hence is obviously non-computable.

This is precisely what we wanted to show. ∎

3.2. Properties of Self Similarity in Decision Tree

We start with a definition of self similarity.

Definition 3.4.

Self Similarity.

Let there be a topological space (definition A.6) and a set of non-surjective homeomorphic functions (definition A.8) indexed by a finite index set with :

We call a set self-similar if it is the only non-empty subset of the space for which the equation above holds. The space together with these functions is then called a self-similar structure.

Theorem 3.3.

Infinite Binary Decision Trees have Self Similar Structure.

A binary decision tree as defined in (2.1) is called infinite if it has countably infinitely many nodes. The accept set of such a tree exhibits a self-similar structure (definition 3.4).

Proof.

We note that every node in the tree first transforms the input space into another space, both homeomorphic to each other. After that, this space is partitioned using the deciders, which are finite in number as discussed in section (2). These individual partitions are then homeomorphic to the whole space. Suppose a fixed number gives the number of partitions. We can then say that this number of homeomorphic, non-surjective functions is available for each node. This is due to lemma (A.1).

The set of such functions is finite, because the deciders and hence the partitions are finite in number. Collect the set of all such compositions. It is then trivial that, at the root:

Therefore, infinite binary decision trees have a self-similar structure. ∎

In fact it is well known that the Cantor set [9][10][8] is a generalisation of this sort of structure (a dyadic tree).
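
To connect with the Cantor set remark, here is a small numerical sketch using the two contractions x/3 and x/3 + 2/3: a finite approximation of the set equals the union of its images under these maps, which is the self-similarity equation of definition 3.4.

```python
# The middle-thirds Cantor set as a self-similar structure: K = f1(K) ∪ f2(K)
# with f1(x) = x/3 and f2(x) = x/3 + 2/3.  We approximate K by the endpoints
# kept after `depth` removal steps and check the self-similarity equation.

def cantor_points(depth):
    pts = {0.0, 1.0}
    for _ in range(depth):
        pts = {x / 3 for x in pts} | {x / 3 + 2 / 3 for x in pts}
    return pts

K  = cantor_points(8)
K1 = {x / 3 for x in K}           # f1(K)
K2 = {x / 3 + 2 / 3 for x in K}   # f2(K)

# One level deeper, the approximation is exactly the union of the two images.
print((K1 | K2) == cantor_points(9))   # True
```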

Theorem 3.4.

Convergent ML Function would have Self Similar Structure.

If an ML iteration (definition 3.3) converges to a function, then the accept set of that function almost surely has a self-similar structure.

Proof.

Almost surely the limit function is non-computable (theorem 3.2). We note that, due to the fixed size of the learner algorithm, the number of deciders the learning system can use stays finite. Therefore, to be convergent, the structure has to be extended to infinity: that is, to an infinite binary decision tree (definition 2.1). Now, using theorem (3.3), the result is immediate. ∎

4. Chaos And Machine Learning - A Summary

The theorems proven in the preceding sections can be used to deduce very interesting properties of machine learning, which clearly showcase the problems arising from the chaotic nature of universal computation.

  1. The first phenomenon, the non-converging ML process experienced by the co-author (section 1), is clearly explained by the inherently chaotic (theorem 3.1) nature of the ML iteration. The functions, encoded in the rational space, almost surely have a chaotic orbit, and that is why the output of one iterate can drastically differ from that of the next, given the same input. However, just as we know that almost all numbers are normal and transcendental yet proving a specific number to be so is a hard problem, the same difficulty arises here: proving that a specific behaviour is chaotic can only be done on a case-by-case basis.

  2. For the second phenomenon reported in [12]: as stated clearly there, the same ML algorithm trained on a different subset of the original training set faces the same problem. This happens because in this case the ML iteration is actually converging, and it generates a fractal partition of the input space (theorem 3.4). This is precisely what they found: the input-to-output mapping is mostly discontinuous. They also noted that for each input which gets accepted there is a dense (definition A.7) set of inputs which get rejected; this is a standard property of fractal spaces [8] (a small numerical sketch of this behaviour follows this list).

    So, to summarise: when the iteration converges, the result would 'almost surely' be a fractal set. Iterations from different initial conditions would in fact converge to different indicator functions, and hence would accept different fractal sets. Moreover, such a set should in fact be dense (definition A.7) in the space.

    This clearly demonstrates the chaotic nature of the ML iteration convergence.
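
The following sketch, using the middle-thirds Cantor set as a stand-in for a converged fractal accept set, illustrates the behaviour described in item 2: a point may be accepted while arbitrarily small perturbations of it are rejected. The membership test and the chosen points are illustrative assumptions.

```python
# Sketch: near an accepted point of a Cantor-like accept set, arbitrarily
# small perturbations are rejected (the "adversarial" flavour noted above).
# Membership in the middle-thirds Cantor set is tested via base-3 digits.

def in_cantor(x, digits=30):
    """True iff the first `digits` base-3 digits of x contain no digit 1 (approximate)."""
    for _ in range(digits):
        x *= 3
        d = int(x)
        if d == 1:
            return False
        x -= d
    return True

x0 = 0.25                     # 0.25 = 0.020202... in base 3, an accepted point
print(in_cantor(x0))          # True
for eps in (1e-2, 1e-4, 1e-6, 1e-8):
    print(eps, in_cantor(x0 + eps))   # rejected points arbitrarily close to x0
```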

Finally, we end on the same note as they do: indeed, any form of computable machinery would exhibit behaviour of this kind. These chaotic behaviours are in effect universal, as demonstrated in this paper.

Appendix A Definitions Used

Definition A.1.

Fixed Point of a function.

For a function f, a point p in its domain is said to be a fixed point iff f(p) = p.

Definition A.2.

Metric Space.

A metric space is an ordered pair (M, d) where M is a set and d is a metric on M, i.e., a function:-

d : M × M → ℝ

such that for any x, y, z ∈ M, the following holds:-

  1. d(x, y) ≥ 0, and d(x, y) = 0 iff x = y.

  2. d(x, y) = d(y, x) (symmetry).

  3. d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality).

The function 'd' is also called the "distance function" or simply the "distance".

Definition A.3.

Cauchy Sequence in a Metric Space .

Given a metric space (M, d), the sequence x_1, x_2, x_3, … of points in M is called a 'Cauchy Sequence' if for every positive real number ε there is a positive integer N such that for all natural numbers m, n > N the following holds:-

d(x_m, x_n) < ε

Roughly speaking, the terms of the sequence are getting closer and closer together in a way that suggests that the sequence ought to have a limit. Nonetheless, such a limit does not always exist within the space itself.

Note that by the term 'sequence' we implicitly mean an infinite sequence, unless otherwise specified.

Definition A.4.

Complete Metric Space.

A metric space (M, d) is called complete (or a Cauchy space) iff every Cauchy sequence (definition A.3) of points in M has a limit that is also in M.

As an example of a non-complete metric space take ℚ, the set of rational numbers. Consider, for instance, a sequence of rationals, with the metric given by the standard absolute difference between points; then:-

This is a Cauchy sequence of rational numbers, but it does not converge towards any rational limit; its limit is irrational and therefore lies outside ℚ.
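
As a concrete illustration, the sketch below uses the classical Babylonian iteration, one standard choice: it produces a Cauchy sequence of rationals whose limit, √2, lies outside ℚ.

```python
# A Cauchy sequence of rationals with an irrational limit: the Babylonian
# iteration x_{n+1} = (x_n + 2/x_n) / 2 converges to sqrt(2), which is not in Q.

from fractions import Fraction

x = Fraction(1)
for n in range(5):
    x = (x + 2 / x) / 2          # each step stays rational
    print(n, x, float(x))        # rationals approaching sqrt(2) ≈ 1.41421356...
```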

The closed interval is a complete metric space which is homeomorphic (definition A.8) to .

Definition A.5.

[3] [4] Orbit.

Let f be a function from a space to itself. The sequence x, f(x), f(f(x)), …, in which each term is obtained by applying f to the previous one,

is called an orbit (more precisely, the 'forward orbit') of x.

The point x is said to have a 'closed' or 'periodic' orbit if some iterate of x under f equals x itself.
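
A minimal sketch of a forward orbit, using the logistic map at parameter 4 (a standard chaotic example, chosen only for illustration); two nearby starting points are iterated to show how their orbits separate.

```python
# Forward orbit of a point under f(x) = 4x(1 - x), the logistic map at r = 4.
# Two nearby starting points separate quickly, a hallmark of chaotic orbits.

def orbit(f, x0, n):
    xs = [x0]
    for _ in range(n):
        xs.append(f(xs[-1]))
    return xs

f = lambda x: 4 * x * (1 - x)
a = orbit(f, 0.20000, 20)
b = orbit(f, 0.20001, 20)
for n, (x, y) in enumerate(zip(a, b)):
    print(n, round(x, 6), round(y, 6), round(abs(x - y), 6))  # growing separation
```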

Definition A.6.

Topological Space.

Let the empty set be written as ∅. Let P(X) denote the power set of X, i.e. the set of all subsets of X. A topological space is a set X together with τ ⊆ P(X) satisfying the following axioms:-

  1. ∅ ∈ τ and X ∈ τ,

  2. τ is closed under arbitrary union,

  3. τ is closed under finite intersection.

The set τ is called a topology on X.

Definition A.7.

Dense Set.

Let A be a subset of a topological space X. A is dense in X if, for any point x ∈ X, any neighborhood of x contains at least one point from A.

The real numbers with the usual topology have the rational numbers as a countable dense subset.

Definition A.8.

Homeomorphism.

A function f between two topological spaces X and Y is called a homeomorphism if it has the following properties:

  1. f is a bijection (one-to-one and onto),

  2. f is continuous,

  3. the inverse function is continuous (f is an open mapping).

Lemma A.1.

Existence of Homeomorphic functions on Partitions.

Let a space be partitioned into subsets, each homeomorphic to the whole space, such that:

then there exist homeomorphic functions from the whole space onto the members of the partition.

Definition A.9.

Bounded Sequence.

A sequence {x_n} is called a bounded sequence iff:-

there exist numbers l and u such that l ≤ x_n ≤ u for all n.

The number 'l' is called a lower bound of the sequence and 'u' an upper bound of the sequence.

Lemma A.2.

Bolzano-Weierstrass.

Every bounded sequence has a convergent (Cauchy) subsequence.

It is to be noted that a bounded sequence may have many convergent subsequences (for example, a sequence consisting of a counting of the rationals has subsequences converging to every real number) or rather few (for example, a convergent sequence has all of its subsequences converging to the same limit).
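
The parenthetical claim about a counting of the rationals can be made concrete: the sketch below enumerates rationals in [0, 4] (a bounded sequence) and greedily extracts a subsequence converging to π; any other real in the interval could be targeted the same way. The enumeration order is an arbitrary choice made for the example.

```python
# A bounded enumeration of rationals has subsequences converging to any real
# in the interval: here we extract one converging to pi from rationals in [0, 4].

import math
from fractions import Fraction

def rationals_in(lo, hi, max_den):
    """A bounded enumeration (with repetition) of rationals p/q in [lo, hi]."""
    for q in range(1, max_den + 1):
        for p in range(lo * q, hi * q + 1):
            yield Fraction(p, q)

target, tol, sub = math.pi, 1.0, []
for r in rationals_in(0, 4, 500):
    if abs(float(r) - target) < tol:      # greedily pick ever-closer terms
        sub.append(r)
        tol /= 2                          # so the extracted subsequence converges to pi

print(len(sub), sub[:6], float(sub[-1]))  # terms approaching pi ≈ 3.14159...
```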

Definition A.10.

Turing Machine.

A "Turing Machine" is a 7-tuple (Q, Σ, Γ, δ, q_0, q_accept, q_reject), where:-

  1. Q is the set of states.

  2. Σ is the input alphabet, not containing the blank symbol ⊔.

  3. Γ is the tape alphabet, where ⊔ ∈ Γ and Σ ⊆ Γ.

  4. δ : Q × Γ → Q × Γ × {L, R} is the transition function.

  5. q_0 ∈ Q is the start state.

  6. q_accept ∈ Q is the accept state.

  7. q_reject ∈ Q is the reject state.

According to the standard notion q_reject ≠ q_accept, but we omit this requirement here, as we are not going to distinguish between the two different types of halting ('accept and halt' vs 'reject and halt') of Turing Machines.

A Turing Machine M (definition A.10) computes as follows.

Initially M receives the input on the leftmost squares of the tape, and the rest of the tape is filled with the blank symbol ⊔. The head starts on the leftmost square of the tape. As the input alphabet Σ does not contain the blank symbol ⊔, the first ⊔ marks the end of the input.

Once M starts, the computation proceeds according to the rules of δ. However, if the head is already at the leftmost position, then, even if the rule says move left, the head stays there.

The computation continues until the current state of the Turing Machine is either q_accept or q_reject. Otherwise, the machine will continue running forever.

Definition A.11.

Decider Turing Machine.

A Turing Machine which is guaranteed to halt on any input (i.e. to reach one of the states q_accept or q_reject) is called a decider.

Definition A.12.

Undecidable Problem.

If for a given problem it is impossible to construct a decider (definition A.11) Turing Machine, then the problem is called an undecidable problem.

Definition A.13.

Universal Turing Machine.

A 'UTM' or 'Universal Turing Machine' is a Turing Machine (definition A.10) that can simulate an arbitrary Turing machine on arbitrary input.

Definition A.14.

Church Turing Thesis.

Every effective computation can be carried out by a Turing machine (definition A.10), and hence by a Universal Turing Machine (definition A.13).

Definition A.15.

Gödelization (Gödel).

Any string from an alphabet set can be represented as an integer in a suitable base 'b'. To achieve this, create a one-one and onto Gödel map from the alphabet to digits, where,

Gödelization, then, is defined as follows:

A string of symbols can then be mapped to an integer as follows [11]:

The common decimal system is a typical example of Gödelization of the symbols 0 through 9. The binary system represents Gödelization of the symbols 0 and 1. As a far-fetched example, any string over the whole English alphabet can be written as a base-26 integer!

Definition A.16.

Rationalization.

Any string 's' of length 'n', created from an alphabet set, can be represented as a rational number. We define the rationalization in terms of Gödelization (definition A.15) as follows [11]:

By definition, the rationalization of any string lies between 0 and 1.
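
A small sketch of definitions A.15 and A.16 taken together, using the lowercase English alphabet as the symbol set (base 26): the Gödel number is the base-26 positional value of the string, and the rationalization divides by 26 raised to the string length, landing in [0, 1). The digit convention a = 0, …, z = 25 is an assumption made for illustration; [11] may use a different convention.

```python
# Gödelization (definition A.15) and rationalization (definition A.16) of a
# string over the lowercase English alphabet, treated as base-26 digits.
# The digit convention (a = 0, ..., z = 25) is an illustrative assumption.

from fractions import Fraction

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
BASE = len(ALPHABET)

def godel(s):
    """Base-26 positional value of the string s."""
    g = 0
    for ch in s:
        g = g * BASE + ALPHABET.index(ch)
    return g

def rationalise(s):
    """Gödel number scaled into [0, 1): godel(s) / BASE**len(s)."""
    return Fraction(godel(s), BASE ** len(s))

for s in ("machine", "learning"):
    print(s, godel(s), rationalise(s), float(rationalise(s)))
```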

Theorem A.1.

Bounded non repeating sequences are chaotic.

Suppose {x_n} is an infinite bounded sequence such that x_m ≠ x_n whenever m ≠ n; then {x_n} is chaotic. Given a bounded sequence, it is almost surely chaotic [11].

Theorem A.2.

Universal Computation is ‘Almost Surely’ chaotic.

The rationalization of the sequence of tape configurations of a Universal Turing machine is a bounded sequence (lying between 0 and 1), and hence 'almost surely' chaotic (theorem A.1) [11].

References

  • [1] Vapnik, V., The Nature of Statistical Learning Theory (Information Science and Statistics). Springer, 2nd ed. 2000 (softcover reprint, 2010).
  • [2] Royden, H. L., Real Analysis. Prentice Hall of India Private Limited, 2003.
  • [3] Arrowsmith, D. K. and Place, C. M., An Introduction to Dynamical Systems. Cambridge University Press.
  • [4] Brin, Michael and Stuck, Garrett, Introduction to Dynamical Systems. Cambridge University Press.
  • [5] Turing, A. M., On Computable Numbers, with an Application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, 2(42): 230-265.
  • [6] Feller, W., An Introduction to Probability Theory and Its Applications, Vol. 2, 3rd ed. New York: Wiley, 1968.
  • [7] Williams, David, Probability with Martingales. Cambridge University Press, 1991.
  • [8] Peitgen, Heinz-Otto, Jürgens, Hartmut and Saupe, Dietmar, Chaos and Fractals: New Frontiers of Science. Springer.
  • [9] Smith, Henry J. S., On the integration of discontinuous functions. Proceedings of the London Mathematical Society, Series 1, vol. 6, pages 140-153, 1874.
  • [10] Cantor, Georg, Über unendliche, lineare Punktmannigfaltigkeiten V [On infinite, linear point-manifolds (sets)]. Mathematische Annalen, vol. 21, pages 545-591, 1883.
  • [11] Mondal, Nabarun and Ghosh, Partha P., Universal Computation is 'Almost Surely' Chaotic. Theoretical Computer Science, 2014, doi:10.1016/j.tcs.2014.07.005.
  • [12] Szegedy, Christian, Zaremba, Wojciech, Sutskever, Ilya, Bruna, Joan, Erhan, Dumitru, Goodfellow, Ian and Fergus, Rob, Intriguing properties of neural networks. arXiv, http://arxiv.org/abs/1312.6199.